Learn how to use PySpark in under 5 minutes (Installation + Tutorial)
Georgios Drakos
May 13

I've found that most people find it a little difficult to get started with Apache Spark (this post focuses on PySpark) and to install it on a local machine.
With this simple tutorial you'll get there really fast!

Apache Spark is a must for big data lovers, as it is a fast, easy-to-use general engine for big data processing with built-in modules for streaming, SQL, machine learning and graph processing.
This technology is an in-demand skill for data engineers, but data scientists can also benefit from learning Spark when doing Exploratory Data Analysis (EDA), feature extraction and, of course, ML.
But please remember that Spark's potential is only truly realized when it is run on a cluster with a large number of nodes.

Table of Contents
Introduction
Spark Definition
Spark Application
Install PySpark on Mac
Open Jupyter Notebook with PySpark
Launching a SparkSession
Conclusion
References

Introduction
Apache Spark is one of the hottest and largest open source projects in data processing, with rich high-level APIs for programming languages like Scala, Python, Java and R.
It realizes the potential of bringing together both Big Data and machine learning.
This is because:
Spark is fast (up to 100x faster than traditional Hadoop MapReduce) due to in-memory operation.
It offers robust, distributed, fault-tolerant data objects (called RDDs).
It integrates beautifully with the world of machine learning and graph analytics through supplementary packages like MLlib and GraphX.
Spark is commonly deployed on top of Hadoop/HDFS and is written mostly in Scala, a functional programming language.
However, for most beginners, Scala is not a great first language to learn when venturing into the world of data science.
Fortunately, Spark provides a wonderful Python API called PySpark.
This allows Python programmers to interface with the Spark framework — letting you manipulate data at scale and work with objects over a distributed file system.
So, Spark is not a new programming language that you have to learn, but a framework that typically works on top of HDFS.
It introduces new concepts like nodes, lazy evaluation, and the transformation-action (or ‘map and reduce’) paradigm of programming.
In fact, Spark is versatile enough to work with file systems other than HDFS, such as Amazon S3 or the Databricks File System (DBFS).
Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes.
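
To make the lazy evaluation and transformation-action ideas mentioned above a bit more concrete, here is a minimal sketch in plain Python. It is only an analogy and does not use the Spark API: Python's built-in map and filter stand in for transformations, sum stands in for an action, and the names square and log are made up for this illustration.

```python
# Plain-Python analogy (not the Spark API): map/filter are lazy, much like
# Spark transformations, and consuming them plays the role of an action.
log = []  # hypothetical helper list that records when work really happens


def square(x):
    log.append(x)  # note that an element was actually processed
    return x * x


numbers = range(1, 6)

# "Transformations": nothing is computed yet, only a recipe is built.
squared = map(square, numbers)
even_squares = filter(lambda v: v % 2 == 0, squared)

work_before_action = len(log)  # still 0, no element has been touched

# "Action": asking for a result forces the whole pipeline to run.
total = sum(even_squares)

print(work_before_action, len(log), total)
```

In Spark the same split exists at a much larger scale: transformations only describe the computation, and the cluster does the work when an action such as count() is called.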

Spark Definition
Typically, when you think of a computer, you think of one machine sitting on your desk at home or at work.
This machine works perfectly well for applying machine learning to a small dataset.
However, when you have a huge dataset (gigabytes or terabytes in size), there are some things that your computer is not powerful enough to perform.
One particularly challenging area is data processing.
Single machines do not have enough power and resources to perform computations on huge amounts of information (or you may have to wait a very long time for the computation to finish).
A cluster, or group of machines, pools the resources of many machines together, allowing us to use all the cumulative resources as if they were one.
Now, a group of machines alone is not powerful; you need a framework to coordinate work across them.
Spark is a tool for just that: managing and coordinating the execution of tasks on data across a cluster of computers.

Spark Application
A Spark Application consists of:
Driver
Executors (a set of distributed worker processes)

Driver
The Driver runs the main() method of our application and has the following duties:
Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager
Responds to the user's program or input
Analyzes, schedules, and distributes work across the executors

Executors
An executor is a distributed process responsible for the execution of tasks.
Each Spark Application has its own set of executors, which stay alive for the life cycle of a single Spark application.
Executors perform all data processing of a Spark job
Store results in memory, only persisting to disk when specifically instructed by the driver program
Return results to the driver once they have been completed
Each node can have anywhere from 1 executor per node to 1 executor per core*
* A node is a single machine or server.
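
The division of labour between the driver and the executors described above can be sketched on a single machine. This is an analogy under stated assumptions, not how Spark is implemented: a thread pool from Python's standard library stands in for the executors, and the function names run_job and count_words are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Task run by one worker: count the words in one partition of lines."""
    return sum(len(line.split()) for line in partition)


def run_job(lines, num_workers=2):
    """Play the driver: split the data, schedule tasks, combine the results."""
    # Split the data into one partition per worker (round-robin).
    partitions = [lines[i::num_workers] for i in range(num_workers)]
    # The pool stands in for the executors; each one handles a partition.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Results come back to the "driver", which aggregates them.
    return sum(partial_counts)


lines = [
    "spark runs on a cluster",
    "the driver schedules work",
    "executors run tasks",
]
print(run_job(lines))
```

In a real Spark application the workers are separate processes, usually on different machines, and the cluster manager decides where they run.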

Spark’s Application Workflow
When you submit a job to Spark for processing, there is a lot that goes on behind the scenes.
1. Our Standalone Application is kicked off and initializes its SparkContext. Only after having a SparkContext can an app be referred to as a Driver.
2. Our Driver program asks the Cluster Manager for resources to launch its executors.
3. The Cluster Manager launches the executors.
4. Our Driver runs our actual Spark code.
5. Executors run tasks and send their results back to the driver.
6. The SparkContext is stopped and all executors are shut down, returning resources back to the cluster.

Install Spark on Mac (locally)
First Step: Install Brew
You will need to install brew; if you already have it, skip this step.
Open a terminal on your Mac.
You can go to Spotlight and type terminal to find it easily (alternatively, you can find it in /Applications/Utilities/).
Enter the command below.
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.
Hit Return and the script will run.
It will print a log of what it is going to install to your terminal.
Hit Return to continue or any other key to abort.
It might ask for sudo privileges.
If this happens you will have to type your admin password and hit Return again.
Note: Apple's Xcode Command Line Tools will also be installed during this step.
The installation will look like the image below.
When the installation finishes successfully, it will look like the image below.
By default, Homebrew sends anonymous analytics data.
You can find additional information here.
You can choose to opt out by running the command:
$ brew analytics off

Second Step: Install Anaconda
In the same terminal, simply type: $ brew cask install anaconda.
Please see the resources section in case you face any issue in this step.

Third and Final Step: Install PySpark
In a terminal, type $ brew install apache-spark
If you see this error message, enter $ brew cask install caskroom/versions/java8 to install Java 8; you will not see this error if you already have it installed.
Check that pyspark is properly installed by typing $ pyspark in the terminal.
If you see the output below, it means that it has been installed properly:

Open Jupyter Notebook with PySpark Ready
This section assumes that PySpark has been installed properly and that no errors appear when typing $ pyspark in a terminal.
Here, I present the steps you have to follow in order to create Jupyter Notebooks automatically initialised with a SparkContext.
In order to create a global profile for your terminal session, you will need to create or modify your shell profile file, such as .bash_profile.
Here, I will use .bash_profile as my example.
1. Check whether you have a .bash_profile on your system with $ ls -a; if you don't have one, create one using $ touch ~/.bash_profile
2. Find the Spark path by running $ brew info apache-spark
3. If you already have a .bash_profile, open it with $ vim ~/.bash_profile, press I to enter insert mode, and paste the following lines in any location (DO NOT delete anything in your file):
export SPARK_PATH=(path found above by running brew info apache-spark)
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#For python 3, you have to add the line below or you will get an error
#export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local'
4. Press ESC to exit insert mode, then enter :wq to save and exit Vim. You can find more Vim commands here.
5. Refresh the terminal profile with $ source ~/.bash_profile

My favourite way to use PySpark in a Jupyter Notebook is by installing the findspark package, which allows me to make a Spark Context available in my code.
The findspark package is not specific to Jupyter Notebook; you can use this trick in your favorite IDE too.
Install findspark by running the following command in a terminal:
$ pip install findspark

Launch a regular Jupyter Notebook and run the following code:

# Stop any Spark session left open by a previous run,
# to avoid an error if you forgot to close it
try:
    spark.stop()
except Exception:
    pass

# Use findspark to locate the Spark folder automatically
import findspark
findspark.init()

# Import Python libraries
import random

# Initialize the SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

num_samples = 100000000

def inside(p):
    # Draw a random point in the unit square and check whether
    # it falls inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = spark.sparkContext.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)

The output should be:

Please note that since Spark 2.2, a lot of people recommend simply running pip install pyspark.
I tried using pip to install pyspark, but I couldn't get the pyspark cluster to start properly.
Reading several answers on Stack Overflow and the official documentation, I came across this:
The Python packaging for Spark is not intended to replace all of the other use cases.
This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) — but does not contain the tools required to setup your own standalone Spark cluster.
You can download the full version of Spark from the Apache Spark downloads page.
Therefore, I would suggest following the steps described above.

Launching a SparkSession
A SparkSession is the main entry point for Spark functionality: it represents the connection to a Spark cluster, and you can use it to create RDDs and to broadcast variables on that cluster.
When you're working with Spark, everything starts and ends with this SparkSession.
Note that SparkSession is a new feature of Spark 2.0 which minimizes the number of concepts to remember or construct (before Spark 2.0, the three main connection objects were SparkContext, SQLContext and HiveContext).
In interactive environments, a SparkSession will already be created for you in a variable named spark.
For consistency, you should use this name when you create one in your own application.
You can create a new SparkSession through a Builder pattern which uses a “fluent interface” style of coding to build a new object by chaining methods together.
Spark properties can be passed in, as shown in this example:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master(...) \
    .config("...cores", 1) \
    .appName(...) \
    .getOrCreate()

At the end of your application, please remember to call spark.stop() in order to end the SparkSession.
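
One way to make sure spark.stop() is always called is to wrap the session in a small helper. The sketch below is a suggestion rather than part of the original setup: the function name run_with_session and the default values "example-app" and "local[1]" are assumptions chosen for illustration, while builder, master, appName, getOrCreate and stop are the standard SparkSession calls discussed in this section.

```python
def run_with_session(job, app_name="example-app", master="local[1]"):
    """Create (or reuse) a SparkSession, run job(spark), and always stop it."""
    # Imported inside the function so the helper can be defined even on a
    # machine where PySpark is not installed yet.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)      # where to run, here a single local thread
        .appName(app_name)   # name shown in the Spark UI
        .getOrCreate()
    )
    try:
        return job(spark)
    finally:
        spark.stop()  # release the session even if the job raised an error


# Example usage (needs a working local Spark installation):
# row_count = run_with_session(lambda spark: spark.range(10).count())
```

The try/finally block plays the same role as remembering to call spark.stop() by hand, but it also covers the case where the job fails halfway through.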

Let's understand the various settings that we defined above:
master: Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.
config: Sets a config option by specifying a (key, value) pair.
appName: Sets a name for the application; if no name is set, a randomly generated name will be used.
getOrCreate: Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.
If an existing SparkSession is returned, the config options specified in this builder that affect the SQLContext configuration will be applied to it. This is because SparkContext configuration cannot be modified at runtime (you have to stop the existing context first), while SQLContext configuration can.
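
To illustrate the master URL formats listed above, here is a small plain-Python sketch that describes what a given master string means. The helper describe_master is hypothetical and not part of PySpark, and it only covers the local and standalone forms mentioned in this section (plus “local[*]”, which Spark uses for all available cores); any other value is reported as not covered.

```python
import re


def describe_master(url):
    """Return a short description of a Spark master URL string."""
    if url == "local":
        return "local, 1 thread"
    if url == "local[*]":
        return "local, all cores"
    match = re.fullmatch(r"local\[(\d+)\]", url)
    if match:
        return f"local, {match.group(1)} threads"
    match = re.fullmatch(r"spark://([^:/]+):(\d+)", url)
    if match:
        return f"standalone cluster at {match.group(1)}:{match.group(2)}"
    # Other cluster managers exist but are outside this sketch.
    return "not covered by this sketch"


for url in ["local", "local[4]", "spark://master:7077"]:
    print(url, "->", describe_master(url))
```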

Conclusion
Spark has seen immense growth over the past several years.
Hundreds of contributors working collectively have made Spark an amazing piece of technology and the de facto standard for big data processing and data science across all industries.
But please remember to use it for manipulating huge datasets when you face performance issues; otherwise it may have the opposite effect.
For small datasets (a few gigabytes), it is advisable to use Pandas instead.
Thanks for reading; I look forward to hearing your questions :)
Stay tuned and Happy Machine Learning.

Originally published at https://gdcoder.com on May 13, 2019.