A Neanderthal’s Guide to Apache Spark in Python
Tutorial on Getting Started with PySpark for Complete Beginners
Evan Heitman, Jun 14

So You’ve Heard about Apache Spark
If you’re anything like me, you heard about a fancy-sounding technology called Spark and wanted to test your coding mettle to see if you could add another tool to your data-science toolkit.
Hopefully you’re not exactly like me because in my case I promptly hit an installation wall, and then a terminology wall, and then a conceptual wall, and four hours later, I hadn’t written a single line of code.
And so, after hours of scouring the internet, and more so-called “beginner’s guides” than I care to mention, I decided to write a “neanderthal’s guide” to hopefully spare you some of the hassle that I endured.

Why read this Guide?
Even a quick search online for learning material on Spark will leave you swimming in documentation, online courses (many of which are not cheap), and a menagerie of other resources.
From my experience, the majority of these either assumed I knew too much about distributed computing (like assuming I knew what distributed computing meant for example), or they gave high-level or background information without helping me understand how to actually implement anything in Spark.
With that in mind, in this guide I try to do my best to either explain a concept, or direct you somewhere else with an explanation, all with the goal of getting you writing Spark code as quickly as possible.
Because I try to do this for as many topics pertaining to Spark as I can, feel free to jump around if you already have a decent grasp of a particular topic.
I’ll also try to leave you with links to resources that I found helpful as I was diving into Spark.
This is the structure of the guide: I start by explaining some key terminology (i.e. jargon) and concepts so we can be on the same page for the rest of the material and also to lower the barrier to entry to the external resources on Spark you’ll find here and elsewhere.
Next, I walk through getting a working version of Spark running on your machine using Google Colab.
And finally I go through a use case to demonstrate how PySpark is actually implemented and what a first pass through an example problem looks like.

What You’ll Learn
- Enough terminology and concepts to be able to read other Spark resources without being perpetually confused
- A relatively painless way to get PySpark running on your computer
- How to get started with data exploration in PySpark
- Building and evaluating a basic linear regression model in PySpark
- Helpful external resources for the majority of material covered here

Key Terminology and Concepts
Here is a list of various terms and concepts that will be helpful to know as you delve into the world of Spark.

What is Spark
If you’ve Googled “what is Spark”, there’s a chance you’ve run into the following description, or something like it: “Spark is a general-purpose distributed data processing engine”.
Without a background in Spark or any familiarity with what those terms mean, that definition is rather unhelpful.
So let’s break it down:

Distributed Data/Distributed Computing: Apache Spark operates in a world that is slightly different from run-of-the-mill computer science.
When datasets get too big, or when new data comes in too fast, it can become too much for a single computer to handle.
This is where distributed computing comes in.
Instead of trying to process a huge dataset or run super computationally-expensive programs on one computer, these tasks can be divided between multiple computers that communicate with each other to produce an output.
This technology has some serious benefits, but allocating processing tasks across multiple computers has its own set of challenges and can’t be structured the same way as normal processing.
When Spark says it has to do with distributed data, this means that it is designed to deal with very large datasets and to process them on a distributed computing system.
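
To make the "divide the work, then combine the results" idea concrete, here is a toy sketch in plain Python. No Spark is involved: the "nodes" are just lists on one machine, and the function names (split_into_chunks, node_summary, distributed_mean) are made up for this illustration.

```python
# Toy illustration only: each "node" is just a Python list on one machine.

def split_into_chunks(values, n_chunks):
    """Deal the values round-robin into n_chunks lists (one per pretend node)."""
    return [values[i::n_chunks] for i in range(n_chunks)]

def node_summary(chunk):
    """Work one node can do on its own: a partial sum and a record count."""
    return sum(chunk), len(chunk)

def distributed_mean(values, n_chunks=3):
    """Combine the per-node summaries into a single overall mean."""
    summaries = [node_summary(chunk) for chunk in split_into_chunks(values, n_chunks)]
    total = sum(partial_sum for partial_sum, _ in summaries)
    count = sum(partial_count for _, partial_count in summaries)
    return total / count

values = [1, 2, 3, 4, 5, 6, 10]  # made-up numbers
print(distributed_mean(values))
```

Notice that each node reports a sum and a count rather than its own average. Averaging the per-node averages would give the wrong answer whenever the chunks are different sizes, which is a small taste of why distributed processing can't always be structured like normal processing.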
NOTE: In a distributed computing system, each individual computer is called a node and the collection of all of them is called a cluster.
(Comic from xkcd)
Further Reading: Introduction to distributed computing (8 min read)

Processing Engine/Processing Framework: A processing engine, sometimes called a processing framework, is responsible for performing data processing tasks (an illuminating explanation, I know).
A comparison is probably the best way to understand this.
Apache Hadoop is an open source software platform that also deals with “Big Data” and distributed computing.
Hadoop has a processing engine, distinct from Spark, called MapReduce.
MapReduce has its own particular way of optimizing tasks to be processed on multiple nodes and Spark has a different way.
One of Spark’s strengths is that it is a processing engine that can be used on its own, or used in place of Hadoop MapReduce, taking advantage of the other features of Hadoop.
(Image from Brad Anderson)
Further Reading: Processing Engines explained and compared (~10 min read)

General-Purpose: One of the main advantages of Spark is how flexible it is, and how many application domains it has.
It supports Scala, Python, Java, R, and SQL.
It has a dedicated SQL module, it is able to process streamed data in real-time, and it has both a machine learning library and graph computation engine built on top of it.
All these reasons contribute to why Spark has become one of the most popular processing engines in the realm of Big Data.
Spark Functionality (from Databricks.com)
Further Reading: 5 minute guide to understanding the significance of Spark (probably more like ~10 min read)

Distributed Computing Terms
Partitioned Data: When working with a computer cluster, you can’t just throw in a vanilla dataframe and expect it to know what to do.
Because the processing tasks will be divided across multiple nodes, the data also has to be able to be divided across multiple nodes.
Partitioned data refers to data that has been optimized to be able to be processed on multiple nodes.
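
Here is a rough sketch of one way data can be divided up: by hashing a key so that records with the same key always land in the same partition. This is plain Python for illustration, not Spark's own partitioner; the function name partition_by_key and the sample rows are made up.

```python
import zlib

def partition_by_key(rows, key, num_partitions):
    """Assign each row to a partition based on a stable hash of row[key]."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used instead of hash() so the result is the same on every run
        bucket = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return partitions

# Made-up sample rows
rows = [{"Platform": p} for p in ["PS2", "DS", "PS2", "Wii", "DS"]]
for index, partition in enumerate(partition_by_key(rows, "Platform", 3)):
    print(index, partition)
```

Each partition could then be handed to a different node, and any per-platform calculation would only ever need to look inside one partition.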
Further Reading: Explanation of Data Partitioning (2 min read)

Fault Tolerance: In short, fault tolerance refers to a distributed system’s ability to continue working properly even when a failure occurs.
A failure could be a node bursting into flames for example, or just a communication breakdown between nodes.
Fault tolerance in Spark revolves around Spark’s RDDs (which will be discussed later).
Basically, the way data storage is handled in Spark allows Spark programs to function properly despite occurrences of failure.
Further Reading: How is Spark fault tolerant (~1 min read)

Lazy Evaluation: Lazy evaluation, or lazy computing, has to do with when code is evaluated.
Under strict (non-lazy) evaluation, each expression is evaluated as soon as it is encountered.
A lazy system, on the other hand, doesn’t evaluate expressions as it goes, but rather waits until it is actually told to generate a result, and then performs the evaluation all at once.
So as it reads code, it keeps track of everything it will eventually have to evaluate (in Spark this kind of evaluation log, so to speak, is called a lineage graph), and then whenever it is prompted to return something, it performs evaluations according to what it has in its evaluation log.
This is useful because it makes programs more efficient, as nothing that isn’t actually used ever has to be evaluated.
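
Python itself has a small-scale version of this behavior, which makes for a handy analogy. In the sketch below (plain Python, not Spark), map() and filter() only describe work, and nothing runs until a result is requested. The evaluated list and the expensive_square function are just illustrative helpers.

```python
evaluated = []  # records which inputs were actually processed

def expensive_square(x):
    evaluated.append(x)
    return x * x

numbers = range(1, 1_000_001)

# Nothing is computed here: map() and filter() only describe the work to do.
squares = map(expensive_square, numbers)
even_squares = filter(lambda value: value % 2 == 0, squares)

# Asking for a result triggers just enough evaluation to produce it.
first_even_square = next(even_squares)
print(first_even_square)  # 4
print(evaluated)          # [1, 2]
```

Only two of the million inputs were ever squared, because that was all it took to answer the question that was asked.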
Further Reading: What is Lazy Evaluation (4 min read)

Spark Terms
RDDs, DataFrames, DataSets, Oh My!
Spark RDDs (Resilient Distributed Datasets) are data structures that are the core building blocks of Spark.
An RDD is an immutable, partitioned collection of records. That means it can hold values, tuples, or other objects; these records are partitioned so they can be processed on a distributed system; and once an RDD has been made, it is impossible to alter it.
That basically sums up its acronym: they are resilient due to their immutability and lineage graphs (which will be discussed shortly), they can be distributed due to their partitions, and they are datasets because, well, they hold data.
A crucial thing to note is that RDDs do not have a schema, which means that they do not have a columnar structure.
Records are just recorded row-by-row, and are displayed similar to a list.
Enter Spark DataFrames.
Not to be confused with Pandas DataFrames, as they are distinct, Spark DataFrames have all of the features of RDDs but also have a schema.
This will make them our data structure of choice for getting started with PySpark.
Spark has another data structure, Spark DataSets.
These are similar to DataFrames but are strongly-typed, meaning that the type is specified upon the creation of the DataSet and is not inferred from the type of records stored in it.
This means DataSets are not used in PySpark because Python is a dynamically-typed language.
For the rest of these explanations I’ll be referring to RDDs, but know that what is true for an RDD is also true for a DataFrame; DataFrames are just organized into a columnar structure.
Further Reading: RDDs, DataFrames, & DataSets compared (~5 min read)
Further Reading: Pandas v. Spark DataFrames (4 min read)
Further Reading: Helpful RDD Documentation (~5 min read)

Transformations: Transformations are one of the things you can do to an RDD in Spark.
They are lazy operations that create one or more new RDDs.
It’s important to note that Transformations create new RDDs because, remember, RDDs are immutable so they can’t be altered in any way once they’ve been created.
So, in essence, a Transformation takes an RDD as input, performs some function on it based on which Transformation is being called, and outputs one or more RDDs.
Recalling the section on lazy evaluation, as Spark comes across each Transformation, it doesn’t actually build any new RDDs, but rather constructs a chain of hypothetical RDDs that would result from those Transformations, which will only be evaluated once an Action is called.
This chain of hypothetical, or “child”, RDDs, all connected logically back to the original “parent” RDD, is what a lineage graph is.
Further Reading: Helpful Transformation Documentation (~2 min read)
Further Reading: More in-depth Documentation (5–10 min read; Transformations in first half)

Actions: An Action is any RDD operation that does not produce an RDD as an output.
Some examples of common Actions are doing a count of the data, or finding the max or min, or returning the first element of an RDD, etc.
As was mentioned before, an Action is the cue for Spark to evaluate the lineage graph and return the value specified by the Action.
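
The relationship between Transformations and Actions can be sketched with a tiny stand-in class. To be clear, ToyRDD is invented for this guide and is not Spark's real RDD class; it runs on one machine and only mimics the idea that Transformations record steps while Actions replay them.

```python
class ToyRDD:
    """A tiny, single-machine stand-in for an RDD (not Spark's real class)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # parent data, never modified
        self._lineage = tuple(lineage)  # the Transformations recorded so far

    # Transformations: return a NEW ToyRDD and do no work yet.
    def map(self, func):
        return ToyRDD(self._source, self._lineage + (("map", func),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _evaluate(self):
        """Replay the recorded lineage over the parent data."""
        records = iter(self._source)
        for kind, func in self._lineage:
            records = map(func, records) if kind == "map" else filter(func, records)
        return records

    # Actions: trigger evaluation and return a plain value, not a ToyRDD.
    def count(self):
        return sum(1 for _ in self._evaluate())

    def first(self):
        return next(self._evaluate())

sales = ToyRDD([82.5, 40.2, 35.5, 32.8, 31.4])  # made-up numbers
big_sellers = sales.filter(lambda s: s > 35).map(lambda s: s * 2)

print(big_sellers.count())  # 3
print(big_sellers.first())  # 165.0
print(sales.count())        # 5, the parent is untouched
```

The parent is never altered: every Transformation hands back a new object that simply remembers one more step.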
Further Reading: Helpful Action Documentation (~1 min read)
Further Reading: More in-depth Documentation (~5 min read; Actions in second half)

Lineage Graph: Most of what a lineage graph is was described in the Transformations and Actions sections, but to summarize, a lineage graph outlines what is called a “logical execution plan”.
What that means is that Spark begins with the earliest RDDs that aren’t dependent on any other RDDs, and follows a logical chain of Transformations until it ends with the RDD that an Action is called on.
This feature is primarily what drives Spark’s fault tolerance.
If a node fails for some reason, all the information about what that node was supposed to be doing is stored in the lineage graph, which can be replicated elsewhere.
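
As a rough sketch of why a recorded lineage helps with failures, the plain-Python example below rebuilds a lost partition by replaying the same steps on its source data. It is only an analogy: the run_lineage helper, the dictionaries, and the numbers are all made up, and real Spark handles this internally.

```python
def run_lineage(source_partition, lineage):
    """Replay every recorded step, in order, on one partition of source data."""
    records = list(source_partition)
    for step in lineage:
        records = step(records)
    return records

# Source data split into two partitions, plus the recorded steps (the lineage).
source = {0: [1, 2, 3], 1: [4, 5, 6]}
lineage = [
    lambda records: [r * 10 for r in records],       # a map-like step
    lambda records: [r for r in records if r > 10],  # a filter-like step
]

results = {pid: run_lineage(part, lineage) for pid, part in source.items()}
print(results)  # {0: [20, 30], 1: [40, 50, 60]}

# Pretend the node holding partition 1's results burst into flames.
del results[1]

# Nothing is truly lost: replay the lineage for just that partition.
results[1] = run_lineage(source[1], lineage)
print(results)  # {0: [20, 30], 1: [40, 50, 60]}
```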
Visualization of example lineage graph; r00, r01 are parent RDDs, r20 is final RDD (from Jacek Laskowski)
Further Reading: Helpful Lineage Documentation (~2 min read)

Spark Applications and Jobs: There is a lot of nitty gritty when it comes to how a processing engine like Spark actually executes processing tasks on a distributed system.
The following is just as much as you’ll need to know in order to have a working understanding of what certain snippets of Spark code do.
In Spark, when an item of processing has to be done, there is a “driver” process that is in charge of taking the user’s code and converting it into a set of multiple tasks.
There are also “executor” processes, each operating on a separate node in the cluster, that are in charge of running the tasks, as delegated by the driver.
Each driver process has a set of executors that it has access to in order to run tasks.
A Spark application is a user built program that consists of a driver and that driver’s associated executors.
A Spark job is a task or set of tasks to be executed with executor processes, as directed by the driver.
A job is triggered by the calling of an RDD Action.
This stuff can be rather confusing, so don’t sweat it if it doesn’t make total sense at first. It’s just helpful to be familiar with these terms when they are implemented in code later on.
I’ve included extra resources on this topic if you want more information.
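
If it helps, here is a loose analogy in plain Python for the driver and executor roles. A thread pool stands in for the executors and an ordinary function stands in for the driver; the names driver and task are made up, and real Spark executors are separate processes on separate nodes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    """One task: count the records in a single partition."""
    return len(partition)

def driver(partitions, num_executors=2):
    """Turn one job (a count) into one task per partition and gather the results."""
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(task, partitions))
    return sum(partial_counts)

# Three made-up partitions holding five records in total
partitions = [["row1", "row2"], ["row3"], ["row4", "row5"]]
print(driver(partitions))  # 5
```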
Visualization of Spark Architecture (from Spark API)
Further Reading: Cluster Mode Overview from Spark API (~3 min read)
Further Reading: Helpful Answer on StackOverflow (~2 min read)
Further Reading: Spark Application Overview on Cloudera (~2 min read)

Phew, you made it through all the terminology and concepts! Now let’s get into implementation!

Installing Spark
That heading might be a bit of a misnomer, because, strictly speaking, this guide won’t show you how to install Apache Spark.
Installing Spark can be a pain in the butt.
For one, writing Spark applications can be done in multiple languages and each one is installed slightly differently.
The underlying API for Spark is written in Scala but PySpark is an overlying API for implementation in Python.
For data science applications, using PySpark and Python is widely recommended over Scala, because it is relatively easier to implement.
And so instead of installing PySpark, this guide will show you how to run it in Google Colab.

Google Colab
When I was trying to get PySpark running on my computer, I kept getting conflicting instructions on where to download it from (it can be downloaded from spark.apache.org or pip installed, for example), what to run it in (it can be run in Jupyter Notebooks or in the native pyspark shell in the command line), and there were numerous obscure bash commands sprinkled throughout.
As a data scientist, my reaction to bash commands that aren’t pip installs is generally a mix of disgust and despair, and so I turned to Google Colab.
Google Colab is a really powerful interactive python notebook (.ipynb) tool that has a lot of data science libraries pre-installed.
For more information on what it is and how to run it check out this super helpful article (8 min read).
Once you’ve got a Colab notebook up, to get Spark running you have to run the following block of code (I know it’s not my fault, but I apologize for how ugly it is).

    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q https://www-us.…tgz
    !tar xf spark-2.…tgz
    !pip install -q findspark
    import os
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark-2.…"
    import findspark
    findspark.init()

NOTE: When I first ran this block of code it did not run.
It was because there had been a new version of Spark released since the code I found was written and I was trying to access an older version of Spark that couldn’t be found.
So if the above code doesn’t run, double check this website to see what the latest version of Spark is and replace the version number everywhere it appears in the above snippet with whatever the newest version is.
Basically what this block of code does is download the right versions of Java (Spark uses some Java) and Spark, point the JAVA_HOME and SPARK_HOME environment variables at those versions, and initialize Spark in your notebook.
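
Since a wrong version number in a path is such an easy mistake to make, a quick sanity check can save some head-scratching. The helper below is my own suggestion rather than part of any Spark tooling (the name missing_spark_paths is made up); it only checks that the two environment variables point at directories that actually exist.

```python
import os

def missing_spark_paths(environ=os.environ):
    """Return the names of required variables that are unset or not directories."""
    required = ("JAVA_HOME", "SPARK_HOME")
    return [name for name in required if not os.path.isdir(environ.get(name, ""))]

problems = missing_spark_paths()
if problems:
    print("Check these before initializing Spark:", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point at existing directories.")
```

If SPARK_HOME shows up in the list, the most likely culprit is a folder name that doesn't match the version that was actually downloaded.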
If you want to use Spark on another platform besides Colab, here are the most helpful guides I found (in order of helpfulness); hopefully one of them is able to get you going:
- Installation Resource: Getting Started with PySpark and Jupyter
- Installation Resource: How to use PySpark on your computer
- Installation Resource: How to install PySpark locally
- Installation Resource: How to Get Started with PySpark

Coding in PySpark
Because we want to be working with columnar data, we’ll be using DataFrames, which are a part of Spark SQL.
NOTE: To avoid possible confusion, despite the fact that we will be working with Spark SQL, none of this will be SQL code.
You can write SQL queries when working with Spark DataFrames but you don’t have to.

Configuring a SparkSession
The entry point to using Spark SQL is an object called SparkSession.
It initiates a Spark Application which all the code for that Session will run on.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName(…) \
        .getOrCreate()

NOTE: the “\” character in this context is called a continuation character, which is just a helpful wrapping tool for making long lines of code more readable.
.builder: gives access to the Builder API, which is used to configure the session.
.master(): determines where the program will run; "local[*]" sets it to run locally on all cores, but you can use "local" to run on one core, for example.
In this case, our programs will be run on Google’s servers.
.appName(): optional method to name the Spark Application.
.getOrCreate(): gets an existing SparkSession or creates a new one if none exists.
Check the Builder API for more options when building a SparkSession.

Loading Data
To open a local file on Google Colab you need to run the following code, which will prompt you to select a file from your computer:

    from google.colab import files
    files.upload()

For this guide we’ll be working with a dataset on video game sales from Kaggle.
It can be found here.
Now load our data into a Spark DataFrame using the .read.csv() function (I shortened the file name for brevity’s sake):

    data = spark.read.csv('….csv', inferSchema=True, header=True)

NOTE: This function is specifically for reading CSV files into a DataFrame in PySparkSQL.
It won’t work for loading data into an RDD and different languages (besides Python) have different syntax.
Exercise caution when searching online for help because many resources do not assume Spark SQL or Python.

Data Exploration
Now let’s move into understanding how we can get more familiar with our data!
The first thing we can do is check the shape of our DataFrame.
Unlike Pandas, there is no dedicated method for this, but we can use .count() and .columns to retrieve the information ourselves.

    data.count(), len(data.columns)
    >>> (16719, 16)

The .count() method returns the number of rows in the DataFrame and .columns returns a list of column names.
NOTE: We don’t have to actually print them because Colab will automatically display the last output of each cell.
If you want to show more than one output you will have to print them (unless you use this workaround, which is super nice and works in Jupyter Notebooks as well).

Viewing DataFrames
To view a DataFrame, use the .show() method:

    data.show(5)

Output from data.show(5)
As you can see, running data.show(5) displayed the first 5 rows of our DataFrame, along with the header.
Calling .show() with no parameters will display the first 20 records.
Let’s see what our data is comprised of using the .printSchema() method (alternatively you can use .dtypes):

    data.printSchema()

Output from data.printSchema()
One takeaway from this output is that Year_of_Release and User_Score have a type of string, despite them being numbers.
It also tells us that each of the columns allows null values which can be seen in the first 5 rows.
We can also selectively choose which columns we want to display with the .select() method.
Let’s view only Name, Platform, User_Score, and User_Count:

    data.select("Name", "Platform", "User_Score", "User_Count").show(15, truncate=False)

Output from data.select().show()
Included is the truncate=False parameter that adjusts the size of columns to prevent values from being cut off.

Summary Statistics/Information
We can use the .describe() method to get summary statistics on columns of our choosing:

    data.describe(["User_Score", "User_Count"]).show()

Output from data.describe().show()
One takeaway from this output is that there seems to be a strange “tbd” value in the User_Score column.
The count for User_Score is also higher than User_Count but it’s hard to tell if that’s because there are actually more values in User_Score or if “tbd” values are artificially raising the count.
We’ll learn how to filter those values out later on.
We might also want to get some information on what kinds of platforms are in the Platform column and how they are distributed.
We can use a groupBy() for this and sort it using .orderBy():

    data.groupBy("Platform") \
        .count() \
        .orderBy("count", ascending=False) \
        .show(10)

Output from data.groupBy().orderBy().show()
Here we’re looking at the top 10 most frequent platforms.
We can tell this dataset is pretty old because I don’t see PS4 anywhere.

Filtering DataFrames
Let’s create a new DataFrame that has the null values for User_Score and User_Count, and the “tbd” values, filtered out using the .filter() method:

    condition1 = (data.User_Score.isNotNull()) & (data.User_Count.isNotNull())
    condition2 = data.User_Score != "tbd"
    data = data.filter(condition1).filter(condition2)

condition1 returns True only for records that have a null value in neither User_Score nor User_Count.
condition2 returns True for any record that does not have “tbd” in User_Score.
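
When combining null checks like these, it matters whether the conditions are joined with AND (& in PySpark) or OR (| in PySpark). Here is a plain-Python sketch of the difference on a few made-up rows, assuming the goal is to keep only rows where both columns are filled in:

```python
# Made-up rows; None plays the role of a null value.
rows = [
    {"User_Score": "8", "User_Count": 322},
    {"User_Score": None, "User_Count": None},
    {"User_Score": "tbd", "User_Count": None},
    {"User_Score": "7.5", "User_Count": None},
]

def not_null(value):
    return value is not None

# OR keeps a row if EITHER column is filled in, so half-empty rows slip through.
either = [r for r in rows if not_null(r["User_Score"]) or not_null(r["User_Count"])]

# AND keeps a row only if BOTH columns are filled in.
both = [r for r in rows if not_null(r["User_Score"]) and not_null(r["User_Count"])]

# Then drop the "tbd" placeholder scores.
clean = [r for r in both if r["User_Score"] != "tbd"]

print(len(either))  # 3
print(len(both))    # 1
print(clean)        # [{'User_Score': '8', 'User_Count': 322}]
```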
We can double check to see if our filtering worked by reconstructing our earlier visualizations.
Reconstructed summary statistics and DataFrame with filtered out values
That’s enough Data Exploration to get started, now let’s build a model!

Building Models in PySpark
Building models in PySpark looks a little different than you might be used to, and you’ll see terms like Transformer, Estimator, and Param.
This guide won’t go in-depth into what those terms mean, but below is a link to a brief description of them.
Further Reading: Machine Learning in Spark (~5–10 min read)

Setup
For an example of linear regression, let’s see if we can predict User_Score from Year_of_Release, Global_Sales, Critic_Score, and User_Count.
First let’s recode all of our predictors to be Doubles (I found that this got rid of some really gnarly errors later on).

    from pyspark.sql.types import DoubleType
    data2 = data2.withColumn("Year_of_Release", data2["Year_of_Release"].cast(DoubleType()))
    data2 = data2.withColumn(….cast(DoubleType()))
    data2 = data2.withColumn(….cast(DoubleType()))
    data2 = data2.withColumn(….cast(DoubleType()))

We use the method withColumn(), which either creates a new column or replaces one that already exists.
So, for example, the Year_of_Release column is replaced with a version of itself that has been cast as doubles.

VectorAssembler
The next step is to get our data into a form that PySpark can create a model with.
To do this we use something called a VectorAssembler.

    from pyspark.ml.feature import VectorAssembler
    inputcols = ["Year_of_Release", "Global_Sales", "Critic_Score", "User_Count"]
    assembler = VectorAssembler(inputCols=inputcols, outputCol="predictors")

Here we’ve delineated what features we want our model to use as predictors so that VectorAssembler can take those columns and transform them into a single column (named “predictors”) that contains all the data we want to predict with.

    predictors = assembler.transform(…)
    predictors.columns

Output from VectorAssembler.transform().columns
What VectorAssembler.transform() does is create a new DataFrame with a new column at the end where each row contains a list of all the features we included in the inputCols parameter when we created the assembler.
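
To picture what that new column looks like, here is a plain-Python sketch of the same idea on one made-up row. The assemble function is invented for illustration and is not how VectorAssembler is implemented; it just shows several columns being gathered into one.

```python
def assemble(rows, input_cols, output_col):
    """Return new rows with the input_cols values gathered into one list column."""
    return [{**row, output_col: [row[col] for col in input_cols]} for row in rows]

input_cols = ["Year_of_Release", "Global_Sales", "Critic_Score", "User_Count"]

# One made-up record
rows = [{
    "Year_of_Release": 2000.0,
    "Global_Sales": 1.5,
    "Critic_Score": 80.0,
    "User_Count": 100.0,
    "User_Score": 7.5,
}]

assembled = assemble(rows, input_cols, "predictors")
print(assembled[0]["predictors"])  # [2000.0, 1.5, 80.0, 100.0]
print(list(assembled[0].keys()))   # the original columns plus "predictors" at the end
```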
The final step to getting our data ready to be used in a model is to collect the new predictors column we just made and User_Score (our target variable) by themselves in a DataFrame.

    model_data = predictors.select("predictors", "User_Score")
    model_data.show(5, truncate=False)

The final data we will use to build a model
Next is to split model_data into a training and testing set:

    train_data, test_data = model_data.randomSplit([…2])

Model Training
Now to train the model!

    from pyspark.ml.regression import LinearRegression
    lr = LinearRegression(featuresCol='predictors', labelCol='User_Score')
    lrModel = lr.fit(train_data)
    pred = lrModel.evaluate(test_data)

After importing LinearRegression from pyspark.ml.regression, we construct a regressor and we specify that it should look for a column named “predictors” as the features of the model and a column named “User_Score” as the labels of the model.
Next we train it with .fit() and finally produce predictions with .evaluate().
We can access the parameters of our model with the following code:

    lrModel.…
    >>> …7985254457876

We can also view the final predictions our model made:

    pred.predictions.show(5)

Model predictions
The object named “pred” is a LinearRegressionSummary object, and so to retrieve the DataFrame with predictions we call .predictions.show().

Model Evaluating
To get more detailed information on how our model performed, we can use RegressionEvaluator, which is constructed like this:

    from pyspark.ml.evaluation import RegressionEvaluator
    eval = RegressionEvaluator(labelCol="User_Score", predictionCol="prediction", metricName="rmse")

Let’s compute some statistics for the model:

    rmse = eval.evaluate(pred.predictions)
    mse = eval.evaluate(pred.predictions, {eval.metricName: "mse"})
    mae = eval.evaluate(pred.predictions, {eval.metricName: "mae"})
    r2 = eval.evaluate(pred.predictions, {eval.metricName: "r2"})

Which returns

    rmse
    >>> 1.125
    mse
    >>> …
    mae
    >>> …
    r2
    >>> 0.386

From this we can interpret that our model tended to be about 1.125 rating points off from the actual User_Score (according to rmse).
The r² value tells us that the predictors in our model are able to account for a little under 40% of the total variability in User_Score.
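
If these four metrics are new to you, it can help to see them computed by hand. The sketch below uses plain Python and a handful of made-up scores; the regression_metrics function is my own illustration of the standard formulas, not Spark's implementation.

```python
import math

def regression_metrics(labels, predictions):
    """Compute rmse, mse, mae and r2 from paired actual and predicted values."""
    n = len(labels)
    errors = [actual - predicted for actual, predicted in zip(labels, predictions)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    # Total variability of the labels (assumes the labels are not all identical)
    total_ss = sum((actual - mean_label) ** 2 for actual in labels)
    r2 = 1 - squared_error_sum / total_ss
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": r2}

# Made-up actual scores and model predictions
labels = [8.0, 6.0, 7.0, 9.0]
predictions = [7.0, 7.0, 7.0, 8.0]

metrics = regression_metrics(labels, predictions)
print(metrics)
```

In this toy case mse and mae both come out to 0.75, and r2 lands at about 0.4, meaning the predictions account for roughly 40% of the variability in the made-up scores.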
This was just a first pass look, and I recommend playing around with the model parameters and features for more practice!
Further Reading: Detailed Code example of Linear Regression (~20+ min to go through the whole thing)
Further Reading: Detailed Code example of Logistic Regression using SQL (~10 minutes)
Further Reading: Example of Linear Regression, Decision Tree, and Gradient-Boosted Tree Regression (6 min read)
This is just the tip of the iceberg as far as the kind of modeling you can do in PySpark, but I hope this guide has equipped you with enough to get your foot into the door of Big Data!

Conclusion
Wow.
Props to you if you made it all the way to the end.
You were exposed to a ton of new concepts, from the terminology of distributed computing and Spark, to implementing data exploration and data modeling techniques in PySpark.
My hope is that this guide can be a resource for you as you continue working with Spark!
All the code used in this article can be found on GitHub here.