Training Your First Classifier with Spark and Scala
Jeremy Miller, Feb 27
Many people begin their machine learning journey using Python and Sklearn.
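If you want to work with big data, Apache Spark is one of the most widely used tools.
It is possible to work with Spark in Python using PySpark.
However, since Spark is written in Scala, you can see better performance by using Scala, particularly for RDD operations and user-defined functions, where Python has to move data between the JVM and the Python interpreter.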
There are numerous tutorials out there about how to get up and running with Spark on your computer, so I won’t go into that.
I will only suggest two quick ways to get started: a Docker image or the Community Edition of Databricks.
Let’s get started!
I prefer to use the spark-shell and start it with the color option enabled.
Some imports help with file navigation while inside the spark-shell.
Next come all of our imports.
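As a rough sketch of the imports described above: the exact list is an assumption, written against Spark 2.4, where the one-hot encoder class is still called OneHotEncoderEstimator. Using `sys.process._` to run shell commands such as `ls` from the REPL is also an assumption about how the file navigation was done.

```scala
// Shell commands from inside the REPL (assumed approach to file navigation).
import sys.process._

// DataFrame-based API (spark.ml).
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{Imputer, OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// RDD-based API (spark.mllib), used here only for evaluation metrics.
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}

// List the current directory; `!!` returns the command's output as a String.
val listing: String = "ls".!!
println(listing)
```

The two `evaluation` packages are where the apparent repetition comes from: one belongs to `spark.ml`, the other to `spark.mllib`.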
The imports include some repetition because some of the functionality in the Spark RDD API has not yet been ported over to the newer Spark DataFrame API.
If you are not using the spark-shell you may need some additional imports.
The spark-shell automatically creates a SparkContext as “sc” and a SparkSession as “spark”.
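Outside the spark-shell, those two objects have to be created by hand. A minimal sketch, assuming a local run; the application name and the `local[*]` master are placeholders, not values from the original post:

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a session; inside the spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .appName("first-classifier") // hypothetical name
  .master("local[*]")          // assumption: run locally on all cores
  .getOrCreate()

// The SparkContext hangs off the session.
val sc = spark.sparkContext

// Enables .toDF on local collections and the $"col" syntax.
import spark.implicits._

println(s"Running Spark ${spark.version}")
```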
Now we can load in the data.
I’m using the Harvard EdX dataset as an example.
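Loading a CSV into a DataFrame might look like the sketch below. The real file path is not given in the text, so the example writes a tiny stand-in file with made-up rows; the column names `certified`, `gender`, `nevents` and `ndays_act` are assumptions about the dataset's layout.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("load-sketch")
  .master("local[*]")
  .getOrCreate()

// Stand-in CSV with made-up rows so the sketch runs anywhere.
// Replace tmp.toString with the path to the real dataset.
val tmp = Files.createTempFile("edx_sample", ".csv")
Files.write(tmp, "certified,gender,nevents,ndays_act\n1,f,120,14\n0,m,,2\n".getBytes("UTF-8"))

val df = spark.read
  .option("header", "true")      // first row holds column names
  .option("inferSchema", "true") // let Spark guess column types
  .csv(tmp.toString)

df.printSchema()
df.show()
```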
NB: I’m not going to do much feature engineering because I want to focus on the mechanics of training the model in Spark.
In the end I will have a classifier that predicts whether a student passes a course based on data accumulated throughout the entire course.
It had better have a good score!
Creating a useful model would require featurizing the data so that the model trains only on what you would know at the time you wanted to make the prediction.
With that clarified, here we go.
By default, a Spark model expects two columns: “label” and “features”.
To get there will take a few steps.
First we will identify our label using the select method while also keeping only relevant columns (see the caveat above about feature engineering).
Putting the entire method call in a set of parentheses allows you to break up the lines arbitrarily without the spark-shell evaluating the expression before it is complete.
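A sketch of that select step, using made-up rows. Treating a `certified` column as the pass/fail label is an assumption, as is the choice of which columns to keep.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("label-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up rows standing in for the real data.
val df = Seq(
  ("c1", 1, "f", 120),
  ("c2", 0, "m", 3)
).toDF("course_id", "certified", "gender", "nevents")

// The outer parentheses keep the REPL reading until the closing parenthesis,
// so the chained calls can start on new lines.
val labeled = (df
  .select(
    col("certified").cast("double").as("label"), // assumed label column
    col("gender"),
    col("nevents"))
)

labeled.show()
```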
Next we will do some one-hot encoding on our categorical features.
This takes a few steps.
First we have to use the StringIndexer to convert the strings to integers.
Then we have to use the OneHotEncoderEstimator to do the encoding.
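The two encoding steps could be sketched as follows, assuming Spark 2.3 or 2.4. In Spark 3.0 and later the class is named OneHotEncoder instead of OneHotEncoderEstimator. The `gender` column and the rows are made up for illustration.

```scala
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("onehot-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up categorical column.
val df = Seq("f", "m", "f", "o").toDF("gender")

// Step 1: map each distinct string to a numeric index.
val indexer = new StringIndexer()
  .setInputCol("gender")
  .setOutputCol("gender_idx")
val indexed = indexer.fit(df).transform(df)

// Step 2: turn each index into a sparse one-hot vector.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("gender_idx"))
  .setOutputCols(Array("gender_vec"))
val encoded = encoder.fit(indexed).transform(indexed)

encoded.show(truncate = false)
```

Both classes are estimators, so each one is fit on the data first and then used to transform it.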
Next we check for null values.
In this dataset I was able to find the number of null values through some relatively simple code, though depending on the data it may be more complicated.
After checking the columns, I decided to impute the null values of the following columns using the median value of that column: nevents, ndays_act, nplay_video, nchapters.
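One way to count nulls per column and then fill them with the median is sketched below. Using Spark's `Imputer` for the median step is an assumption about the approach; the rows are made up, and only two of the four columns are shown. The columns are doubles because Spark 2.x's `Imputer` accepts only float or double input, and its median is computed approximately.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

val spark = SparkSession.builder()
  .appName("null-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up rows; None becomes a null in the DataFrame.
val df = Seq[(Option[Double], Option[Double])](
  (Some(1.0), Some(2.0)),
  (None, Some(4.0)),
  (Some(3.0), None),
  (Some(5.0), Some(6.0))
).toDF("nevents", "ndays_act")

// One output column per input column, holding that column's null count.
val nullCounts = df.select(
  df.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).as(c)): _*
)
nullCounts.show()

// Fill nulls with each column's (approximate) median, writing to new columns.
val imputer = new Imputer()
  .setInputCols(df.columns)
  .setOutputCols(df.columns.map(_ + "_imputed"))
  .setStrategy("median")
val imputed = imputer.fit(df).transform(df)
imputed.show()
```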
Then we use the VectorAssembler object to construct our “features” column.
Remember, Spark models expect two columns by default: “label” and “features”.
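Assembling the feature vector might look like this sketch. The input column names and rows are made up; in practice the list would also include the one-hot vector columns.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("assembler-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up, already-cleaned rows.
val prepared = Seq(
  (1.0, 120.0, 14.0),
  (0.0, 3.0, 1.0)
).toDF("label", "nevents", "ndays_act")

// Combine the numeric columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("nevents", "ndays_act"))
  .setOutputCol("features")

// Keep only the two columns the model reads by default.
val modelData = assembler.transform(prepared).select("label", "features")
modelData.show(truncate = false)
```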
Now we split the data into training and test sets.
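A sketch of the split using `randomSplit`. The 80/20 ratio and the seed are assumptions, and the rows are generated only so the example runs on its own.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("split-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// 100 made-up rows.
val data = (0 until 100)
  .map(i => ((i % 2).toDouble, i.toDouble))
  .toDF("label", "x")

// The weights are normalized by Spark; the seed makes the split repeatable.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

println(s"train: ${train.count()}, test: ${test.count()}")
```

The split sizes are only approximately 80/20, since each row is assigned at random.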
The data is set up!
Now we can create a model object (I’m using a Random Forest Classifier), define a parameter grid (I kept it simple and only varied the number of trees), create a CrossValidator object (this is where we set the scoring metric for training the model), and fit the model.
WARNING: This code will take some time to run!
If you have a particularly old or underpowered computer, beware.
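The model, grid, and cross validator could be wired together as in the sketch below. The tree counts, fold count, seed, and area under ROC as the scoring metric are all assumptions, and the training rows are generated so the example is self-contained and quick to run.

```scala
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cv-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// 60 made-up rows in which the first feature separates the two classes.
val train = (0 until 60).map { i =>
  val label = (i % 2).toDouble
  (label, Vectors.dense(label * 10.0 + (i % 5), (i % 7).toDouble))
}.toDF("label", "features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(42L)

// Only the number of trees is varied.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(10, 20))
  .build()

// The evaluator defines the metric used to pick the best parameters.
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("areaUnderROC")

val cv = new CrossValidator()
  .setEstimator(rf)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(evaluator)
  .setNumFolds(3)

val model = cv.fit(train)

// The winning forest, refit on all of the training data.
val best = model.bestModel.asInstanceOf[RandomForestClassificationModel]
println(best.featureImportances)
```

Training cost grows with the grid size times the number of folds, which is why the real run can take a while.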
Now we have a trained, cross-validated model!
You can explore the attributes and methods of the model by typing “model.” and then pressing Tab on your keyboard (note the period after the word model).
I encourage you to spend some time here to get a sense of what this model object is and what it can do.
It’s time for some model evaluation.
This is a little more difficult because the evaluation functionality still mostly resides in the RDD-API for Spark, requiring some different syntax.
Let’s begin by getting predictions on our test data and storing them.
We will then convert these results to an RDD.
Then we can create our metrics objects and print out the confusion matrix.
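The conversion to an RDD and the confusion matrix might be sketched like this. In the real workflow `predictions` would come from `model.transform(test)`; here it is a made-up table of prediction and label pairs so the example stands alone.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("confusion-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up (prediction, label) pairs standing in for model.transform(test).
val predictions = Seq(
  (1.0, 1.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0),
  (0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)
).toDF("prediction", "label")

// The RDD-based metrics classes expect an RDD of (prediction, label) tuples.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)

// Rows are actual classes, columns are predicted classes, in ascending label order.
println(metrics.confusionMatrix)
```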
Now we have some results!
You can use the numbers in the confusion matrix to calculate your various metrics.
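The metrics mentioned in this part of the walkthrough can be sketched three ways: by hand from the confusion-matrix counts, through `MulticlassMetrics`, and, for AUC and AUPRC, through `BinaryClassificationMetrics`. All pairs below are made up; with a real model the scores would be the predicted probability of the positive class.

```scala
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metrics-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Made-up (prediction, label) pairs.
val predictionAndLabels = sc.parallelize(Seq(
  (1.0, 1.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0),
  (0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)
))
val metrics = new MulticlassMetrics(predictionAndLabels)

// By hand: rows of the matrix are actual classes, columns are predicted.
val cm = metrics.confusionMatrix
val truePos = cm(1, 1)
val falsePos = cm(0, 1)
val falseNeg = cm(1, 0)
val manualPrecision = truePos / (truePos + falsePos)
val manualRecall = truePos / (truePos + falseNeg)

// The same numbers from Spark, one label at a time.
metrics.labels.foreach { label =>
  println(s"label $label: precision=${metrics.precision(label)} recall=${metrics.recall(label)}")
}
println(s"accuracy=${metrics.accuracy}")

// Made-up (score, label) pairs for the threshold-free metrics.
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.8, 1.0), (0.4, 1.0), (0.3, 0.0),
  (0.2, 0.0), (0.6, 0.0), (0.1, 0.0), (0.7, 1.0)
))
val binaryMetrics = new BinaryClassificationMetrics(scoreAndLabels)
val auc = binaryMetrics.areaUnderROC()
val auprc = binaryMetrics.areaUnderPR()
println(s"AUC=$auc AUPRC=$auprc")
```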
Spark will do this for us and print them out, but the syntax is bulky.
We can also calculate more sophisticated metrics such as AUC and AUPRC.
Et voilà!
We have trained and evaluated our classifier!
I hope you see that using Apache Spark for machine learning is only a bit more complicated than using libraries such as Sklearn or H2O.
This extra effort pays off by allowing you to work with big data.
I encourage you to play around with the different models available in the Spark ML library.
Thanks for reading.