Finding Burgers, Bars and The Best Yelpers in Town
Jagerynn Ting Verano, Jun 12
A Digestible PySpark Tutorial for Avid Python Users: Part 1
Photo by Eaters Collective on Unsplash

Some time back, Yelp made the move to share its dataset with the public.
As a Yelp user, when I learnt that it included a wealth of information on businesses, user reviews and user characteristics, I was eager to work on it.
However, I quickly hit a roadblock when I realized that some of the files were too big to be uploaded onto my Jupyter Notebook.
As someone with no prior coding experience who was just beginning to learn the basics of data analysis and machine learning, I decided to put that on hold.
Fast forward to yesterday.
After spending the weekend trying to set up PySpark in Jupyter, I found out that Google Colab provides a much simpler solution to working with big data via PySpark, without all the frustrations of downloading and working with it locally.
Naturally, I decided to work with the Yelp users dataset.
So again, I’ve decided to put the puzzle pieces together to make the journey as painless for someone else as I can make it.
Whether you’ve stumbled across this article, or are facing similar frustrations, I’ve put together a menu for you below, so you can decide for yourself if this is worth your stay.
Menu
Photo by Louis Hansel on Unsplash

Appetizer
- An understanding of how PySpark handles big data (if not, this video on latency should suffice)

Main Course
- Importing PySpark onto Colab
- Common operations on PySpark DataFrame objects
- Data Visualization
- Modeling with MLlib

Dessert
- Model Comparison / Model Selection

How PySpark Works: A Brief Recap
In summary, PySpark is the Python API for Apache Spark, a distributed computing framework: it allows data to be processed in parallel by distributing it across several nodes.
Figure: Distributed Data Parallelism

In general, operations in memory are computationally cheaper than those that go over the network or to disk.
Spark gives users the option of significantly faster computing than systems like Hadoop by shifting as many operations as possible into memory and minimizing the amount of work done over the network, thereby reducing network traffic.
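To make the idea of data parallelism concrete, here is a minimal pure-Python sketch. It is not Spark itself: a list is split into partitions, each partition is processed independently, and the partial results are combined at the end. The function names are hypothetical, and a thread pool merely stands in for worker nodes, so this shows the pattern rather than any real cluster speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split data into num_partitions roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work done independently on one partition (the 'map' side)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, num_partitions=4):
    """Process each partition on its own worker, then combine the partial results."""
    chunks = partition(data, num_partitions)
    # Threads stand in for cluster nodes here; this is an illustration only.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)  # the 'reduce' side


total = parallel_sum_of_squares(list(range(10)), num_partitions=3)
print(total)
```

Only the small partial sums travel back to be combined, which is the same reason Spark tries to keep the heavy work close to where each partition lives.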
A Smooth and Easy PySpark Setup on Colab
Photo by Fahrul Azmi on Unsplash

Here is a list of libraries you’ll need to get started:
- os: to set your environment variables
- findspark: makes PySpark importable
- pyspark.sql.SparkSession: to access Spark functionality and work with Spark DataFrames
- google.colab.drive: to access the data file on my Drive
- the file type you’re working with (in my case, it’s json)

Asif Ahmed wrote a great article which I referenced to aid me in the installation of PySpark.
This process is universal, so anyone can use the same block of code to download PySpark onto Colab.
The only thing you should take note of is the version of Spark you’re downloading.
It should be the latest one available.
I’ve made a comment where this is relevant, and documented some of what is going on in the code below.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.[...]tgz #based on latest version
!tar xf spark-2.[...]tgz
!pip install -q findspark

#setting the environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.[...]7" #based on latest version

#finds PySpark to make it importable
findspark.init()
spark = SparkSession.[...]getOrCreate()

To avoid the long wait time required to upload a huge file onto Colab, I uploaded my files onto Google Drive and gave my notebook access to it.

drive.mount('/content/drive') #produces a link with instructions to enter an authentication code

Once you’ve entered the authentication code, click on the arrow button on the left of your Colab Notebook, and search for your file.
Figure: Accessing Your Data File Via The Arrow Button

Once you’ve found your file, right click on it and copy its path.
Pass the path as a string to spark.read.json().
Disclaimer: this will not work if you don’t add a “/” at the beginning of the path you copied!

To my pleasant surprise, spark.read.json() automatically infers the schema of my nested JSON file, and converts it into a PySpark DataFrame.
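The setup and loading steps can be put together roughly as follows. This is a minimal sketch rather than the exact code I ran: the helper names are hypothetical, the `local[*]` master is an assumption, and the Spark imports are kept inside the last function so the two small helpers can be tried without a Spark installation.

```python
import os


def configure_spark_env(java_home, spark_home):
    """Point the JVM and Spark environment variables at the installed locations."""
    os.environ["JAVA_HOME"] = java_home
    os.environ["SPARK_HOME"] = spark_home


def as_absolute_path(copied_path):
    """Add the leading "/" that a path copied from the Colab file browser may lack."""
    return copied_path if copied_path.startswith("/") else "/" + copied_path


def read_json_from_drive(copied_path):
    """Start (or reuse) a Spark session and load a JSON file into a DataFrame.

    Assumes configure_spark_env() has been called and Drive is already mounted.
    """
    import findspark

    findspark.init()  # uses SPARK_HOME to make pyspark importable
    from pyspark.sql import SparkSession

    # master("local[*]") is an assumption: run Spark on all cores of the Colab VM.
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    return spark.read.json(as_absolute_path(copied_path))
```

You would call `configure_spark_env` with the Java path from the block above and whichever Spark folder you unpacked, then pass the copied Drive path to `read_json_from_drive`.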
Exploration
Photo by Andrew Neel on Unsplash

Below is a list of methods I’ve applied to my PySpark DataFrame to explore my dataset.

- df.printSchema() prints information about the dataframe in a neat tree format.
  Figure: Example of a dataframe schema in tree format
- df.select() is a method for querying. The columns you want to display and SQL-like expressions are passed through this method. Naming columns is not necessary; however, passing only a condition returns a column of boolean values rather than the matching rows. Either way, select() itself returns a new dataframe; the result only becomes a list of rows once collect() is called on it.
- df.collect() gathers the partitions that were distributed earlier and returns them as a list of rows. Avoid calling it until you need collated results.
- udf, or a user-defined function, is a function that can be applied to every value in a dataframe column. Two arguments are needed: the transformation and the expected data type of your transformed variable. UDFs are used in conjunction with dataframe methods such as df.withColumn(). In addition, beyond employing the functions found in pyspark.sql.functions, it is possible to define your own functions and pass them through the udf method.
- spark.createDataFrame() transforms the output of a collected query (a list of rows) into a dataframe.
- df.show() displays the first 20 rows of a PySpark dataframe.
- df.show(n) displays only the first n rows of the dataframe.
  Figure: Output of df.show() on Colab
- df.withColumn() returns a new dataframe consisting of the original dataframe and a new column with a specified operation. The two arguments required for this method are the new column name and the operation.
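
The udf and withColumn() entries above can be sketched as follows. This is an illustration under assumptions, not my original code: the new column name `friend_count` is hypothetical, and unlike a bare `len(z.split(", "))` the helper treats an empty or missing string as zero friends.

```python
def count_friends(friends):
    """Count entries in a comma-separated string; empty or missing input counts as 0."""
    if not friends or not friends.strip():
        return 0
    return len([name for name in friends.split(",") if name.strip()])


def add_friend_count(df, source_col="friends", new_col="friend_count"):
    """Return a new DataFrame with an integer column holding the number of friends.

    PySpark is imported here so count_friends can be tried without Spark.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # First argument: the transformation; second: the type of the values it returns.
    count_udf = udf(count_friends, IntegerType())
    return df.withColumn(new_col, count_udf(df[source_col]))
```

Python UDFs are convenient, but each value has to pass between the JVM and Python, so the built-in functions in pyspark.sql.functions are usually faster when one of them fits.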

Example Use Case
The dataset gave me a list of friends per user, and from it I created a column with the number of friends per user.

#create a user-defined function
split_count = udf(lambda z: len(z.split(", ")), IntegerType())
#transform and attach a new column to the original dataframe
df_user1 = df_user.[...]friends))

Visualization
Calling the display function lets you build visualizations directly from a PySpark dataframe.
Alternatively, by calling .toPandas(), you can employ seaborn, matplotlib and any other visualization library you like.

Example Use Case
Histogram of the Average Rating By Each User

avg_stars_query = df_user.[...]collect()
avg_stars_df = spark.[...]toPandas(), bins = 4, kde = False)

Building a model with MLlib
PySpark’s ML library is very similar to sklearn’s, with some minor differences.
Instead of train_test_split, for example, randomSplit is used.
Here is a general breakdown of model building with MLlib:
- merge features into one column with pyspark.ml.feature.VectorAssembler or pyspark.mllib.linalg.DenseVector
- scale features, e.g. StandardScaler
- perform a train-test split using randomSplit
- train the model with 1 “features” column and 1 “label” column
- evaluate with metrics, e.g. RegressionMetrics

Use Case
Photo by Marvin Meyer on Unsplash

I was curious to know the different profiles of Yelpers, so I performed K-Means clustering and assessed the WSSE (within-set sum of squared errors) at varying levels of k.
The optimal k is found at the “elbow”, where the curve bends most sharply, in a way that imitates the bend of an arm.
This is known as the elbow method.
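
Here is a rough sketch of the workflow and the elbow search described above. It makes several assumptions: the function names are hypothetical, `model.summary.trainingCost` is assumed to be available (Spark 2.4 or later), and the elbow is located with a simple heuristic (the point farthest below the straight line joining the first and last costs) rather than by eye as I did.

```python
def find_elbow(costs_by_k):
    """Return the k whose cost lies farthest below the line joining the
    first and last (k, cost) points, a common heuristic for the elbow."""
    ks = sorted(costs_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three values of k to locate an elbow")
    k_first, k_last = ks[0], ks[-1]
    c_first, c_last = costs_by_k[k_first], costs_by_k[k_last]
    slope = (c_last - c_first) / (k_last - k_first)

    def gap(k):
        line_value = c_first + slope * (k - k_first)
        return line_value - costs_by_k[k]

    return max(ks[1:-1], key=gap)


def kmeans_costs(df, feature_cols, k_values, seed=42):
    """Fit K-Means for each k on scaled features and return {k: training cost}.

    Requires an active Spark session; imports are deferred so find_elbow
    can be tried on its own.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                            withMean=True, withStd=True)
    prepared = Pipeline(stages=[assembler, scaler]).fit(df).transform(df)
    prepared.cache()  # reused once per k

    costs = {}
    for k in k_values:
        model = KMeans(k=k, seed=seed, featuresCol="features").fit(prepared)
        costs[k] = model.summary.trainingCost
    return costs


example_costs = {2: 100.0, 3: 60.0, 4: 30.0, 5: 27.0, 6: 25.0, 7: 24.0}
print(find_elbow(example_costs))
```

The costs in `example_costs` are made up for illustration. For a supervised model, a call such as `df.randomSplit([0.8, 0.2], seed=42)` would come before fitting; clustering has no label column, so the split is left out here.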
Despite finding an optimal level of k, analyzing the clusters can still be tough with so many features.
Dimensionality reduction would be useful in simplifying the model.
Dimensionality Reduction
Figure: Correlation Heatmap of Variables Based on Sample (n = 500,000)

It seems like there isn’t a clear-cut way to select features based on this heatmap, since the majority had a correlation coefficient of [...].
In addition, I tinkered with dimensionality reduction in PySpark, but found no direct way to retrieve feature names or rank features by importance for the purposes of interpretation.
Yet another solution would be to transform the variables.
By merging the compliments the user gave into one column and the compliments the user received into another, I was able to reduce the number of features to 6.
Figure: Elbow Method: Finding the Optimal Number of Clusters

In the graph above, the number of clusters ranges from 2 to 24.
The elbow occurs at k = 6, i.e. 6 distinct user profiles can be found.

Alternative Solutions

Alternative Clustering Algorithms
The other algorithms offered by PySpark’s MLlib package include Gaussian mixture models, LDA (often used in text mining) and bisecting k-means.
Bisecting k-means is a hierarchical clustering algorithm that employs a top-down approach to splitting the data, and is preferable for big datasets.

Subsampling
Subsampling, say, about a third of the data (about 500,000 cases) and calling .toPandas() will allow you to perform feature selection and conduct training in scikit-learn on the newly converted Pandas dataframe.
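
A minimal sketch of that subsampling step, assuming a PySpark DataFrame `df` is passed in. The helper names are hypothetical. Since `DataFrame.sample` takes a fraction rather than a row count, the first helper converts a target number of rows into a fraction.

```python
def sample_fraction(target_rows, total_rows):
    """Fraction to pass to DataFrame.sample to get roughly target_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, target_rows / total_rows)


def subsample_to_pandas(df, target_rows, seed=42):
    """Draw a random sample without replacement and convert it to pandas."""
    fraction = sample_fraction(target_rows, df.count())
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)
    return sampled.toPandas()  # collects the sample onto the driver
```

Note that the sample size is only approximate, and the result of `toPandas()` has to fit in the driver’s memory.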

Bonus: Hyperparameter Tuning
Photo by Rodion Kutsaev on Unsplash

PySpark also enables users to select the best models by providing pipelining and hyperparameter tuning functionalities.
Here are some terms you are likely to come across while selecting your model:
- Estimators: algorithms or a pipeline
- Evaluators: evaluation metrics
- Cross-validation and train-validation split
- Named parameters, i.e. paramMaps, each holding a specified set of (parameter, value) pairs
- Parameter grid constructor, i.e. ParamGridBuilder

Note: there is also a PySpark-sklearn library for grid searching.
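
To show how these pieces fit together, here is a hedged sketch. `expand_grid` is a hypothetical pure-Python helper that mimics what a parameter grid contains; `build_cross_validator` wires real MLlib classes together, but the choice of K-Means, the silhouette-based ClusteringEvaluator and the parameter values are my assumptions for illustration, not something I ran for this article.

```python
from itertools import product


def expand_grid(param_values):
    """List every (parameter, value) combination, one dict per candidate model."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def build_cross_validator(k_values, max_iters, num_folds=3):
    """Wire an estimator, a parameter grid and an evaluator into a CrossValidator.

    Requires an active Spark session; imports are deferred so expand_grid
    can be tried on its own.
    """
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    kmeans = KMeans(featuresCol="features", seed=42)        # the estimator
    grid = (ParamGridBuilder()                               # the paramMaps
            .addGrid(kmeans.k, k_values)
            .addGrid(kmeans.maxIter, max_iters)
            .build())
    evaluator = ClusteringEvaluator(featuresCol="features")  # silhouette score
    return CrossValidator(estimator=kmeans, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=num_folds)


grid = expand_grid({"k": [4, 6, 8], "maxIter": [10, 20]})
print(len(grid))
```

Calling `.fit()` on the returned CrossValidator trains one model per grid entry per fold, which is why grid searching gets expensive quickly on a single machine.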

My Review
Photo by Alex Ware on Unsplash

Training the model on my laptop was very time consuming.
Several attempts at other clustering algorithms and grid searching were not feasible simply because of memory constraints.
Nevertheless, PySpark is a great tool if you want to be able to handle large data for free, and many companies use it today.
I’ve also recently learnt of the existence of PySparkling, which is a combination of H2O, an automated, easy-to-use machine learning platform, and PySpark.

What You Can Be Looking Forward To
In part 2, we will walk through cluster analysis and present to you the findings from each cluster.
In summary, we recapped how PySpark handles big data, how to set the system up on Colab, some of the common methods used when working with dataframes (as opposed to RDDs), and how to train a model in PySpark.
Hope you enjoyed reading this article! Feel free to leave your thoughts (or tips) below.

References
- Asif Ahmed, PySpark in Google Colab, Towards Data Science
- Machine Learning Library (MLlib) Guide, Apache Spark 2.[...]