Filter, Aggregate and Join in Pandas, Tidyverse, Pyspark and SQL

Yu Zhou · Nov 18, 2018

Alike but different (Source)

The Blessing and Curse

One of the most popular questions asked by aspiring data scientists is which language they should learn for data science.
The initial choices are usually between Python and R.
There are already a lot of conversations to help make a language choice (here and here).
Selecting an appropriate language is the first step, but I doubt most people end up using only one language.
In my personal journey, I learned R first, then SQL, then Python, and then Spark.
Now in my daily work, I use all of them because I find unique advantages in each of them (in terms of speed, ease of use, visualization and others).
Some people do stay in only one camp (some Python people never think about learning R), but I prefer to use what's available to create my best data solutions, and speaking multiple data languages helps my collaboration with different teams.
As the State of Machine Learning points out, we face the challenge of supporting multiple languages in data teams. One downside of working across languages is that I confuse one with another, as they may have very similar syntax.
Respecting the fact that we enjoy a data science ecosystem with multiple coexisting languages, I would love to write down three common data transformation operations in these four languages side by side, in order to shed light on how they compare at syntax level.
Putting syntax side by side also helps me synergize them better in my data science toolbox.
I hope it helps you as well.
Three Data Operations in Four Languages

The data science landscape is expansive, so I decided to focus on the largest common denominator in this post.
Without formal statistics to back this up, I take the following for granted: most data science work is on tabular data, and the common data languages are SQL, Python, R and Spark (not Julia, C++, SAS, etc.).
SQL has a long list of dialects (Hive, MySQL, PostgreSQL, Cassandra and so on); I choose ANSI-standard SQL in this post.
Pure Python and base R are capable of manipulating data, but I choose Pandas for Python and Tidyverse for R in this post.
Spark has RDDs and DataFrames; I choose to focus on DataFrames.
Spark has APIs in Pyspark and Sparklyr; I choose Pyspark here, because the Sparklyr API is very similar to Tidyverse.
The three common data operations include filter, aggregate and join.
These three operations allow you to cut and merge tables, derive statistics such as average and percentage, and get ready for plotting and modeling.
As data wrangling consumes a high percentage of data work time, these three common data operations should account for a large percentage of data wrangling time.
Use criteria to filter tables

Inclusion and exclusion are essential in data processing.
We keep the relevant and discard the irrelevant.
We can call inclusion and exclusion with one word: filter.
Filter has two parts in tabular data.
One is filtering columns and the other is filtering rows.
Using the famous Iris data in this post, I list the filter operations in the four languages below.
Question: What is the sepal length and petal length of setosa with petal width larger than 1?

# SQL
select Sepal_Length, Petal_Length from Iris where Petal_Width > 1 and Species='setosa';

# Pandas
Iris[(Iris.Petal_Width > 1) & (Iris.Species=='setosa')][['Sepal_Length','Petal_Length']]

# Tidyverse
Iris %>% filter(Petal_Width > 1, Species=='setosa') %>% select(Sepal_Length, Petal_Length)

# Pyspark
Iris.filter((Iris.Petal_Width > 1) & (Iris.Species=='setosa')).select(Iris.Sepal_Length, Iris.Petal_Length)

Pandas uses brackets to filter columns and rows, while Tidyverse uses functions.
The Pyspark API borrows the best from both Pandas and Tidyverse.
As you can see here, this Pyspark operation shares similarities with both Pandas and Tidyverse.
SQL is declarative as always, showing up with its signature “select columns from table where row criteria”.
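To make the Pandas bracket idiom concrete, here is a minimal runnable sketch on a tiny stand-in for the Iris table (the column names follow the post; the data values are made up for illustration). It also shows an equivalent `.loc` form, which selects rows and columns in a single step.

```python
import pandas as pd

# A tiny stand-in for the Iris table (column names follow the post; values are illustrative).
iris = pd.DataFrame({
    "Sepal_Length": [5.1, 6.3, 5.0, 5.9],
    "Petal_Length": [1.4, 4.9, 1.6, 5.1],
    "Petal_Width":  [0.2, 1.5, 1.2, 1.8],
    "Species":      ["setosa", "versicolor", "setosa", "virginica"],
})

# Bracket-style filter: a boolean mask selects rows, then a list of labels selects columns.
result = iris[(iris.Petal_Width > 1) & (iris.Species == "setosa")][["Sepal_Length", "Petal_Length"]]

# Equivalent .loc form: rows and columns in one step.
result_loc = iris.loc[(iris.Petal_Width > 1) & (iris.Species == "setosa"),
                      ["Sepal_Length", "Petal_Length"]]
```

Both forms return the same frame; `.loc` avoids the intermediate copy created by chained indexing.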
Derive aggregate statistics by groups

Creating summary statistics such as count, sum and average is essential to data exploration and feature engineering.
When categorical variables are available, it is also very common to group summary statistics by certain categorical variables.
Question: How many sepal length records does each Iris species have in this data, and what is their average sepal length?

# SQL
select Species, count(Sepal_Length) as Sepal_Length_Count, avg(Sepal_Length) as Sepal_Length_mean from Iris group by Species;

# Pandas
aggregated = Iris.groupby('Species').agg({'Sepal_Length': ['mean', 'count']})
aggregated.columns = ['_'.join(tup).rstrip('_') for tup in aggregated.columns.values]

# Tidyverse
Iris %>% group_by(Species) %>% summarize(Sepal_Length_mean=mean(Sepal_Length), Count=n())

# Pyspark
from pyspark.sql import functions as F
Iris.groupBy('Species').agg(F.mean('Sepal_Length').alias('sepal_length_mean'), F.count('Sepal_Length').alias('sepal_length_count'))

This example is a good one to show why I get confused by the four languages.
There are four slightly different ways to write "group by": group by in SQL, groupby in Pandas, group_by in Tidyverse and groupBy in Pyspark. (In Pyspark, both groupBy and groupby work, as groupby is an alias for groupBy; groupBy looks more authentic as it is used more often in the official documentation.)
In terms of aggregation, Pandas is very different here. One, it uses a dictionary to specify aggregation operations. Two, it uses the group-by variable as the index by default, and you may have to deal with a MultiIndex.
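The two Pandas quirks above can be seen end to end in a small sketch (the tiny frame is made up for illustration): the dictionary-style `.agg` produces a column MultiIndex, which is then flattened into plain names, and `reset_index()` turns the group key back into an ordinary column.

```python
import pandas as pd

# Tiny illustrative frame standing in for the Iris table.
iris = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica"],
    "Sepal_Length": [5.0, 5.2, 6.0],
})

# Dictionary-style aggregation: keys are columns, values are lists of functions.
aggregated = iris.groupby("Species").agg({"Sepal_Length": ["mean", "count"]})

# The result has a column MultiIndex like ('Sepal_Length', 'mean');
# joining each tuple with '_' flattens it into plain column names.
aggregated.columns = ["_".join(tup) for tup in aggregated.columns.values]

# groupby put Species into the index; reset_index() makes it a column again.
aggregated = aggregated.reset_index()
```

After these two steps the result looks like the SQL or Tidyverse output: one row per species with flat column names.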
Join tables to put features together

One hallmark of big data work is integrating multiple data sources into one source for machine learning and modeling, so the join operation is a must-have.
There is a list of joins available: left join, inner join, outer join, left anti join and others.
Left join is used in the following example.
Question: Given a table Iris_preference that has my preference on each species, can you join this preference table with the original table for later preference analysis?

# SQL
select a.* from Iris a left join Iris_preference b on a.Species=b.Species;

# Pandas
pd.merge(Iris, Iris_preference, how='left', on='Species')

# Tidyverse
left_join(Iris, Iris_preference, by="Species")

# Pyspark
Iris.join(Iris_preference, ['Species'], 'left_outer')

It is really amazing that we have many ways to express the same intention, both in programming languages and in natural languages.
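A minimal runnable sketch of the Pandas version, using a made-up preference table, also shows the defining behavior of a left join: every row of the left table survives, and species without a preference get NaN.

```python
import pandas as pd

iris = pd.DataFrame({
    "Species": ["setosa", "versicolor", "virginica"],
    "Sepal_Length": [5.0, 5.9, 6.3],
})

# Hypothetical preference table; only two of the three species are rated.
iris_preference = pd.DataFrame({
    "Species": ["setosa", "virginica"],
    "Preference": [10, 7],
})

# Left join: keep every row of iris; unmatched keys get NaN in Preference.
joined = pd.merge(iris, iris_preference, how="left", on="Species")
```

The versicolor row comes through with a missing Preference, which is exactly what the SQL, Tidyverse and Pyspark left joins above would produce.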
Cheat Sheets for Data Wrangling

Language abundance is a blessing and a curse.
It may not be a bad idea for the community to standardize the APIs for the common data operations across languages in the future, which could reduce friction and increase portability.
As of now, I survey the filter, aggregate and join operations in Pandas, Tidyverse, Pyspark and SQL to highlight the syntax nuances we deal with most often on a daily basis.
The side-by-side comparisons above can not only serve as a cheat sheet to remind me of the language differences but also help me with my transitions among these tools.
I have a more concise version of this cheat sheet at my git repository here.
Additionally, you may find other language-specific cheat sheets helpful; they are from Pandas.org, RStudio and DataCamp.

Data Wrangling with Pandas: link
Data Wrangling with dplyr and tidyr: link
Python for Data Science Pyspark: link

If you want to add extra pairs of data operations into my cheat sheet, let's connect!