Why And How To Use Merge With Pandas in PythonAdmond LeeBlockedUnblockFollowFollowingMar 2It doesn’t matter whether you’re a data scientist, data analyst, business analyst, or data engineer.
If you’ve been using Python in your work — especially for data preprocessing/cleaning — you’d have used Pandas in some ways.
Why “Merge”?You’d have probably encountered multiple data tables that have various bits of information that you would like to see all in one place — one dataframe in this case.
And this is where the power of merge comes in to efficiently combine multiple data tables together in a nice and orderly fashion into a single dataframe for further analysis.
“Merging” two datasets is the process of bringing two datasets together into one, and aligning the rows from each based on common attributes or columns.
The words “merge” and “join” are used relatively interchangeably in Pandas and other languages.
Despite the fact that Pandas has both “merge” and “join” functions, essentially they both do the similar things.
That said, we’ll focus solely on the “merge” function used in Pandas throughout this article.
What exactly is “Merge”?(Source)Looking at the documentation of the merge function (pandas version = 0.
1), it looks foreign and not easily understandable to readers (at least to me) at the first glance.
After seeking more information and explanation from some of my friends and online resources, I started understanding how this concept — merge could actually be explained in a much simpler way and began to appreciate the beauty of this merge function in Pandas.
To understand pd.
merge, let’s start with a simple line of code as below.
What this line of code does is to merge two dataframes — left_dfand right_df — into one based on their values with the samecolumn_name available in both dataframes.
With the how='inner', this will perform inner merge to only combine values in the column_name that match.
merge(left_df, right_df, on='column_name', how='inner'Since the method how has different parameters (by default Pandas uses inner), we’ll look into different parameters (left, right, inner, outer) and their use cases.
After that we’ll explain the method on and elaborate further on how to merge two dataframes if they have different column names (TIPS: method on will not be used).
The explanation below comes largely from the great tutorial by Shane Lynn and the data used (3 CSV files) comes from the KillBiller application.
You can get the data HERE.
Let’s get started!A Quick Look at the DataLet’s first understand the data sets used with the following explanation on each dataframe.
user_usage — A first dataset containing users monthly mobile usage statisticsuser_device — A second dataset containing details of an individual “use” of the system, with dates and device informationandroid_device — A third dataset with device and manufacturer data, which lists all Android devices and their model codeuser_usageuser_deviceandroid_deviceIt’s important to note here that:The column name use_id is shared between the user_usage and user_deviceThe device column of user_device and Model column of the android_device dataframe contain common codes1.
LEFT MergeKeep every row in the left dataframe.
Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.
LEFT MergeWith the operation above, left_merge has the same size as user_usage as we keep all the rows in the left dataframe using the left parameter for the method how .
LEFT Merge (Top 5 rows)LEFT Merge (Last 5 rows)As expected, the column use_id has already been merged together.
We also see that the empty values are replaced by NaN in the right dataframe — user_device.
RIGHT MergeTo perform the right merge, we just repeat the code above by simply changing the parameter of how from left to right.
RIGHT MergeWith the operation above, right_merge has the same size as user_device as we keep all the rows in the right dataframe using the right parameter for the method how.
RIGHT Merge (Top 5 rows)RIGHT Merge (Last 5 rows)This time, we see that the empty values are replaced by NaN in the left dataframe — user_usage.
INNER MergePandas uses “inner” merge by default.
This keeps only the common values in both the left and right dataframes for the merged data.
In our case, only the rows that contain use_id values that are common between user_usage and user_device remain in the merged data — inner_merge.
INNER MergeAlthough the “inner” merge is used by Pandas by default, the parameter inner is specified above to be explicit.
With the operation above, the merged data — inner_merge has different size compared to the original left and right dataframes (user_usage & user_device) as only common values are merged.
OUTER MergeFinally, we have “outer” merge.
The “outer” merge combines all the rows for left and right dataframes with NaN when there are no matched values in the rows.
OUTER MergeNotice that the method indicator is set to True in order to indicate where each row originates from in the merge data — outer_merge.
Specified rows in outer_mergeTo further illustrate how the “outer” merge works, we purposely specify certain rows of the outer_merge to understand where the rows originate from.
For the 1st and 2th rows, the rows come from both the dataframes as they have the same values of use_id to be merged.
For the 3rd and 4th rows, the rows come from the left dataframe as the right dataframe doesn’t have the common values of use_id.
For the 5th and 6th rows, the rows come from the right dataframe as the left dataframe doesn’t have the common values of use_id.
????.Merge Dataframes with Different Column NamesSo we’ve talked about how to merge data using different ways — left, right, inner, and outer.
But the method on only works for the same column name in the left and right dataframes.
Therefore, we use left_on and right_on to replace the method on as shown below.
LEFT Merge for dataframes with different columns namesHere we’ve merged user_device with android_device since they both contain common codes in their columns — device and Model respectively.
LEFT Merge for dataframes with different column namesFinal Thoughts(Source)Thank you for reading.
Merging data from different data tables is one of the most common tasks that we as a data professional have to do.
I hope this sharing is beneficial to you in some ways.
As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn.
Till then, see you in the next post!.????About the AuthorAdmond Lee is a Big Data Engineer at work, Data Scientist in action.
He is known as one of the highly sought-after data scientists and consultants in helping start-up founders and various companies tackle their problems using data with deep data science and industry expertise.
You can connect with him on LinkedIn, Medium, Twitter, and Facebook.
Admond Lee Kin Lim – Big Data Engineer – Micron Technology | LinkedInView Admond Lee Kin Lim's profile on LinkedIn, the world's largest professional community.
Admond has 12 jobs listed on…www.
com.. More details