If we quote the pandas documentation:By “group by” we are referring to a process involving one or more of the following steps:* Splitting the data into groups based on some criteria.
* Applying a function to each group independently.
* Combining the results into a data structure.
So basically it’s making groups out of your data based on some indicator, to enable yourself to do some actions on these groups.
1 Groupby #1: get total sumLets look at an example.
Say we want to get the total value of each group based on ID .
This would like like the following in Python code:Which will give us the following output:Aggregation with sumSo if we look at out DataFrame again, we can see that this is correct:Original dataframeFor example for ID A1 the total value is 100 + 120 = 220 ,which is correct.
2 Groupby #2: get the highest datePandas provides a big range of function you can use on your groups after using groupby.
Let’s look at one more.
For example, we can get the highest date per group by using the .
That would look like this:And would give us the following output:Aggregation with max date4.
JoinsJoins are combining two dataframes on a side by side manner based on a common column.
Most of the time these columns are referred to askey columns .
The term joinis originated from the database language SQL, and was needed because the data modelling of SQL databases is mostly done by using relational modelling.
There are many types of joins, and your output will be based on which type of join your perform.
Because this is an introductionary tutorial, we will look at the most common one: inner join.
In later parts of these series we will look at more complex joins.
The inner join is derived from venn diagrams which represents inner (intersection) part of both sets.
So when we translate this to our data, an inner join returns the rows which are present in both dataframes.
1 Our datasetsBecause we want to combine two dataframes, we will create new data.
These two imaginary dataset represent customers master table and an orders table.
With the following code we create two new dataframes:And they look like the following:dfA: Customers masterdfB: OrdersSo one logical analysis we could do on this new data, would be to get the names and city of the customers next to each order in the orders table.
This is a typical join problem, matching two dataframes row-wise and enriching the data with more columns.
In this case, our key-column is the Customer_ID.
In pandas we use the merge method for joining.
We will pass the following arguments to this method:Which dataframes you want to join (dfA, dfB).
What are the key columns (Customer_ID).
Which type of join you want to perform (Inner).
There are more arguments we can use in the merge method than the ones listed above, but for now these are sufficient.
The merge we want to perform looks like following in pandas:And the output is as we expected, the name and city columns are added next to each corresponding customer_ID.
Inner joinSo that was it for this part: basic data analysis techniques every data analyst should know, using Python.
You can find the code of this article on my GitHub in the form of a Jupyter Notebook: LinkIf this article was useful for you, please consider giving this article a like and share this with friends and/or colleagues.
For any questions or other discussion, feel free to comment.
Expect part II soon, where we go a bit more advanced.