In that article, I talked about why EDA is important in data science and how data can be explored and visualized in a simpler way to give meaningful insights to you, or potentially your stakeholders.
To understand your data and communicate results to stakeholders, data visualization is essential: it gives your data a story to tell.
Since the common scenarios here span different types of datasets, this article focuses on showing and explaining what the code does and what the plots show, so that you can plug and play easily in your own projects.
By the end of this article, I hope you’ll find the code useful and that it makes your data visualization process more fun, faster, and more effective!

Let’s get started!

Background of Dataset

Throughout this article, we’ll use the E-Commerce dataset obtained from Kaggle for data visualization (more detailed information on the data can be found here).
In short, the data consists of transactional data with customers in different countries who make purchases from an online retail company based in the United Kingdom (UK) that sells unique all-occasion gifts.
The following code can be generalized to other datasets with some minor adjustments, depending on your needs.
The goal here is to show you how I usually perform data visualization given some generic dataset.
Also, the code is by no means an exhaustive compilation covering all kinds of plots, but it should be sufficient to get you started.
The data shown here has also gone through some data cleaning so that we can use it directly and focus on data visualization.
In case you want to know how the data cleaning was done, you can always refer to this article written previously.
The Jupyter notebook and clean data for this data visualization have been uploaded to my GitHub.

[Figure: Snapshot of how the data looks]

Each column is pretty self-explanatory given that we’re dealing with typical e-commerce data.
Let’s see what we can do to visualize this data!

My Little Toolbox for Data Visualization

1. Boxplot: Unit Price

[Figure: Boxplot for unit price of the items]

Unit price here means the price of each item.
In the e-commerce world, we are curious about the spread of unit prices, and a boxplot is a quick way to see how they are distributed.
We used Seaborn for the boxplot (one of my favourite tools!): a single line of code draws it, and the rest is purely for labelling.
From the plot we see that the majority of unit prices are below $800, while the highest can reach more than $8000.
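As a rough illustration of the "one line plus labelling" idea, here is a minimal sketch. The tiny DataFrame and the column name `unit_price` are made up for the example and are not taken from the cleaned dataset, so adjust them to match your own data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in data; the column name `unit_price` is an assumption.
df = pd.DataFrame(
    {"unit_price": [0.85, 1.25, 2.10, 2.95, 4.25, 7.95, 12.75, 295.0]}
)

fig, ax = plt.subplots(figsize=(10, 3))

# The single plotting call: a horizontal boxplot of the unit prices.
sns.boxplot(x=df["unit_price"], ax=ax)

# Everything below is labelling only.
ax.set_title("Unit Price Distribution")
ax.set_xlabel("Unit Price")
fig.tight_layout()
```

In a notebook the figure renders on its own; in a plain script you would typically finish with `plt.show()` or `fig.savefig(...)`.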
Let’s go for the next step.
2. Distribution Plot: Quantity Sold

[Figure: Distribution plot of quantity sold]

Again, we used Seaborn to do the distribution plot.
In this case, we only take quantities below 100 into account, as this is where the majority of the data lies.
We see that most items are sold in quantities of 30 or fewer.
What about the number of orders from each country?

3. Horizontal Bar Chart

[Figure: Bar chart of the number of orders for different countries]

Since the online retail company is based in the UK, it is no surprise that the United Kingdom has the highest number of orders.
Therefore, we intentionally left this country out for a more meaningful comparison among the other countries.
You may have noticed by now that dataframe.groupby is extremely useful when it comes to plotting continuous variables grouped by some categorical variable.
You can even plot directly from the dataframe without having to call matplotlib yourself.
Whether to use a vertical or horizontal bar chart depends on your needs.
We chose a horizontal bar chart in this case to show the name of each country more clearly.
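Here is a small sketch of the groupby-then-plot pattern. The column names `invoice_num` and `country` and the rows themselves are invented for illustration, and counting unique invoices is just one plausible way to define "number of orders".

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data; column names are assumptions.
df = pd.DataFrame(
    {
        "invoice_num": ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "D1"],
        "country": [
            "United Kingdom", "United Kingdom", "United Kingdom",
            "Germany", "Germany", "Germany",
            "France", "France",
            "Spain",
        ],
    }
)

# Count unique invoices per country.
orders = df.groupby("country")["invoice_num"].nunique()

# Leave out the UK so the other countries are comparable, then sort
# ascending so the largest bar ends up at the top of the chart.
orders = orders.drop("United Kingdom").sort_values()

# Plot straight from the pandas object; pandas draws with matplotlib for us.
ax = orders.plot(kind="barh", figsize=(8, 4))
ax.set_title("Number of Orders for Different Countries")
ax.set_xlabel("Number of Orders")
ax.set_ylabel("Country")
plt.tight_layout()
```

With many countries, horizontal bars leave room for long names on the y-axis without rotating any labels.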
We’ll see how a vertical bar chart can be used in the next section.
4. Vertical Bar Chart (With Annotation)

[Figure: Number of orders for different days]

Here comes the vertical bar chart with annotation.
Sometimes we may want to show a vertical bar chart with percentage annotations to show the share taken up by each category.
In our context, we want to know the number of orders for different days and look at their respective percentages for more insight.
The code for this plot annotates each bar with its percentage without cluttering the visual.
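One way to annotate percentages is to loop over the bars after plotting, as in the sketch below. The day labels and order counts are made up, and this is only one of several ways to do it, not necessarily how the original notebook does.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up order counts per day; both labels and numbers are assumptions.
orders_per_day = pd.Series(
    [100, 150, 250, 200, 200, 100],
    index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sun"],
)
total = orders_per_day.sum()

ax = orders_per_day.plot(kind="bar", figsize=(8, 4), rot=0)

# Write each bar's share of the total just above the bar.
for bar in ax.patches:
    pct = 100 * bar.get_height() / total
    ax.annotate(
        f"{pct:.1f}%",
        (bar.get_x() + bar.get_width() / 2, bar.get_height()),
        ha="center",
        va="bottom",
    )

ax.set_title("Number of Orders for Different Days")
ax.set_xlabel("Day")
ax.set_ylabel("Number of Orders")
plt.tight_layout()
```

Anchoring the text at the top centre of each bar with `va="bottom"` keeps the label just above the bar instead of overlapping it.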
5. Bar Chart & Line Plot (Combined)

[Figure: Combined bar chart and line plot showing the total amount spent for different months]

Finally, we want to know the total amount spent by customers (or total sales made) for each month.
At some point we may also want to know the percentage change between the current value and the previous one.
In this case, we can add a line plot that shows the percentage change from the previous month to the current month, all in one plot.
Use this combined plot wisely and sparingly, as packing too much information into one plot may confuse your audience.
Again, the usability of the combined plot depends on the situation and your needs.
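A combined plot like this can be sketched with a second y-axis, using `pct_change` from pandas for the month-over-month change. The monthly totals and labels below are invented for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly totals; values and month labels are assumptions.
monthly_sales = pd.Series(
    [100.0, 120.0, 90.0, 135.0],
    index=["2011-01", "2011-02", "2011-03", "2011-04"],
)

# Percentage change from the previous month (the first month has no
# predecessor, so its value is NaN and no point is drawn for it).
pct_change = monthly_sales.pct_change() * 100

fig, ax_bar = plt.subplots(figsize=(8, 4))

# Bars: total amount spent per month on the left y-axis.
ax_bar.bar(monthly_sales.index, monthly_sales.values, color="lightsteelblue")
ax_bar.set_xlabel("Month")
ax_bar.set_ylabel("Amount Spent")

# Line: percentage change on a second y-axis that shares the x-axis.
ax_line = ax_bar.twinx()
ax_line.plot(monthly_sales.index, pct_change.values, color="darkred", marker="o")
ax_line.set_ylabel("Change from Previous Month (%)")

ax_bar.set_title("Amount Spent and Monthly Change")
fig.tight_layout()
```

The two axes have different units, so labelling both clearly goes a long way toward avoiding the confusion mentioned above.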
Final Thoughts

Thank you for reading.
Data visualization is nothing but storytelling. Before you plot anything, ask yourself:
Who is your audience?
What are the takeaways that you want your audience to get from the visualization?
What are the actionable insights to be executed?

I hope this little toolbox helps you with your own data visualization in some way.
If you’re interested in learning how to visualize data and use storytelling to capture your audience’s attention and convey your ideas effectively, I strongly encourage you to check out the book Storytelling with Data: A Data Visualization Guide for Business Professionals.
As always, if you have any questions or comments, feel free to leave your feedback below, or reach me on LinkedIn.
Till then, see you in the next post!

Kin Lim Lee, Big Data Engineer, Micron Technology (LinkedIn)