Plotly Experiments - Column and Line plots

Naren Santhanam
Mar 13

Bar plots

In my previous post, I explained how to create scatterplots using Plotly, with examples from the King County housing dataset.
Another popular type of plot is the column plot or bar plot.
Unlike a scatterplot, which is used to compare two numerical variables against each other and examine relationships, bar / column plots are useful to investigate one or more numerical variables across different categories.
There are different types of bar plots — individual, clustered, stacked, etc.
The dataset

I will be using the Bay Area Bike Share dataset from Kaggle for this post.
The Bay Area Bike Share enables quick, easy, and affordable bike trips around the San Francisco Bay Area.
They make regular open data releases (this dataset is a transformed version of the data from this link), plus maintain a real-time API.
The dataset has various files that contain information on the different stations in the Bay Area, the trips taken each day along with the weather info.
I will be using the stations and trips datasets for this data exploration exercise.
Stations

Let us take a look at the stations dataset:

The dataset contains the name, location and capacity of each station in the Bay Area.
Which cities have the highest number of stations and capacity (docks)?

This can also be depicted through a ‘waterfall’ chart.
Let us see how.
Let us look at a “clustered” column chart, with the number of stations and docks for each city grouped together.
Trips

Let us examine the trips dataset now.
This contains information about the bike trips taken from and to the various stations that we saw above.
The start and end date columns contain date and time information that could be useful for our analysis.
Let us extract more information from these columns.
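One way to pull out these fields is with pandas' `.dt` accessor. This is only a sketch: the column names and the `month/day/year hour:minute` date format are assumptions about the trips file, and the three rows are placeholders.

```python
import pandas as pd

# Placeholder rows; the date format is an assumption about the raw file.
trips = pd.DataFrame({
    "start_date": ["8/29/2013 14:13", "12/25/2013 9:05", "5/26/2014 17:40"],
    "subscription_type": ["Subscriber", "Customer", "Subscriber"],
})

# Parse the text column into real timestamps.
trips["start_date"] = pd.to_datetime(trips["start_date"], format="%m/%d/%Y %H:%M")

# Derive the parts used in the charts that follow.
trips["month"] = trips["start_date"].dt.month
trips["day_of_week"] = trips["start_date"].dt.day_name()
trips["hour"] = trips["start_date"].dt.hour
trips["is_weekend"] = trips["start_date"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
```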
Let us ask some questions to answer through our visualizations.
- What is the distribution of duration of trips?
- What are the popular months / days / hours among bike renters?
- Which bike stations are the most popular?
- How does subscription type affect these parameters?

We’ll see charts answering each question above, along with variations caused by the difference in subscription types.
Duration distribution

Let us examine the distribution of the trip length through a histogram.
Now let us split the histogram by subscription type, and see if the trip duration varies between customers and subscribers.
Subscribers are users who use the bike share regularly and have membership with Bay Area Bike Share.
Customers, on the other hand, do not have memberships and use bikes on demand.
It is clear that customers tend to use bikes for longer durations than subscribers do!

Popular times for bike trips

When do people mostly take bike trips? Which hours of the day, days of the week and months of the year are the most popular among bikers? And how does that vary between customers and subscribers? Let us explore.
Popular months of the year

Let us start by plotting the number of trips by month.

Does subscription type differ by month? Let us see.
Overall, the month-wise analysis of the number of trips shows that ridership tends to be low in the winter months and gradually increases from spring through summer and fall.
The proportion of customers to subscribers does not seem to change much from month to month.
Let us now see how the day of week affects ridership.
Popular days of the week

On which days of the week are bike trips higher? Does the trend vary by weekday and weekend? Let us explore.
Since the dataset is large, I will focus the analysis on the last three months of 2013 only.
Clearly, the usage is lower on the weekends than it is on the weekdays.
How does subscription type differ on these days?

It looks like subscriber usage is higher on weekdays and lower on weekends.
I think it would be better to plot percentages in a stacked column rather than absolute numbers.
We can deduce the following from the analysis of the number of trips on different days of the week:

- Subscribers mostly tend to use the bikes during weekdays. This indicates that they may be using them for commuting to and from work (we can confirm this later when we do the analysis by hour).
- Customers mostly tend to use the bikes during weekends and holidays (in the above chart, you can see that customer usage was higher than subscriber usage on Christmas, even though it was a weekday).

Popular hours of the day

Let us now see during which hours of the day bikes are most heavily used.
Since this is a lot of data points, I will examine just one week's worth of data, say the first week of Dec 2013.
The number of trips tends to be higher during weekdays, especially during the morning and evening (8AM and 5PM).
We already saw that subscribers use the bikes mostly during weekdays.
This confirms our assumption that subscribers are people who use it mostly for their daily commute to and from work.
All the code used to generate the plots in this post is available on GitHub.
Line Plots

Line plots are generally useful to investigate the trend of a numerical variable over time.
Any time series analysis is incomplete without a line chart.
There are various flavors of line charts — with and without markers, area charts, stepped line charts, linear and smoothed lines, etc.
Let us explore these in this notebook.
In Plotly, line charts are just a variation of scatterplots, only with a line connecting the dots.
Thus, we will be using the Scatter (or Scattergl) trace for plotting purposes.
Let us use the same dataset and explore it using line plots.
To start with, let’s plot the number of trips by date.
Admittedly, the chart looks too crowded, as we have tried to cram in a wide time period into one chart.
Thankfully, Plotly provides a very handy tool called the rangeslider, which will enable the user to just select a specific timeframe very easily.
Let’s see how.
The ‘smaller’ graph that you can see below the x-axis is called the range slider.
You can click and drag the sliders to zoom in on a specific timeframe.
One can see that ridership tends to run low in winter, especially towards the end of the year.
Let us plot some moving averages to smooth out the curve and see the pattern.
The moving average curves clearly show how the ridership goes down during the winter months.
Markers + Lines

Let us now zoom into a specific window of the timeframe and do some analysis.
I will focus on Q2 2014.
We can see that there is a weekly, cyclic pattern to the data.
In Time Series analysis, this is referred to as ‘seasonality’.
Let us color the markers based on the day of week.
This chart clearly indicates that the ridership goes down on weekends.
Why was ridership low on May 26, 2014, though? The answer is here.
Another way to highlight the weekends is to draw boxes using shapes in Plotly.
Let’s see how.
Area charts

Area charts differ from line charts in that the area under the line gets shaded, giving the viewer a sense of the magnitude of the number being plotted.
For example, it may be more appropriate to plot stock price through a line chart, but market cap through an area chart.
Area charts are also useful when used in a stacked model, showing the difference between two numerical quantities in a shaded area.
Let us see the ridership count as an area chart to start with.
Let us view the split between customer and subscriber and plot the difference on an area chart.
We can also stack these categories on top of each other, which lets us compare the numbers.
Stepped Line Charts

A stepped line chart connects points through vertical and horizontal lines, instead of a straight line joining the two points.
This gives the user a view of where there are sharp increases and decreases and where the numbers hold steady.
I hope this post helped you learn how to plot different types of column, bar and line charts in Plotly.