Exploring Toronto Bike Share Ridership using Python

Yizhao Tan, May 16

Over the past few years, Bike Share has become an increasingly attractive way to get around the city.
Considering the amount of bike theft in my neighbourhood, Bike Share’s speedy expansion of station coverage, and rising TTC fares, it was an easy decision for me to renew my membership for the third year this March.
People were making similar decisions in the US as well: in 2017, cyclists in the US took 35 million trips through a bike-sharing program, a 25% increase from 2016.
Meanwhile, Bike Share Toronto saw an 81% ridership increase during the same time period.
I wanted to explore the Bike Share Ridership dataset to better understand how Torontonians are using Bike Share and to demonstrate some techniques that can be used to explore the other 294 (and counting) datasets released by the Open Data Program.
About the Data

Bike Share collects data using systems provided by third-party organizations.
The availability of historical ridership data is limited due to a change of provider in 2016.
I decided to focus on the data from 2017 instead.
The data can be retrieved from the Open Data Catalog or from the new Portal (the new site replacing the Catalog later this year).
The data is a ZIP file containing multiple CSVs, separating the yearly data by quarters.
An initial review of the Q1 data shows that each record represents a unique trip and contains the following columns:

- trip_id: unique identifier for each trip
- trip_start_time: time when the trip started
- trip_stop_time: time when the trip ended
- trip_duration_seconds: duration of the trip measured in seconds
- from_station_id: station ID where the trip started
- from_station_name: name of the station where the trip started
- to_station_id: station ID where the trip ended
- to_station_name: name of the station where the trip ended
- user_type: identifies whether the user has a membership or purchased a pass

However, further exploration of the data in the other files reveals a few issues:

- Datetime format varies between files (eg. 31/12/2017 12:30 vs. 12/30/17 12:30:00)
- Only Q1 and Q2 data contain station IDs
- Inconsistent station names (eg. Lake Shore Blvd vs. Lakeshore Blvd)
- Trips with the same start and end stations, suggesting potential data collection errors

Not ideal, but also fairly common in a typical data analysis project.
It’s not unusual for data cleaning to take up to 80% of the time in any analysis.
Data from the real world is often messy and it must be cleaned before any analysis or modelling.
While data cleaning is tedious and time-consuming, it also has a significant impact on the final result.
For this analysis, I created a new directory containing the Jupyter notebooks and a data folder containing all the data that I used.
Data Cleaning

The objective of this step is to consolidate the data from multiple sources into a single pandas DataFrame with standardized dates and station information and with the outliers removed.
See this article for an in-depth breakdown of common data cleaning steps that covers a few additional cases.
Bike Share exposes API endpoints containing information about their services.
The station_information endpoint can be used as the source of truth to resolve inconsistencies between the station IDs and names, and also to enhance the data with additional geospatial information of each station.
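As a rough illustration of how such an endpoint could become a lookup table, the sketch below flattens a station_information payload into a DataFrame indexed by station ID. It assumes the response follows the GBFS layout (a data.stations list whose entries carry station_id, name, lat, and lon). The sample payload, its values, and the helper name stations_to_frame are made up for illustration; in practice the dictionary would come from the JSON body returned by the endpoint.

```python
import pandas as pd

# Hypothetical sample shaped like a GBFS station_information response.
# In practice this dict would be parsed from the endpoint's JSON body.
sample_payload = {
    "data": {
        "stations": [
            {"station_id": "7000", "name": "Station A", "lat": 43.64, "lon": -79.39},
            {"station_id": "7001", "name": "Station B", "lat": 43.65, "lon": -79.38},
        ]
    }
}


def stations_to_frame(payload):
    """Flatten the station list into one row per station, indexed by ID."""
    stations = pd.DataFrame(payload["data"]["stations"])
    stations = stations[["station_id", "name", "lat", "lon"]].copy()
    # Cast IDs to integers so they can be compared with the ridership files.
    stations["station_id"] = stations["station_id"].astype(int)
    return stations.set_index("station_id")


stations = stations_to_frame(sample_payload)
```

Indexing by station ID keeps later lookups and merges against the ridership data straightforward.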
Standardization

First, I imported the required libraries and loaded the station data from the Bike Share API endpoint.
Next, I manually identified the date structure used in each file and concatenated the ridership data into a single DataFrame using the identified structure.
Also, the data from Q1 and Q2 are in UTC timezone (4 hours ahead) while the data from Q3 and Q4 are in Eastern timezone.
This was not immediately obvious and I only realized this issue later on when I visualized the data.
Data cleaning is an iterative process, and it’s almost impossible to catch all the issues on the first pass.
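A minimal sketch of this standardization step might look like the following. The per-file settings, the tiny sample frames, and the helper name standardize are all assumptions for illustration: which format and timezone belong to which file has to be checked by hand, as described above.

```python
import pandas as pd

# Hypothetical per-file settings: the strptime format observed in each file
# and whether its timestamps are recorded in UTC.
FILE_SETTINGS = {
    "q1": {"fmt": "%d/%m/%Y %H:%M", "is_utc": True},
    "q3": {"fmt": "%m/%d/%y %H:%M:%S", "is_utc": False},
}

# Tiny made-up samples standing in for the quarterly CSV files.
raw = {
    "q1": pd.DataFrame({"trip_id": [1], "trip_start_time": ["15/02/2017 17:30"]}),
    "q3": pd.DataFrame({"trip_id": [2], "trip_start_time": ["07/15/17 08:05:00"]}),
}


def standardize(df, fmt, is_utc, tz="America/Toronto"):
    """Parse trip_start_time with an explicit format; return local naive times."""
    out = df.copy()
    parsed = pd.to_datetime(out["trip_start_time"], format=fmt)
    if is_utc:
        # Interpret as UTC, convert to Toronto time, then drop the tz info.
        parsed = parsed.dt.tz_localize("UTC").dt.tz_convert(tz).dt.tz_localize(None)
    out["trip_start_time"] = parsed
    return out


trips = pd.concat(
    [standardize(raw[name], **FILE_SETTINGS[name]) for name in raw],
    ignore_index=True,
)
```

Passing an explicit format avoids silent day/month swaps, and a timezone-aware conversion also accounts for the daylight saving switch, which a fixed offset would not.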
Next, I needed to resolve the issues with the station IDs and names.
To improve efficiency, I extracted the unique combinations of station ID and station name.
The stations with IDs can be updated easily from the API data, but the stations without IDs require a slightly more complex solution.
Fuzzy matching is a technique used to identify things that are similar (eg.
vs Yonge Street).
There are many ways to fuzzy match strings in Python and I opted to use the fuzzywuzzy library.
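To show the idea without extra dependencies, the sketch below uses the standard library's difflib.SequenceMatcher as a stand-in for fuzzywuzzy; it is not the code used in this analysis. The station names, the 0.85 threshold, and the helper name best_match are made up for illustration.

```python
from difflib import SequenceMatcher

# Made-up list standing in for the station names returned by the API.
api_names = ["Lake Shore Blvd W / Example Ave", "Bay St / College St", "Union Station"]


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the most similar name, or (None, score)."""
    scored = [
        (SequenceMatcher(None, name.lower(), cand.lower()).ratio(), cand)
        for cand in candidates
    ]
    score, cand = max(scored)
    return (cand if score >= threshold else None), score


# A name spelled differently in the ridership files.
match, score = best_match("Lakeshore Blvd W / Example Ave", api_names)
```

Names that score below the threshold come back as None, so they can be reviewed manually instead of being matched to the wrong station.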
Finally, I concatenated the station data, merged it with the API data to include the stations’ location coordinates, and updated the DataFrame with the correct station IDs and names.
Conversations with the Toronto Parking Authority (TPA), the city division responsible for managing Bike Share, revealed that the station IDs correspond with the physical terminal rather than the location.
If a terminal moves, the API would only show the new coordinates.
Therefore, merging with the API data on the station IDs assigns incorrect start or end locations to some trips.
Outliers

Before removing the outliers, I used the describe() function to see a simple profile of the data within the DataFrame (I excluded all ID columns for a clearer view).
Considering Bike Share’s pricing model, I expected the majority of the trips to have taken less than 30 minutes.
But the data showed a trip duration ranging from 1 to 6,382,030 seconds (~74 days), suggesting outliers in the data.
TPA generally considers trips shorter than a minute to be false trips in its own analysis.
I removed these 29,595 trips, about 2% of the total trips.
A common statistical method of removing outliers from data is by considering the interquartile range (IQR), the middle 50% of the data.
Outliers are commonly defined as the data points that lie more than 1.5*IQR below the first quartile or above the third quartile.
I removed some outliers using this method.
See this article for an excellent step-by-step example.
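The two filters can be sketched as follows, assuming the conventional 1.5*IQR fences around the first and third quartiles. The durations are made up, with the last value mimicking a bike that was never docked properly.

```python
import pandas as pd

# Made-up trip durations in seconds.
trips = pd.DataFrame(
    {"trip_duration_seconds": [30, 420, 600, 660, 720, 780, 900, 1200, 6382030]}
)

# Step 1: drop false trips shorter than one minute.
trips = trips[trips["trip_duration_seconds"] >= 60]

# Step 2: keep trips inside the 1.5 * IQR fences around the quartiles.
q1 = trips["trip_duration_seconds"].quantile(0.25)
q3 = trips["trip_duration_seconds"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = trips[trips["trip_duration_seconds"].between(lower, upper)]
```

Because trip durations are heavily right-skewed, the upper fence does most of the work here; the lower fence often falls below zero and removes nothing.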
Finally, I saved the cleaned data for later use.
Research Question Definitions

Before jumping into the analysis, we should define the questions we want to answer by the end of it (otherwise I often end up creating graph after graph that leads to no particular conclusion).
We took a fairly informal approach for this analysis:

- Brainstormed a variety of questions related to Bike Share and biking in the city
- Identified whether the questions could be answered by the data available or if they required other external datasets
- Removed questions that can’t be answered due to limited access to data (eg. user information is removed for privacy reasons)
- Grouped the remaining questions into major themes

Figure: A sample of the questions identified relating to Bike Share and biking in Toronto

The final major categories we identified were:

- Who is using Bike Share? Are there distinct differences between the behaviours of members and casual users? If so, what are these differences?
- When are people using Bike Share? How does usage change across the year, the week, and the day?
- How is Bike Share being used? Are people mostly using Bike Share as a way to commute to work or to explore the city? How does the weather change the way people use Bike Share?

Analysis and Visualization

Personally, I prefer to create a new Jupyter notebook for analysis only.
In the new notebook, I first imported the libraries and the cleaned data, then created new pandas Categorical datatypes for the day of the week (Monday, Tuesday, etc.) and month names to ensure a fixed sorting order (useful when visualizing the data).
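A small sketch of the ordered categorical idea, using made-up dates. Without the ordered dtype, weekday names would sort alphabetically (Friday, Monday, Saturday, ...) rather than in calendar order.

```python
import calendar

import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories: Monday..Sunday and January..December.
weekday_type = CategoricalDtype(categories=list(calendar.day_name), ordered=True)
month_type = CategoricalDtype(categories=list(calendar.month_name)[1:], ordered=True)

# Made-up dates for illustration.
dates = pd.Series(pd.to_datetime(["2017-07-05", "2017-01-02", "2017-03-12"]))
weekdays = dates.dt.day_name().astype(weekday_type)
months = dates.dt.month_name().astype(month_type)

# Sorting now follows calendar order instead of alphabetical order.
sorted_weekdays = weekdays.sort_values().tolist()
sorted_months = months.sort_values().tolist()
```

Plotting libraries that respect categorical order will then lay out the axes from Monday to Sunday and January to December.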
Next, I transformed the data to make analysis and visualizations easier later on.
These transformations include:

- Renamed fields for clarity
- Parsed out quarter, month, date, day of the week, and hour from the trip start time
- Generated a new route_id for each trip in the format of “start station ID-end station ID” to identify the route for the trips

Next, I also calculated the distance between the start and end stations.
This added another dimension that may be helpful later.
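The route identifier and the station-to-station distance can be sketched as below. The article does not say which distance measure was used, so the great-circle (haversine) distance here is an assumption, and the station IDs and coordinates are made up. A straight-line distance will also understate the distance actually cycled.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


def make_route_id(from_station_id, to_station_id):
    """Build the 'start station ID-end station ID' route identifier."""
    return f"{from_station_id}-{to_station_id}"


# Made-up station IDs and coordinates for illustration.
route = make_route_id(7000, 7001)
distance = haversine_km(43.64, -79.39, 43.65, -79.38)
```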
Who is using Bike Share?

A common method to answer questions of “who” is to examine demographics.
Initially, it seemed promising to estimate demographics based on the station locations.
For example, we might expect stations located in neighbourhoods with a younger population to be more popular.
However, the majority of the stations are located near subway/streetcar lines and are in the downtown areas.
Therefore the results could be skewed towards a specific demographic different from the real representation of the users.
Feel free to validate this using the Neighborhood Profiles dataset.
Since I couldn’t reliably infer any user information, I only had the user type (member vs.
casual) to work with.
I drilled down on the user types and visualized the distribution of trip duration and distance.
The graph shows a distinct difference in behaviour between members and casual users: while both groups generally travel similar distances, members reach their destinations significantly faster.
I further broke down the differences in behaviour between these user types by looking at when and how they use Bike Share in the next sections.
When is Bike Share being used?

I aggregated the data by the user type and trip date and counted the number of trips per day.
Then, I visualized the ridership trends across 2017.
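A minimal sketch of this aggregation, using a handful of made-up trips. The column names are assumptions based on the fields described earlier, not the renamed fields used in the actual notebook.

```python
import pandas as pd

# Made-up trips for illustration.
trips = pd.DataFrame(
    {
        "user_type": ["Member", "Member", "Casual", "Member", "Casual"],
        "trip_start_time": pd.to_datetime(
            [
                "2017-07-01 08:00",
                "2017-07-01 17:30",
                "2017-07-01 13:00",
                "2017-07-02 08:10",
                "2017-07-02 14:45",
            ]
        ),
    }
)

# Truncate timestamps to the day, then count trips per user type and date.
trips["date"] = trips["trip_start_time"].dt.normalize()
daily = (
    trips.groupby(["user_type", "date"])
    .size()
    .rename("trips")
    .reset_index()
)

# One column per user type, ready for a line plot of ridership over time.
daily_wide = daily.pivot(index="date", columns="user_type", values="trips")
```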
While the graph showed some expected trends (eg. more ridership in warmer months), some less expected patterns also emerged:

- A shift in the peak season between members and casual users
- A consistent rise and fall on a small time scale

I zoomed in and visualized only a few months of the data to get a better view on a more granular level.
While the trend is more distinct for July, the graphs showed that ridership rises and falls on a weekly cycle.
To further tease out this cycle, I visualized the average daily trips for each weekday, separating the data by quarter and user type.
These graphs further tell the story of how different user types cycle: members primarily bike on weekdays while casual users are mostly biking during the weekends.
The only exception was in the third quarter where there was a significant increase in casual trips on Wednesdays.
A deeper dive into the data from those months shows that this actually only occurred in July.
A quick Google search showed that this was because Bike Share offered free services on Wednesdays in July 2017.
Finally, I also wanted to take a look at the ridership trend across the day by the hour.
Again, this graph shows distinct differences in behaviours between members and casual users: members experience rush hours around 8 AM and 5 PM while casual users are much more consistent throughout the day.
This, combined with the previous finding that members primarily use Bike Share on weekdays, suggests that members usually use Bike Share to commute while casual users use it for leisure trips.
How is Bike Share being used?

The analysis in the previous sections already answered a few of the questions I had on how Bike Share is being used.
Instead, I decided to focus on the relationship between weather and ridership in this section.
I took the daily temperature and precipitation data recorded at the Toronto City station from the Government of Canada’s weather station data.
First, I merged the ridership data with the weather data and created a pairplot to understand the relationship between the variables on a high level.
In addition, I created a heatmap to visualize the Pearson correlation coefficient between the variables.
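The correlation matrix behind such a heatmap can be computed directly with pandas. The sketch below uses made-up daily values and assumed column names, so the numbers it produces say nothing about the actual dataset.

```python
import pandas as pd

# Made-up daily values for illustration; column names are assumptions.
daily = pd.DataFrame(
    {
        "mean_temp_c": [-5.0, 2.0, 8.0, 15.0, 21.0, 26.0],
        "total_precip_mm": [0.0, 4.2, 0.0, 11.5, 0.4, 0.0],
        "trips": [300, 900, 1800, 3900, 6100, 7400],
    }
)

# Pairwise Pearson correlation coefficients between all numeric columns.
corr = daily.corr(method="pearson")
temp_vs_trips = corr.loc["mean_temp_c", "trips"]
```

The resulting square matrix is symmetric with ones on the diagonal, and it can be handed to a plotting function to draw the heatmap.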
While I was initially surprised at the low correlation between precipitation and ridership, the result makes sense.
The data only provides the total precipitation for each day at the specific weather station location.
However, it rarely rains a consistent amount throughout the entire day and evenly across all Bike Share station locations.
Since ridership varies depending on the hour, I would need the precipitation data on an hourly level and from multiple weather stations to understand the relationship between these variables.
Feel free to explore this topic using the rain gauge data. For the rest of this analysis, I focused on the temperature instead.
I visualized the ridership and temperature across time as a dual axis plot for members and casual users.
The fact that people are biking more in the warmer months is a bit of a “duh”, so I wanted to go a step further and find the temperature that is the turning point for people to decide whether they will bike or not.
I decided to only look at the casual users since they showed a higher correlation with temperature, but the same idea would apply for both user types.
I had some difficulties when trying to define this point.
Intuitively, it seems like there must be a way to determine this mathematically; however, many industries determine this point simply based on empirical observations over time.
In the end, a colleague of mine found this article which suggests a general approach based on the curvature of the plot.
It turns out there is already a Python library that implements this approach: kneed.
The kneed library requires the data to be curve fitted before determining the point.
Given the data, I fitted three functions (linear, exponential without a shift along the Y-axis, and exponential with a shift) and then evaluated the results.
Only the exponential growth without shift led to a curve where a knee point can be determined (the other 2 curves were too linear).
And the result showed that this point is at 16.
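To make the idea concrete, here is a simplified, numpy-only sketch. It is not the kneed library and not the code behind the result above: it fits an exponential without a Y-axis shift through a straight-line fit of log(trips), then applies a stripped-down version of the same intuition (rescale both axes and find where the curve is furthest from the diagonal). The data is synthetic and noise-free, and the parameters are made up.

```python
import numpy as np

# Synthetic, noise-free data: trips = a * exp(b * temperature), made-up a and b.
temperature = np.linspace(0.0, 30.0, 301)
true_a, true_b = 100.0, 0.12
daily_trips = true_a * np.exp(true_b * temperature)

# Fit an exponential without a Y-axis shift via a linear fit of log(trips).
b_hat, log_a_hat = np.polyfit(temperature, np.log(daily_trips), 1)
fitted = np.exp(log_a_hat + b_hat * temperature)

# Simplified knee search for a convex, increasing curve: rescale both axes to
# [0, 1] and pick the point where the curve falls furthest below the diagonal.
x_norm = (temperature - temperature.min()) / (temperature.max() - temperature.min())
y_norm = (fitted - fitted.min()) / (fitted.max() - fitted.min())
knee_temperature = temperature[np.argmax(x_norm - y_norm)]
```

The location of the knee depends on the temperature range included in the fit, so results from this kind of method are best read as a rough guide rather than a precise threshold.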
I can double check this value by using percentage cumulative trips (ie. calculating that x percent of the trips took place under y ºC), where the curve fit function would be a logistic growth with a maximum limit of 100%.
This method showed a point of 16.
Using a similar concept, I wanted to find the point where ridership stops growing with increases in temperature.
Intuitively, it would make sense for users to stop caring about the temperature once it’s “warm enough” (eg. the difference between 22 ºC and 23 ºC would not be the determining factor for a user to decide if they will bike).
To find this point, I simply applied the same calculation to the section of the curve greater than the initial point, and I found that this second point is a 21.
I would be interested in using this concept on data from a warmer city where the average daily temperature might reach 30+ ºC, and see at what temperature the heat inhibits ridership.
Conclusion and Next Steps

While this dataset itself is fairly simple, containing only a few basic columns of data, I was able to determine a number of small insights by merging the data with other public datasets.
During our conversation with TPA, they mentioned that despite their efforts in presenting Bike Share as the method to travel the “first and last mile”, the general public often sees Bike Share as a program for promoting tourism in Toronto.
The behaviours identified in this analysis show that the majority of Bike Share users are, in fact, commuters who bike on a regular basis.
While the user information is not publicly available to enable user behaviour analysis, it could be worthwhile to further dive into the geospatial aspects of the Bike Share program.
For example, linking the number of TTC/GO stations (or other public centers) to the popularity of Bike Share stations, or further considering the routes and availability of bike paths between stations.
I hope that this data story provides an introduction to data analysis using Python and identifies some tools that you might want to consider when doing your analysis.
If you have any questions, feel free to contact the Open Data team at opendata@toronto.ca, and I encourage you to perform your own analysis using Open Data.