Analyzing Modes of Transportation in New York City
Rehan Rasool
May 10

This work was done by Campbell Weaver, Yuzhou Wu and Rehan Rasool as part of a Data Science course at Cornell Tech.

Introduction
Forecasting the number of riders at a given New York City subway station is a key challenge for the Metropolitan Transportation Authority (MTA).
We aim to answer the following question: given a day of the week, a station location and additional data (e.g. weather, special events), how many riders can the MTA expect?
The purpose of this analysis and prediction is to prepare the MTA for unforeseen circumstances.
For example, during the recently held cherry-blossom festival on Roosevelt Island, the subway stop became heavily congested, which put many people in danger of suffocation or a stampede.
Using our machine learning model, the MTA would be able to predict the number of people expected at each subway stop under different weather conditions, accounting for special events.
This would allow them to make preemptive arrangements to avoid any dangerous situations.
This work focuses on data from 2018.

Analysis

Data Selection
Subway data was obtained from MTA Turnstile Data.
Although the raw data was more complex than we needed, thorough data cleaning made it usable for our analysis.
The data consists of entry and exit counters for all NYC subway station turnstiles, recorded every 4 hours.
We used this dataset to analyze the impact of weather and special events on subway usage.
Dataset size: 10 million rows, 11 columns

Location data was obtained from the Google Maps API.
It was used to look up the latitude and longitude of each station in the MTA data, which contains only station names and not their location coordinates.

CitiBike data was obtained from CitiBike Trip Histories.
The dataset contains information on each bike ride during 2018, including trip duration; start and end time, date and station; station location (lat/long); and some user information.
We used this dataset to analyze the impact of weather and special events on bike ridership.
Dataset size: 11 million rows, 7 columns

Weather data was obtained from the National Centers for Environmental Information.
The dataset consists of climate data from different stations in New York City in 2018, including station location (lat/long) and various climate parameters such as temperature, precipitation and snow amount.
Dataset size: 28,000 rows, 33 columns

Events data: Special events that draw large crowds (and thus lead to subway ridership spikes) come in many different forms.
As such, it is extremely difficult to find an existing dataset that contains historical information on all of these events.
In our analysis of the effect of events on transit data, we focus on the relationship between ridership at the 161st St. Yankee Stadium station and the Yankees’ home schedule.
To get information on the Yankees’ schedule, we used an open-source Python wrapper to access MLB’s API.

Data Cleaning
Subway data was cleaned using the following steps:
1. Aggregate entry and exit data per turnstile per day
2. Aggregate entry and exit data per station per day
3. Remove outliers with a z-score > 3
This resulted in a cleaned dataset of 136,000 rows, a huge reduction from the 10 million rows we started with.
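The three subway cleaning steps can be sketched in pandas roughly as follows. This is a minimal illustration under stated assumptions, not the code used for this project: the column names (`station`, `turnstile`, `date`, `entries`, `exits`) are hypothetical, the readings are assumed to already be per-interval counts rather than raw counter values, and the z-score is assumed to be computed per station on daily entries.

```python
import pandas as pd


def clean_subway(df: pd.DataFrame, z_max: float = 3.0) -> pd.DataFrame:
    """Aggregate per-interval turnstile counts to station-days and drop outliers.

    Column names are hypothetical; adapt them to the real turnstile file.
    """
    # Step 1: total entries and exits per turnstile per day.
    per_turnstile = (
        df.groupby(["station", "turnstile", "date"], as_index=False)[["entries", "exits"]]
        .sum()
    )
    # Step 2: total entries and exits per station per day.
    per_station = (
        per_turnstile.groupby(["station", "date"], as_index=False)[["entries", "exits"]]
        .sum()
    )
    # Step 3: z-score of daily entries within each station (assumed grouping).
    grouped = per_station.groupby("station")["entries"]
    z = (per_station["entries"] - grouped.transform("mean")) / grouped.transform("std")
    # Drop rows with |z| above the threshold; rows with an undefined z-score are kept.
    return per_station[~(z.abs() > z_max)].reset_index(drop=True)
```

Computing the z-score within each station, rather than across all stations, keeps busy hubs from being flagged as outliers simply because they are busy; whether the original analysis did this is not stated.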
Location data was used to fill in each station’s coordinates (latitude/longitude) using the station name.
CitiBike data is quite large because each row represents a single ride.
To save time and space, we went through the data once and collected each station’s data.
We stored station names, IDs and locations in dictionaries for later use.
Thereafter, we compressed the data by counting daily rides and averaging ride duration.
The result is bike usage data per station per day.
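As a rough sketch of this compression step, the following pandas code builds a station lookup dictionary and collapses ride-level rows into one row per station per day. The column names (`start_station_id`, `start_station_name`, `start_lat`, `start_lon`, `start_time`, `trip_duration`) are assumptions made for illustration and may not match the actual trip-history files.

```python
import pandas as pd


def station_lookup(trips: pd.DataFrame) -> dict:
    """Map each station id to its name and coordinates (hypothetical columns)."""
    stations = trips.drop_duplicates("start_station_id")
    return stations.set_index("start_station_id")[
        ["start_station_name", "start_lat", "start_lon"]
    ].to_dict("index")


def daily_bike_usage(trips: pd.DataFrame) -> pd.DataFrame:
    """Collapse one-row-per-ride data into one row per station per day."""
    # Derive the calendar date of each ride from its start timestamp.
    trips = trips.assign(date=pd.to_datetime(trips["start_time"]).dt.date)
    return (
        trips.groupby(["start_station_id", "date"])
        .agg(
            rides=("trip_duration", "size"),          # number of rides that day
            mean_duration=("trip_duration", "mean"),  # average ride duration
        )
        .reset_index()
    )
```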
Weather data was cleaned using the following steps:
1. Extract NYC data only
2. Fill in the TAVG column
3. Fill in missing values with 0
4. Lowercase column names to stay consistent
This reduced the dataset from 28,000 rows to 365 rows, one for each day of the year.
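A possible pandas version of these weather cleaning steps is sketched below. Several details are assumptions rather than facts from this write-up: the column names (`NAME`, `TAVG`, `TMAX`, `TMIN`), the idea that "NYC data only" means keeping a single station passed in as `station_name`, and the rule of filling a missing `TAVG` with the midpoint of `TMAX` and `TMIN`.

```python
import pandas as pd


def clean_weather(raw: pd.DataFrame, station_name: str) -> pd.DataFrame:
    """Reduce multi-station climate data to one cleaned row per day."""
    # Keep a single station so that each day appears once (assumed filter).
    df = raw[raw["NAME"] == station_name].copy()
    # Assumed fill rule: where TAVG is missing, use the midpoint of TMAX and TMIN.
    df["TAVG"] = df["TAVG"].fillna((df["TMAX"] + df["TMIN"]) / 2)
    # Treat any remaining missing values as 0.
    df = df.fillna(0)
    # Lowercase column names for consistency with the other datasets.
    df.columns = df.columns.str.lower()
    return df.reset_index(drop=True)
```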

Data Analysis

Does rain impact subway usage?
Figure: Distribution of subway usage for rainy vs non-rainy days
Null Hypothesis (H0): There is no difference between subway usage on rainy and non-rainy days.
Alternate Hypothesis (HA): There is a difference between subway usage on rainy and non-rainy days.
Each category shows two separate peaks, which we guessed to correspond to weekends vs weekdays.
We confirmed this using the following plot.
The rainy and non-rainy distributions look different, but we cannot tell by eye whether that difference is due to chance or is statistically significant.
Therefore, we performed a non-parametric Mann-Whitney U test on this data, as the samples are independent and not normally distributed.
The p-value came out to be 0.0218, which was less than our critical p-value.
Therefore, we can reject the null hypothesis and conclude that subway usage differs between rainy and non-rainy days.
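For readers who want to run the same kind of test, here is a minimal sketch using `scipy.stats.mannwhitneyu`. The two samples are synthetic stand-ins, not the ridership data analyzed here, and the significance level of 0.05 is an assumed, conventional choice rather than a value taken from this analysis.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def rain_effect_test(rainy, dry, alpha=0.05):
    """Two-sided Mann-Whitney U test; returns (p_value, reject_null)."""
    _, p_value = mannwhitneyu(rainy, dry, alternative="two-sided")
    return p_value, p_value < alpha


# Synthetic daily ridership values, used only to make the example runnable.
rng = np.random.default_rng(0)
rainy_days = rng.normal(loc=9_000, scale=1_500, size=100)
dry_days = rng.normal(loc=10_000, scale=1_500, size=250)

p_value, reject_null = rain_effect_test(rainy_days, dry_days)
print(f"p-value = {p_value:.4g}, reject H0: {reject_null}")
```

The test compares ranks rather than means, which is why it does not require the two samples to be normally distributed.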
We illustrate this difference by plotting a subway usage heatmap for a rainy day and a non-rainy day.

Does rain impact CitiBike usage?
This was an easier question, as intuitively people are less likely to use bikes on a rainy day.
We confirmed this by plotting a CitiBike usage heatmap for a rainy day and a non-rainy day.

Do subway and CitiBike ridership change by month?
To answer this question, we plotted subway and CitiBike usage on a monthly basis.
Subway ridership appears to be lowest during the hot months, possibly because people avoid the subway on those days and prefer to bike or walk instead.

How is subway usage impacted by special events?
It is intuitive that ridership would increase at subway stations near an event venue on the day of a popular event, such as a sports game or a concert.
In many instances this effect is extremely hard to quantify, as numerous subway stations can service the same venue, thus spreading the effect across different transit hubs.
Furthermore, many one-off events, such as concerts, produce only a single day’s worth of data and repeat very infrequently, leaving us with small sample sizes on which to build our intuitions.
Finally, accumulating historical data on the many different types of events and venues across New York City is a daunting task.
We chose to analyze the relationship between ridership and events by focusing on the effect of Yankees home games on subway ridership at 161st St. Yankee Stadium (the only convenient subway stop near Yankee Stadium).
Plotting a year’s worth of ridership data from the station makes a couple of things immediately evident.
First, there is a regular periodicity where ridership is higher on weekdays (around 10,000 per day) and lower on weekends (around 5,000).
Second, starting around April 2nd, the day of the Yankees’ home opener, daily ridership becomes interspersed with regular spikes above 15,000 riders.
This continues until the Yankees’ season ends in October, at which point ridership returns to previous levels.
Separating the station data into days with a Yankees home game and days without one shows an immediate difference.
When there is a home game, the average daily ridership is 18,000; when there is not, it drops to 10,000.
Comparing the distribution of ridership at the station on home game days with the distribution on days without a game, the two appear to come from distinct distributions.
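The game-day split described above can be sketched as a simple group-and-average in pandas. The column names (`date`, `entries`) and the `home_game_dates` argument are hypothetical, and the numbers in the usage example are made up to show the mechanics rather than to reproduce the figures reported here.

```python
import pandas as pd


def ridership_by_game_day(daily: pd.DataFrame, home_game_dates) -> pd.Series:
    """Mean daily ridership at one station, split by home game vs no game."""
    game_days = pd.to_datetime(list(home_game_dates))
    is_game_day = pd.to_datetime(daily["date"]).isin(game_days)
    # Label each day so the result is indexed by readable category names.
    day_type = is_game_day.map({True: "home game", False: "no game"}).rename("day_type")
    return daily.groupby(day_type)["entries"].mean()


# Made-up station data for four days, two of which are home game days.
station_days = pd.DataFrame({
    "date": ["2018-04-01", "2018-04-02", "2018-04-03", "2018-04-04"],
    "entries": [10_000, 18_000, 9_000, 20_000],
})
means = ridership_by_game_day(station_days, ["2018-04-02", "2018-04-04"])
print(means)
```

The same two groups could also be passed to a Mann-Whitney U test to check whether the difference is statistically significant.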
Similarly, we analyzed the Roosevelt Island subway traffic over 2018.
There is a clear peak in the subway usage on April 21, 2018, which corresponds to the Roosevelt Island cherry-blossom festival.

Prediction Model
We used two machine learning models to predict station ridership based on location and weather information:
- Linear Regression
- Random Forest
We used the following metrics to evaluate performance:
- Mean Absolute Error (MAE): the mean of the absolute values of the errors
- Root Mean Squared Error (RMSE): the square root of the mean of the squared errors
Table: Performance of different models
From the table we can see that the Random Forest model performs better than the Linear Regression model.
Because the influence of station location on ridership cannot be expressed linearly, it is difficult for a linear model to predict it correctly.
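To make the comparison concrete, the sketch below fits both model types with scikit-learn and reports MAE and RMSE on a held-out split. It uses synthetic data in which ridership falls off non-linearly with distance from a made-up centre point, so it illustrates why a linear model can struggle with location features; it does not reproduce the results in the table above, and the feature set and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a model and return (MAE, RMSE) on the held-out set."""
    pred = model.fit(X_train, y_train).predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    return mae, rmse


# Synthetic stand-in data, NOT the dataset from this analysis.
rng = np.random.default_rng(42)
n = 2_000
lat = rng.uniform(40.6, 40.9, n)
lon = rng.uniform(-74.05, -73.75, n)
tavg = rng.uniform(20, 90, n)
dist = np.hypot(lat - 40.75, lon + 73.98)  # distance from a made-up centre
y = 20_000 * np.exp(-((dist / 0.05) ** 2)) + 20 * tavg + rng.normal(0, 300, n)
X = np.column_stack([lat, lon, tavg])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scores = {
    "Linear Regression": evaluate(LinearRegression(), X_train, X_test, y_train, y_test),
    "Random Forest": evaluate(
        RandomForestRegressor(n_estimators=100, random_state=0),
        X_train, X_test, y_train, y_test,
    ),
}
for name, (mae, rmse) in scores.items():
    print(f"{name}: MAE={mae:.0f}, RMSE={rmse:.0f}")
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a model's mistakes are uniformly small or dominated by a few large misses.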
To visualize these predictions, we created heatmaps of the predicted values compared to the ground truth.
We can clearly see that the Random Forest model outperformed the Linear Regression model.
We also examined the importance of different features for ridership.
Longitude and latitude have the greatest impact on daily ridership.
This makes sense, as different neighborhoods have different populations, which is the most important factor in public transportation usage.
The average temperature reflects both the general weather conditions and the season of the year.
Therefore, it is also an important feature.
Transportation usage differs between weekdays and weekends, which is captured by the day-of-week feature.
Precipitation also influences ridership, but compared to other factors, it has less impact.
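One common way to obtain such a ranking is the impurity-based `feature_importances_` attribute of a fitted scikit-learn random forest; whether that is the exact method used here is an assumption. The sketch below uses hypothetical feature names and a made-up target, so the printed ranking only demonstrates the mechanics and says nothing about real ridership.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names, in the same order as the columns of X.
feature_names = ["latitude", "longitude", "tavg", "day_of_week", "prcp"]

rng = np.random.default_rng(0)
n = 1_000
X = np.column_stack([
    rng.uniform(40.6, 40.9, n),     # latitude
    rng.uniform(-74.05, -73.75, n), # longitude
    rng.uniform(20, 90, n),         # average temperature
    rng.integers(0, 7, n),          # day of week
    rng.exponential(0.1, n),        # precipitation
])
# Made-up target that depends mostly on latitude and a little on temperature.
y = 50_000 * np.abs(X[:, 0] - 40.75) + 30 * X[:, 2] + rng.normal(0, 100, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Pair each feature with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances sum to 1 across features and can favor continuous or high-cardinality features, so they are best read as a rough ranking rather than a precise measure of effect.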

Further Improvements
The following steps could be taken to improve this work:
- Use more advanced machine learning models with richer features; for example, event information could be included as a feature to improve prediction results
- Include taxi and Uber/Lyft data to analyze how they are affected by weather and events
- Use data from multiple years to capture year-to-year trends