Predicting Bike-share users: Machine Learning + Data Visualization project (R)BackgroundThis started as an independent class-project I worked on for a Machine Learning course (EE 660) at the University of Southern California. I have thoroughly updated it as my understanding of Machine Learning (ML) concepts and practices have matured, and have transferred the code from Matlab to R in the process. I’m always looking to learn and improve my outcomes, and am open to any constructive feedback on my methods. The R script, the data-set and images I have used are all publicly acessible on my Github. If you would like the working code in Matlab, please email me. Without further ado, let’s kick into it.Defining the Problem and Project GoalA bicycle-sharing system is a service in which users can rent/use bicycles available for shared use on a short term basis for a price or free. Currently, there are over 500 bike-sharing programs around the world. Such systems usually aim to reduce congestion, noise, and air pollution by providing free/affordable access to bicycles for short-distance trips in an urban area as opposed to motorized vehicles. The number of users on any given day can vary greatly for such systems. The ability to predict the number of hourly users can allow the entities (businesses/governments) that oversee these systems to manage them in a more efficient and cost-effective manner. Our goal is to use and optimize Machine Learning models that effectively predict the number of ride-sharing bikes that will be used in any given 1 hour time-period, using available information about that time/day.Bike-share program bicycles in Washington DCData-set usedThe data-set we are using is from University of California Irvine’s Machine Learning Repository. The data-set compilers used information partially from a two-year historical log corresponding to the years 2011 and 2012 from Capital Bikeshare system, Washington D.C. This information is available publicly. The compilers aggregated the data on an hourly basis and daily basis (for anyone interested in digging further). Next they extracted and added all corresponding weather and seasonal information from here.Our data-set is a csv file (available on my Github) with information from 17,379 hours over 731 days with 16 features (information categories) for each hour. The features are:Record indexDateSeason (1:spring, 2:summer, 3:fall, 4:winter)Year (0: 2011, 1:2012)Month (1 to 12)Hour (0 to 23)Holiday : whether that day is holiday or notWeekday : day of the weekWorking-day : if day is neither weekend nor holiday , value is 1. Otherwise 0Weather situation : — 1: Clear, Few clouds, Partly cloudy, Partly cloudy — 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist — 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds — 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + FogNormalized temperature in Celsius. Values are divided to 41 (max)Normalized feeling temperature in Celsius. Values are divided to 50 (max)Normalized humidity. The values are divided to 100 (max)Normalized wind speed. The values are divided to 67 (max)Count of casual usersCount of registered usersCount of total rental bikes including both casual and registeredFrom an initial look, the data-points far exceed the number of features, which makes this a “skinny” data-set, considered ideal for ML.Exploratory data analysisBefore starting to process a data-set with algorithms, it’s always a good idea to explore it visually. We are going to use R for this project. Using the ggplot2 and ggextra packages, we can quickly make some plots to investigate how the bicycle usage count is affected by the features available. Now let’s look at some graphs.Scatter plot- Adjusted Temperature vs UsageScatter plot- Temperature vs UsageAs seen from the scatter plots above, there is a positive correlation between both temperature-to-usage and adjusted-temperature-to-usage for most of the temperature range, and a linear fit isn’t far from the best-fit curve. This should intuitively make sense, as people are not likely to bike outside in cold weather. For the maximum temperatures, which seem to be a small subset of the data, there is a dip in this curve. Once again, this intuitively makes sense as users may also be discouraged to bike when it’s too hot outside.Scatter plot- Humidity vs UsageThere seems to be a negative correlation between the humidity and the usage rate, with a linear fit being very close to the best curve fit for all of the data (excluding some outliers with very low humidity). This could be theoretically explained by the climate of Washington DC, which is notoriously humid. Higher humidity is correlated with higher chances of rainfall, and we could also hypothesize that added perspiration due to high humidity works as a deterrent to riding a bicycle outdoors. Based on the visualizations so far, it wouldn’t be unreasonable for us to hypothesize that the weather-situation will affect the bike usage with rain deterring usage. Our next graph, plotted below, supports this hypothesis, to an extent. Note from the histogram on the x-axis that there are many more clear days (weather sit 1) than overcast or rainy days (weather sit 2 or 3 respectively).Scatter plot- Weather Situation vs UsageScatter plot- Wind Speed vs UsageLooking at the wind speed data, however, doesn’t give us a clear interpretation of how it affects usage. The correlation between the two factors is weak at best. Below is a correlation-matrix of all the continuous variables discussed so far which adds some numbers to the trends we have observed.Note: Interestingly, the casual usage count is more correlated with the continuous variables and aligns better with our previous hypotheses. It makes sense if we think about it, as registered users who use bikes to commute to work are less likely to be deterred by uncomfortable weather conditions. We may conclude that it makes more sense to predict these two counts separately and add up the tallies to find the total count. However, when I tried that, I found the final predictions to be less accurate than what we get if we simply predict the overall count. So for the rest of the project, we will ignore the Registered and Casual count and work with just the Total count as our output. If you want to, you can access the data-set and my code on my Github and try including these variables to see if you can find better results. On to the next point. Let’s take a look at how time and date affects usage.Looking at the effects of time, it seems that the lowest usage is late at night (with the minimum between 4–5 am) and the peaks are during 8–9 am and 5–7 pm, unsurprisingly the DC times for rush hour. The fit is far from linear. However, with some simple data manipulation (more on this in the next section), we can change this to represent the usage rate based on the temporal distance to 4 am, and a find a somewhat linear fit (see below). Note: having features that linearly predict the outcome is ideal as it reduces the need for complex non-linear ML algorithms.A somewhat similar trend can also be observed in the month vs usage plot (below), with an evidently higher usage rate during the warmer months of the summer and the lowest usage during January. With some manipulation similar to the previous plot, this data can also used to represent usage based on the temporal distance to the month of January. This correlation, however, is not as strong as that for the manipulated time graph.Finally, looking at the “Year” variable (below), it can be seen that the usage rate goes up from Year 1 to Year 2, which could suggest that the system grew in popularity. One important thing to note when using this variable is that, while it may be useful in the scope of making predictions in the year range provided (2011–12), the algorithm would have to extrapolate significantly to predict how it affects future dates (2018 and onward), which could make this variable a less reliable predictor for a different time period.Pre-processing: Data-cleaning & Feature EngineeringFor any ML project, preprocessing the data is a crucial step. The process is usually an accumulation of widely-followed pre-processing practices, and case-specific tweaks made to the data based on judgement-calls from whoever is designing the models. If data-wrangling is done incorrectly or inadequately, it can lead to ill-trained ML algorithms that provide inaccurate or (at best) sub-optimal results. The age-old saying, “Garbage in, Garbage out” applies here. An important part of this process is Feature Engineering, which involves turning available data-features into more useful variables that can help us predict our outcome. Let’s walk through the steps we are taking:Using prior knowledge to remove features that don’t add important information, which in this case is only the “index” feature.Extracting the week number from the date. The date itself, in the format provided, isn’t something MATLAB can process in our algorithms. From this date, however, we extract the week number (for that particular year) and use that variable as a predictor for the count.One hot encoding, which is the process of splitting up non-binary Categorical features (Month, Week-number, Hour, Weather-situation, Season, Weekday) into multiple binary sub-features where each sub-feature indicates whether a certain category for the original feature is True or Not (1 or 0). If possible to do so without making the data-set “fat”, it is best to split each multi-classification feature up into multiple binary features. Since we have a massive data-set with 17,000+ data points, we can expand these features and still have a “skinny” data-set so that we don’t risk over-fitting. Note that this does significantly increase the dimensionality of our data-set and increases our computational time by a factor of greater than 5X. Since we are using only non-complex regression algorithms in this project, we can afford to add computational complexity for better accuracy in our predictions.Modifying cyclic variable values to represent temporal distance from a single time-point. As we saw in the data-visualization section, for cyclic variables (Month/Week-number/Hour), trying to find accurate linear patterns across the changing values can be difficult. The best linear model fit across weeks 1:53 may have a very different value for week 1 in comparison to week 53, despite the data points in week 1 being temporally very close to those in week 53. We fix this by changing the values to represent the temporal distance from a fixed time-point instead. Based on what we discovered from our Exploratory Data Analysis section, it is reasonable to set the base time-points at the minimum-usage times. Therefore the temporal distances we use are calculated from 4 am for Hours, and from mid-January for the Weeks and Months.Normalizing the data-set. Without normalization, features with larger values could have a disproportionately large impact on our training algorithm. We will notice all the features in our data-set except for the cyclic variables discussed in point 4 are already normalized to be between 0 and 1. So we normalize these 3 features with the min-max normalizer where z= (x-min(x))/(max(x)-min(x))We end up with 114 features to predict the hourly bike-usage from 17,379 data points.Applying Machine Learning algorithmsFor this project, we are going to use two of the more well-known Regression algorithms: Maximum Likelihood Estimate and Maximum A Posteriori. For linear regression, we are interested in finding the best parameters/weights, w, such that given our features, X, our predicted outcome, Y_predict = X * w, is as close as possible to the real outcome, Y_test. Packages in R and Python allow us to conveniently apply ML algorithms with two or three lines of code. While that approach has it’s advantages, I think it’s also important to understand the fundamental statistics and probability concepts behind our ML algorithms to understand our results better. Unlike more complex models like Neural Networks, the algorithms we will use are are relatively much easier to understand. So I will very briefly go over what each of them are doing and if you want, you can go over my commented code on Github to follow exactly what we are doing step-by-step.1) Maximum Likelihood EstimateMaximum Likelihood Estimation (MLE) is used to estimate some variable in the setting of probability distributions. Let’s say we have a Likelihood function, P(D|w). This is the likelihood of having our overall data-set, D, given a certain w that fits the data-set features, X, to the data-set outcomes, y. When we perform MLE for w, the parameters we are trying to infer are:w_MLE = argmax w P(D|w)In other words, we want to find the w that maximizes the likelihood, P(D|w). From here, we can do a little bit of Linear Algebra, come up with a cost/loss function that we need in order to calculate the best weights, and then minimize this function using derivatives (recall Calculus) to find the best weights. We will find that the best parameters/weights can be found using something called the Ordinary Least Squares (OLS) method summarized in the formula below:Ordinary Least Squares formulaUsing OLS, we find the best parameters on a subset of our data called the training-set. Then we test these parameters on a different and separated subset of the data called the test-set in order to see how our prediction values, yMLE, compare to the real outputs, yTest.I have intentionally sped through the mathematical steps here, because: a) there are free online resources that explain each of these methods in depth and better than I can, and b) we are ultimately more concerned about the applications of these algorithms than the math behind how they work. Note that you will require a solid understanding of Calculus, Linear Algebra and Probability Distributions to understand either MLE or MAP on a deep level.2) Maximum A PosterioriMaximum A Posteriori (MAP) is a product of Bayesian Statistics, a field of statistics based on Baye’s theorem:Unlike the Frequentist Inference statistics that we have been dealing with so far for MLE, this view assumes that we have some useful prior knowledge about the distributions. As the name suggests, this works on a posterior distribution, not only the Likelihood. From the formula above, we can derive the posterior distribution to be defined by the formula:P(w|D) = ( P(D|w)*P(w) ) / P(D)Assuming that P(D) or the distribution of our data-set stays constant, we can further conclude that:P(w|D) ∝ P(D|w)*P(w)The first part, P(D|w), is the Likelihood term we were dealing with before. And we assume that P(w) follows a Gaussian distribution such that:weights/parameters are assumed to follow a Gaussian distributionSince we don’t know the m_w or the Tau-squared terms, we try a thousand different combinations of the values for these terms using 2 for-loops to test our algorithm on our validation-sets, sets best defined as subsets of our training-set that are used for fine-tuning parameters/weights before we test the parameters on the real test-set. Note that we are in effect creating a regularization term, defined by λ= m_w/ Tau-squared, and testing out what values of λ give us the best performance. Following the Linear Algebra, Calculus and Probability steps we outlined for MLE, the w for each training sub-set is calculated by the formula:After fine-tuning the parameters on the validation sets, we use them on the test-set features to predict the MAP outcomes and compare these predictions to the yTest values, similar to what we did in MLE.Evaluating Outcomes and Model(s)’ PerformanceScatter plot of MLE predictions for y VS real yTest valuesScatter plot of MAP predictions for y VS real yTest valuesSummary statistics for evaluation:MLERun-time: 0.39 secondsMedian prediction error: 34.70Mean prediction error: 62.35Mean Squared Error : 9264.41R-squared value : 0.703Correlation between Predictions and Real(Test) values: 0.844MAPRun-time: 168.09 secondsMedian prediction error: 35.21Mean prediction error: 62.47Mean Squared Error : 9301.64R-squared value : 0.702Correlation between Predictions and Real(Test) values: 0.843Examining how MLE and MAP predictions vary from the real data using a sub-set of the Test dataDiscussionAs can be seen from the plots and the values above, moth MAP and MLE give us very similar results, as is usually expected when we are working with a data-set of this size. We find that we have median prediction errors of 34.7–35.2 for an overall range of user counts that extend from 0 to almost 800, The mean squared error (MSE) for our predictions ranged in the low 9000’s and the R-squared value for both models was approximately 0.70. We can also notice that there is a strong correlation in our data between actual count and predictions in both plots. The best fit lines through the scatter plots are very close to what would be perfect, a line with gradient of 1 passing through the origin. Note that none of our predictions are less than zero because we assumed that the usage count follows a Poisson distribution and we used a Poisson Regression model. Investigating a little bit further, let’s look at a subset of our test-data which has just 100 data-points. We see that both MAP and MLE predict the user-count very well when the overall usage count is low, but not so well for the 4 instances when the actual usage count was over 175. These points lead to outliers with large errors which explains why our mean error is so much larger than our median error.Based on these results, it is probably safe for us to conclude that we have created effective models for predicting the number of bike-share users per hour. There is certainly room for further improvement, especially for hours when the number of users is exceptionally high. So if you have suggestions on what could be done to improve these models, please feel free to let me know. That concludes our work here. Thank you for reading.If you made it this far, I hope you enjoyed reading this as much as I enjoyed writing it. I also hope you know more about Exploratory Data Analysis and Machine Learning than you did before. If you think this may me interesting or educational to others you know, please do share it with them. If you liked the article, and had something you wanted to share with me, feel free to comment, contact me via email at nadir.nibras@gmail.com or at https://www.linkedin.com/in/nadirnibras/. I am committed to improving my methods, analyses or data-sets, so if you have any suggestions, feel free to comment or let me know otherwise. If you want to follow more of my work on data science, follow me on Medium and Linkedin.