This is a key trade-off in production-grade machine learning applications: at one end of the spectrum we’re optimizing for model performance, and at the other we’re optimizing for low-latency application performance.
As I continued to test out the application’s performance, I still faced the challenge of relying on so many APIs for real-time feature generation.
Due to rate-limiting constraints and daily request limits across so many external APIs, the current machine learning classifier was not feasible to incorporate into the final application.
Run-Time Compliant Application Model

After going back to the drawing board, I trained a random forest model that relied primarily on scooter-specific features which were generated directly from the Bird API.
Through a process called vectorization, I rewrote the geolocation distance calculations to operate on NumPy arrays, which enabled batch operations on the data without writing any “for” loops.
The distance calculations were applied simultaneously on the entire array of geolocations instead of looping through each individual element.
The vectorized implementation optimized real-time feature engineering for distance-related calculations, which improved the application response time by a factor of ten.
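To make the idea concrete, here is a minimal sketch of a vectorized distance calculation. It assumes the haversine (great-circle) formula and made-up coordinates; the post does not say which distance formula or reference points were actually used, so `haversine_km` and the sample arrays are illustrative only.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km. Inputs are in degrees, scalars or arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical scooter coordinates (degrees) and one reference point.
scooter_lat = np.array([34.0195, 34.0100, 34.0250])
scooter_lon = np.array([-118.4912, -118.4960, -118.4800])
ref_lat, ref_lon = 34.0195, -118.4912

# One call covers the whole array; NumPy broadcasts the scalar reference point.
distances = haversine_km(scooter_lat, scooter_lon, ref_lat, ref_lon)
```

The same function works for a single pair of points or for thousands of scooters, because every operation inside it is an element-wise NumPy operation rather than a Python loop.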
Feature Importance for the Run-time Compliant Random Forest Classifier

This random forest model generalized well on test data with an AUC score of 0.95 and an accuracy rate of 91%.
The model retained its prediction accuracy compared to the former feature-rich model, but it gained 60x in application performance.
This was a necessary trade-off for building a functional application with real-time prediction capabilities.
Geospatial Clustering

Now that I finally had a working machine learning model for classifying nests in a production-grade environment, I could generate new nest locations for the non-nest scooters.
The goal was to generate geospatial clusters based on the number of non-nest scooters in a given location.
The k-means algorithm is likely the most common clustering algorithm.
However, k-means is not an optimal solution for widespread geolocation data because it minimizes variance, not geodetic distance.
This can create suboptimal clustering from distortion in distance calculations at latitudes far from the equator.
With this in mind, I initially set out to use the DBSCAN algorithm which clusters spatial data based on two parameters: a minimum cluster size and a physical distance from each point.
There were a few issues that prevented me from moving forward with the DBSCAN algorithm.
The DBSCAN algorithm does not allow for specifying the number of clusters, which was problematic as the goal was to generate a number of clusters as a function of non-nest scooters.
I was unable to home in on an optimal physical distance parameter that would change dynamically based on the Bird API data.
This led to suboptimal nest locations due to a distortion in how the physical distance parameter was used in clustering.
For example, Santa Monica, where there are ~15,000 scooters, has a higher concentration of scooters in a given area whereas Brookline, MA has a sparser set of scooter locations.
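The sensitivity to the distance parameter can be illustrated with a small sketch. It assumes scikit-learn's `DBSCAN` with the haversine metric, which the post does not confirm was the implementation used, and the coordinates and `eps_km` values are made up: one tight group of points and one spread-out group.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Hypothetical scooter locations as (lat, lon) in degrees.
coords_deg = np.array([
    [34.0100, -118.4900], [34.0102, -118.4901], [34.0101, -118.4903],  # dense group
    [42.3300, -71.1200], [42.3340, -71.1200], [42.3380, -71.1200],     # sparse group
])
coords_rad = np.radians(coords_deg)  # the haversine metric expects radians


def dbscan_labels(eps_km, min_samples=3):
    """Cluster with DBSCAN; eps is converted from km to radians on the unit sphere."""
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords_rad)


labels_small = dbscan_labels(0.1)  # 100 m radius
labels_large = dbscan_labels(1.0)  # 1 km radius
```

With the 100 m radius, the sparse group is labelled as noise (`-1`) and yields no cluster at all, while the 1 km radius recovers both groups. No single radius suits both densities, and the number of clusters is an outcome of the parameters rather than something that can be requested.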
Given the granularity of geolocation scooter data I was working with, geospatial distortion was not an issue and the k-means algorithm would work well for generating clusters.
Additionally, the k-means algorithm parameters allowed for dynamically customizing the number of clusters based on the number of non-nest scooters in a given location.
Once clusters were formed with the k-means algorithm, I derived a centroid from all of the observations within a given cluster.
In this case, the centroids are the mean latitude and mean longitude for the scooters within a given cluster.
The centroid coordinates are then projected as the new nest recommendations.
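As a rough sketch of this step, the snippet below assumes scikit-learn's `KMeans` and a hypothetical rule of one nest per three non-nest scooters. The post does not state the library or the actual function that maps scooter counts to cluster counts, so `SCOOTERS_PER_NEST`, `recommend_nests`, and the sample coordinates are illustrative assumptions.

```python
import math

import numpy as np
from sklearn.cluster import KMeans

SCOOTERS_PER_NEST = 3  # hypothetical ratio, not taken from the post


def recommend_nests(coords_deg, scooters_per_nest=SCOOTERS_PER_NEST):
    """Cluster non-nest scooters and return (centroids, labels)."""
    # The number of clusters scales with the number of non-nest scooters.
    n_clusters = max(1, math.ceil(len(coords_deg) / scooters_per_nest))
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(coords_deg)
    # Each centroid is the mean latitude and mean longitude of its cluster.
    return model.cluster_centers_, labels


# Hypothetical non-nest scooter locations as (lat, lon) in degrees.
non_nest = np.array([
    [34.0100, -118.4900], [34.0102, -118.4901], [34.0101, -118.4903],
    [34.0300, -118.4700], [34.0302, -118.4701], [34.0301, -118.4703],
])
nests, labels = recommend_nests(non_nest)
```

Here six scooters produce two clusters, and each row of `nests` is a recommended nest coordinate. Clustering raw latitude and longitude like this is only reasonable over a small area, which is the situation described above.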
NestGenerator showcasing non-nest scooters and new nest recommendations utilizing the K-Means algorithm

NestGenerator Application

After wrapping up the machine learning components, I shifted to building out the remaining functionality of the application.
The final iteration of the application is deployed to Heroku’s cloud platform.
In the NestGenerator app, a user specifies a location of their choosing.
This will then call the Bird API for scooters within that given location and generate all of the model features for predicting nest classification using the trained random forest model.
This forms the foundation for map filtering: in the app, a user can filter the map based on nest classification.
Drop-Down Map View filtering based on Nest Classification

Nearest Generated Nest

To see the generated nest recommendations, a user selects the “Current Non-Nest Scooters & Predicted Nest Locations” filter which will then populate the application with these nest locations.
Based on the user’s specified search location, a table lists the five closest nests with their proximity and street address to help inform a Bird charger’s decision-making.
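One way to rank nests by proximity is sketched below. It assumes haversine distance and a hypothetical `nearest_nests` helper with made-up coordinates; the app's actual lookup, and the reverse geocoding that produces the street address, are not shown in the post and are not reproduced here.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def nearest_nests(search_lat, search_lon, nests_deg, k=5):
    """Return (indices, distances_km) of the k nests closest to the search point."""
    lat1, lon1 = np.radians([search_lat, search_lon])
    lat2 = np.radians(nests_deg[:, 0])
    lon2 = np.radians(nests_deg[:, 1])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    order = np.argsort(dist_km)[:k]  # indices sorted from closest to farthest
    return order, dist_km[order]


# Hypothetical generated nests as (lat, lon) in degrees, in no particular order.
nests_deg = np.array([
    [34.05, -118.49], [34.01, -118.49], [34.03, -118.49], [34.00, -118.49],
    [34.06, -118.49], [34.02, -118.49], [34.04, -118.49],
])
idx, dist = nearest_nests(34.00, -118.49, nests_deg)
```

The returned indices can be used to look up each nest's row in whatever table holds its address, and the distances fill the proximity column.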
Conclusion

By accurately predicting nest classification and clustering non-nest scooters, NestGenerator provides an automated recommendation engine for new nest locations.
For Bird, this application can help power their nest location generation that runs within their Android and iOS applications.
NestGenerator also provides real-time strategic insight for Bird chargers who are enticed to optimize their scooter collection and drop-off route based on scooters and nest locations in their area.
Code

The code for this project can be found on my GitHub.

Comments or Questions?

Please email me at: perryrjohnson7@gmail.com.