How We Finished 3rd and Still Won a Data Science Competition
We had a lot of work to do — even after the final scores were published.
Julio Cezar Silva · Jul 2
It was a first-timers duo versus veteran scientists worldwide, so you can imagine how it ends.
A few months ago, we were just two students building up Data Science knowledge and striving to enter the field professionally.
At the time, Victor was a Data Scientist Intern at IBM Research and I was a Software Architecture Analyst at Accenture.
Our common goal led us to a competition search that was increasingly frustrating, since it always boiled down to places where wins are purely technical, and things like leaderboard probing become commonplace.
The opportunity to test our skills beyond pure technique was found only a month later: the EY NextWave Data Science Competition.
It tackled urban mobility problems in Atlanta, challenging participants to predict whether trajectories would end in the city center, based on geolocation data.
And this one was different.
It wasn’t another hard-skill-only challenge, as building a top-scoring model was only half the battle: top-ranked scientists had to present their methods and research background to a board of senior EY judges.
Use of external data, EDA robustness and problem-oriented approaches were among the main criteria.
Here we give a strategy timeline of how, by focusing on solving the underlying problem, we turned our 3rd place model into the winning project.
To follow along in full detail, see the complete code on GitHub.
How We Analysed Data
As has been pointed out, our work methodology was problem-oriented rather than competition-oriented.
Strictly speaking, this means we managed our time using a business-centric Data Science process to understand the problem, explore data, extract features, train models and analyze the results — enabling us to implement this solution in a real environment if desired, not just producing a score that might be useless afterwards.
Our Data Science Cycle
The first major step was deeply understanding what we were trying to solve.
We made an exhaustive literature search to clarify how people usually approach the problem and which tools they use.
With limited time to both finish the competition and work on the data, this phase was crucial for narrowing our focus to feasible alternatives, rather than spending time on techniques that might not prove useful within our investable time.
Upon starting our Exploratory Data Analysis phase, we made a series of visualizations to get a sense of overall data distribution, while trying to check for inconsistencies and problems on data generation.
We then hit a major concern: more than half of our data were zero-distance trajectories, i.e. their entry and exit locations were the same.
This imposed hardships to overcome in modeling, and would have been impossible to identify without extensive analysis.
In the end we made it an advantage — the count of 0-distance trajectories in a person’s journey would later become a feature.
A pair of 0-distance routes coupled with normal ones
Moreover, the data supplied by organizers came in Cartesian projections rather than the usual latitude/longitude values.
Translating them into the latter format would be a big advantage to our EDA and Engineering processes, though we initially lacked the know-how.
Fortunately, through deeper analysis and research we discovered how to transform all values using the Mercator projection and the pyproj library in Python.
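The transformation can be sketched in plain Python using the inverse Web Mercator formulas, assuming the source projection is the standard EPSG:3857 (the competition's actual projection may have differed):

```python
from math import atan, degrees, exp, pi

R = 6378137.0  # WGS84 equatorial radius used by Web Mercator (EPSG:3857)

def to_lat_lon(x, y):
    """Invert the Web Mercator projection: metres -> (lat, lon) degrees."""
    lon = degrees(x / R)
    lat = degrees(2 * atan(exp(y / R)) - pi / 2)
    return lat, lon
```

In practice, pyproj's `Transformer.from_crs("EPSG:3857", "EPSG:4326", always_xy=True)` performs this conversion (and any other CRS pair) robustly.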
Visualization of most common trajectory paths using KMeans and Google Maps API
After exploring and discussing the spatial complexities of our data, understanding their distribution over time was crucial.
Like any human activity performed routinely, the time when these events occurred holds a general influence on how they occur.
Effectiveness was key.
This point in our analysis had to fill two needs with one deed: 1) asserting consistency between train and test set distributions, and 2) providing essential visualizations through time.
Both are shown in the below bar plots, but only the latter generated discussions.
From plot #1 we can observe a first local maximum in average distance traveled, happening at midnight (0h).
Afterwards, there’s an upward trend reaching global maximum around 6AM.
Judging by their timing, we concluded these trips were commutes to and from work — even the midnight peak can be indicative of night shifts ending.
And while the first visualization was enlightening, the second one seemed a worrisome complexity.
The number of people inside the city center sets new highs continually from 6AM until 3PM, and our prediction target period was 3PM to 4PM.
This means our target could, as trend indicates, contain the global maximum or, in a complex twist, a first major drop in those numbers — a pattern that’s yet unseen in the available data.
We had to give our models much greater depth for them to accurately predict the target.
How We Invested in Feature Engineering
Our largest volume of work went into this area.
We wanted to climb the ranks with features backed by both research and analysis, so the best performing features would be the ones we knew best.
Since our problem was one of trajectory analysis, we had to know how far any point was from the center and from trajectory entries/exits.
So after our first round of research, we had selected the three distance formulas that made the most sense to us.
Our main distances
Haversine distance is fundamentally different from the others — it considers the spherical form of a surface — and demanded that we master the Mercator projection we had studied back in EDA.
By this point, we had already created an extensive group of distance features, such as distances to the center’s vertexes, to the origin, and to previous entries.
Further analysis led us to understand that not all travelers in our data had a cleanly connected journey.
Spread through our train and test sets, a high number of people started moving from places different from their last stop.
These gaps, which we called Blind Distances, later became a strong feature.
Blind Distance drawn — the 4th entry here is disconnected from the 3rd exit
Yet this feature didn’t rank high in importance until it was calculated with the Haversine formula instead of the Euclidean one.
Seeing how differently each formula played out, we decided to calculate every distance feature thrice — in Euclidean, Manhattan and Haversine form.
There’s a clear correlation between the three, yet each plays a specific part in our final model.
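As a sketch, the three formulas can be written as follows (Haversine takes latitude/longitude in degrees; the other two take projected coordinates):

```python
from math import radians, sin, cos, asin, sqrt

def euclidean(x1, y1, x2, y2):
    """Straight-line distance in the projected plane."""
    return sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

def manhattan(x1, y1, x2, y2):
    """Axis-aligned (city-block) distance."""
    return abs(x2 - x1) + abs(y2 - y1)

def haversine(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km; accounts for Earth's curvature."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(a))
```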
Our strategy aimed at maximizing the number of identifiable patterns by combining distance formulas and also performing (min, max, std, …) aggregations on all features.
This way we also took advantage of the robustness of Gradient Boosting algorithms — such as LightGBM — for large amounts of features that may be highly correlated.
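As an illustration, the aggregation step can be sketched with pandas (the column names here are hypothetical, not the competition's actual identifiers), including the 0-distance count feature mentioned earlier:

```python
import pandas as pd

# Hypothetical per-trajectory frame: one row per trajectory of a person.
df = pd.DataFrame({
    "hash": ["A", "A", "A", "B", "B"],
    "distance": [0.0, 2.5, 0.0, 1.2, 3.4],
})

# (min, max, mean, std) aggregations of a feature per person.
aggs = df.groupby("hash")["distance"].agg(["min", "max", "mean", "std"])

# Count of 0-distance trajectories in each person's journey.
aggs["zero_count"] = (
    df.assign(is_zero=df["distance"] == 0).groupby("hash")["is_zero"].sum()
)
```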
No distance analysis alone can provide a complete perspective on a spatiotemporal problem.
A clear understanding of the role of time is paramount, and we sought it by connecting the biggest insights from EDA into features such as period of day, a diverse set of deltas, and continuous representations of both hours and minutes.
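The continuous time representation and origin deltas described above can be sketched as:

```python
from datetime import datetime

def continuous_hour(ts):
    """Continuous representation of time of day, e.g. 15:30 -> 15.5."""
    return ts.hour + ts.minute / 60 + ts.second / 3600

def minutes_since(origin, ts):
    """Time delta with respect to a journey's origin, in minutes."""
    return (ts - origin).total_seconds() / 60
```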
A continuous representation of time (15h30 = 15.5) and time delta w.r.t. origin
Now, further within the analysis of a single route, applying central aspects of geometry was undoubtedly important for our classification, in particular the analysis of angular features.
Having a numerical scale for the direction of any past trajectory ultimately pushed us past a major plateau.
Trajectory direction & angles to city center’s a) midpoint, b) vertexes.
Wrapping up our feature engineering, towards the second half of the competition we needed some kind of memory in our models — previous points in a journey could help reveal continuity patterns.
A pattern that’d become clearer here is that of people who live in the city center, but work far from it and return home in the end.
Remembering some of their past travels would surely help.
Since LSTMs didn’t perform nearly as well (or as fast) as LightGBM did for us, we proceeded to convert our datasets to sequences.
Sequence format for the T4 row of ID #1
This way, every row keeps a record of a given person’s previous trajectories.
To avoid overfitting, we benchmarked different window sizes on which to limit how many trajectories are kept in a sequence — reaching an optimal window = 6 value for the final submission.
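A minimal sketch of the sequence conversion, using a plain-Python window over one person's time-ordered trajectories (the real format carried full feature rows, not integers):

```python
def to_sequences(trajectories, window=6):
    """Attach up to window-1 previous trajectories of the same person
    to each row, oldest first."""
    sequences, history = [], []
    for traj in trajectories:  # assumed sorted by time, single person
        sequences.append(history[-(window - 1):] + [traj])
        history.append(traj)
    return sequences
```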
Learning from Systematic Mistakes
In the middle of the competition, we hit a major plateau that none of our attempts surpassed for over a week (which is a lot in a one-month competition).
The halt in progress led us to take a more mature approach and start continuously learning from our models’ mistakes by performing Residual Analysis.
In short, it’s a process by which you track and analyze two opposite samples: correctly predicted data points, and mistaken ones.
Our goal here was confronting the distinction between these groups in every reasonable way, identifying the biggest patterns within our mistakes, and laying out any optimizations to minimize them.
Layers of interpretation led us to conclude that our model was correct about most trajectories that either stayed inside or stayed outside Atlanta’s center.
What it almost always got wrong were the ones exiting or entering the center.
Shortest-distance wrong predictions (left) vs correct (right).
Purple rectangles are Atlanta’s center area.
One can clearly see the difference in spatial sparsity above, and how strongly the wrongly predicted data (left) interact with the city center.
After realizing this was the main failure in our predictions, we set out to create an essential group of features to address it. And the plateau was overcome.
If we hadn’t stopped aimlessly trying to improve, or hadn’t actually assessed our weaknesses — we’d probably still be fighting that performance halt.
Going Beyond Given Data
Usually, competitions let you thrive by squeezing score points from given datasets as well as you can.
Victor and I are against that as the main practice, because it fails to mirror what happens in real-life projects.
Real life rarely — if ever — gifts you a perfectly good dataset for that cool new project you just came up with.
You usually have to research, scrape, collect and clean the data you want to use.
This introduces a layer of difficulty usual competitions don’t prepare you for.
Yet here, the use of external data was encouraged in their very guidelines.
We saw that as a chance to completely test our resilience in Data Science, so we promptly took it.
Searching and actually finding useful data was not easy — outside the competition bubble there’s a world full of papers and research using data that’s different from your intent, or not publicly releasing the data you want.
Several reads in, we found a traffic related study for the state of Georgia, and it contributed what would become one of the most important groups in our feature space.
Validation and Training Strategies
Another point of interest that made this competition feel like a real project was that the organizers did not give a label we could use to predict our target variable directly.
That is, it was up to us to choose supervised (classification, regression) or unsupervised learning.
The obvious first choice is to predict x and y positions with regression, then infer whether the trajectory ends inside the city center.
However, this strategy had a few caveats: we would not be minimizing the true target, and error metrics for predicting two separate variables ignore the distance between entry and exit points, treating x and y independently.
With these obstacles in mind, we decided to simplify our work by framing it as binary supervised classification.
Sketch of our supervised approach.
Next, we needed to define our validation method.
From the competition rules, 30% of the supplied data was reserved as the test set, i.e. the data on which organizers would calculate scores to define the final ranks.
To create consistent validation and avoid overfitting the test set, we split the remaining 70% of our data into train and validation sets to assess our model’s results.
For hyperparameter tuning, we used K-Fold cross-validation with k = 5 on the training data only — this way we avoided overfitting our validation set.
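A minimal sketch of this scheme, with a toy dataset and logistic regression standing in for the actual LightGBM pipeline:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy stand-in for the feature matrix and binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 5-fold CV over the training data only; the held-out validation set
# stays untouched during tuning.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
mean_f1 = float(np.mean(scores))
```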
Is there more to it?
Yes. This article is mainly a summary of part of the work we put in; currently the most complete reference is our GitHub repo.
Winning was a result of our rationale, but also of the algorithms and scores that came from it, so understanding its depth requires further study of what’s implemented.
Alternative Approaches
No win is indicative of perfection.
Being a conscious data scientist, in our vision, means being aware of the shortcomings of any strategy taken.
Here we intend to clearly state a few good steps towards the best model our approach enables.
By proposing the maximization of identifiable patterns with distance formula combination, feature aggregation and sequence format conversion, we also maximized our feature count.
LightGBM has optimized performance, and by the end of the competition we had set up AWS SageMaker for most training and tuning, but one could better spend time and resources by applying a dimensionality reduction strategy such as Principal Component Analysis (PCA).
Computation time and load are the most precious resources of any iterative process, let alone Data Science pipelines, so this optimization comes first in our minds.
Also, it is noticeable that we used clustering as a guide in our EDA phase, but we did not explore it as well as we wanted in feature extraction.
We could have created Regions of Interest (ROI) in our data using DBSCAN, for example, as demonstrated in the following plot.
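A sketch of how such regions could be extracted with scikit-learn's DBSCAN, on toy coordinates (the eps and min_samples values here are illustrative and would need tuning on the real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy entry/exit coordinates: two dense blobs plus one isolated point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

# Each non-negative label is a Region of Interest; -1 marks noise.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
```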
Regions of Interest using DBSCAN.
Even better, instead of modeling the data as a binary classification task, we could have used these regions as distinct prediction targets, increasing our range of possible values.
Another approach not deeply explored was using regression techniques to predict x and y position directly, as illustrated below.
Predicting x and y positions directly
However, all the aforementioned approaches had their drawbacks as well.
Using clustering techniques to create new features would impose some difficulties regarding the temporal aspect of the data — we cannot use points at nearby locations but different times to count a point’s neighbors (or extract any other feature we might want).
Predicting x and y positions didn’t perform as expected because we minimized errors in x and y independently rather than the distance between the true and predicted points of arrival, which hurt our scores.
And of course, since we focused significant amounts of time on feature engineering, we did not have enough time to explore alternative approaches with scientific methodology.
Lessons Learned
A Data Science project in a team is far from a college group project.
Even in Google Colab, you still can’t collaborate in real time like in Google Docs — and maybe you shouldn’t.
We tried doing so, thinking things would go faster if we teamed up on a given model, but now we know that Jupyter Notebooks are meant to tell a story with code.
Two minds editing that same story and code at the same time creates broken, confusing results.
We ended up doing the essentials (EDA, Feature Engineering…) together but in different Notebooks so we could compare our individual, unbiased insights afterwards.
Further topics (Residual Analysis, Clustering…) were parallelized.
There’s no universally perfect approach; this one worked for us because (another lesson) we focused on designing our specific workflow early on, instead of jumping into things right away.
Throughout the competition we felt the increasing need to compare and benchmark our past submissions.
For that we even tried keeping copies of submitted notebooks dumped in a separate folder.
Of course, that was unsustainable since we had 100+ submissions done and towards the end only kept high-scoring ones.
There was little tracking of architecture benchmarks (LSTM vs Gradient Boosting vs …), and that was exactly what we most needed to decide our final submission and stacking strategies.
It’s ironic that we only realized it after the competition’s end, but think about it: we used git, one of the most widespread version control tools on the planet.
Its main purpose is to enable file history navigation.
We could’ve created git tags for each submission, following a naming pattern such as architecture/public score for better organization.
Creating tags would be as simple as doing git tag -a lightgbm/882, and to inspect its files we could just git checkout lightgbm/882 — just like you do with branches.
The organizers kept a history of submission scores, and by the end they showed all public scores converted to private.
But since we didn’t track every submission, we couldn’t know which architecture or change produced a given score.
Was 0.856 an LSTM? Did we remove feature aggregation in 0.788? If we had created tags, this article could have complete public & private score benchmarks by now, easily gathered by matching the dates the tags were created with the dates in EY’s model history.
It was a great pleasure to make this project our first large effort in Data Science, in a competition that prioritized our exact values: science in the service of solving problems, and communicating solutions to other people, rather than just building up scores.
After all, complex problem solving is a collaborative effort, beyond the boundaries of pure technical prowess.