One concept that beautifully captures the level of randomness in a sequence of events is found in the domain of information theory and is known as entropy.
Shannon entropy takes into account the distribution of the events in the sequence, and provides a value representing the “expected surprise” or uncertainty enclosed by a random variable (the sequence of visited locations in our case).
When all values of the random variable have the same probability of occurrence, then Shannon entropy reaches its maximum value, also known as Hartley entropy.
For our model, we included the raw value of Shannon entropy as well as normalized by the Hartley entropy for the sequences of visited and stop points aggregated per day, week, weekday, weekend and month (points where the individual stopped for more than 30 minutes).
Hierarchical ClusteringIn this specific problem, one key point in being able to augment the data set with external sources is to be able to filter down the number of data points.
Indeed, as mentioned earlier, logs span more than a year in average and data points are recorded about every 30 seconds (in the best case) which leads to millions of data points for each driver.
Due to limits on the number of calls to the different APIs and the time it would take to process the HTTP request it is not feasible to obtain the information for every single points in the dataset.
The question is then how to select the best points for which fetch this information?The way we solved this particular issue is through clustering.
However, there exists a wide variety of clustering algorithms, so the next question is which one to use?Needless to say, that having 2,000 different drivers, scattered through Spanish geography, and carrying different data collecting devices leads to having different locations histories (not sure what that means), where the clustering need to make sense.
For that reason, it was a challenged to work with algorithms depending on predefined parameters since these needed to be adjusted to each different case.
For this reason we discarded plain clustering techniques, such as k-NN because we cannot know the final number of clusters.
Density-based clustering algorithms were discarded as well because they turned out to be quite sensitive to the hyperparameters.
To illustrate this, we ran some tests for which we show results in the images below.
In the left image, we see the result of clustering the trips of a user using HDBSCAN, a density-based algorithm, with a minimum number of points per cluster equal to 2, and a minimum distance equal to 10, whereas in the right image the parameters are changed to 10 and 10 respectively.
As can be seen, there is a huge difference between the results, just for this particular driver and area of a city.
Thus, density-based algorithms were discarded since it would require tuning a different set of parameters for each driver in every scenario.
Results of clustering a single driver trip with HDBSCAN with two different parameter valuesFinally, we turned to hierarchical clustering instead.
This family of algorithms needs just the maximum distance between points of a cluster to perform their task, which we set to 100 m for all drivers, since this is the granularity we wanted to reflect the trip of drivers.
No other parameter needed to be set, nor final number of cluster or number of points per cluster.
Among the different hierarchical clustering algorithms, we focused only on two of them due to memory constraints: CLINK and SLINK.
The main difference between the two is how to calculate the distance between clusters: SLINK measures the distance between the closest points of two former clusters to decide whether to merge them or not, whereas CLINK measures the distance between the furthest away points.
This way, CLINK is better suited for our purposes so that we are sure no two points are furthest away than the threshold (with SLINK, the closest points of the two merged clusters are below the threshold of 100 m, but we have no clue of what happens with the rest of the points of both clusters, which could be much further away, thus covering different streets or roads with different speed limits or road type).
The image below illustrates the kind of results we can achieve using CLINK for our problem.
The image on the left hand side represents the clustered positions of a specific driver while the one on the right hand side is the original image with all the collected positions.
One can see that the clustering allows us to considerably reduce the number of points with minimal loss of information as all the trips and roads taken can still be identified.
Clustering results using CLINK.
On the right the original measures, on the left the clustered ones.
Using CLINK algorithm allows us to cluster nearby points diminishing the redundancy while preserving different granularity at urban level (closer data points are needed to profile the displacements across the city) and at highway level (changes are less frequent both in speed and acceleration allowing for greater removal of redundancy without loss of accuracy).
In addition, we introduced spatial indexes in the databases in order to efficiently share geographical information across drivers.
This allows for instance to extrapolate the type of road and speed limit to drivers’ trips having very similar coordinates which in a sense amounts to clustering among different drivers.
ConclusionThe key to successfully derive predictive analytics from real life scenarios is often more about the data than the sophistication of the algorithm .
It is about being creative, and being able to extract innovative attributes that reveal the complexity of the data.
Only then, modelling algorithms will be able to exploit the full potential of the data and deliver the best predictions.
This work was made by Alicia Rodríguez, Senior Data Scientist and Florian Lyonnet, Chief Data Scientist at GDSLINK.
.. More details