Indeed, a single activity might not just be enough yet to decide our next holiday destination.
In order to really establish destinations, we will have to establish groups of activities that are within reasonable travelling time from each other.
In other words: time for a Cluster Analysis on our geodata!Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters) — WikipediaOur ApproachExplore the different clustering algorithms and it’s pros vs cons in relation to our problemSet constraints and variables to be used in the clustering process (i.
distance, min/max activities, etc.
)Apply clustering algorithm to our data, to be stored as a one to one relation between each activity and one clusterVisualize our clusters on the map and combine with our scored measure to be able to compare clusters to each otherThe DataThe data we need for clustering today is the geodata in terms of the Longitude and Latitude for each activity.
As we have already derived these using Google Maps’ Geocoding API in episode 1, there is no further data to be extracted to complete our task.
Clustering AlgorithmsThe biggest pitfall in applying any clustering analysis to your data is choosing the wrong clustering algorithm.
There are numerous different algorithms with different properties used in the way to calculate groups of data points that form a cluster.
Each problem is different, and each algorithm will yield different results.
Therefore whenever starting a clustering analysis its important to start by evaluating the pros and cons of the most common clustering algorithms.
Don’t apply the algorithm that you are most familiar with for each problem as it will not necessarily give you the optimized solution.
In this article I will not go in depth on each clustering algorithm, only analyzing which is best for our problem.
For a good read on the subject, and the algorithms we will explore today, I recommend this article by George Seif.
Personally I like schematic comparisons, so as a first step, in Fig.
1, you’ll find a very simplified comparison of the key properties of our problem, for the following algorithms: K-Means, Mean-Shift, DBSCAN, Expectation-Maximization and Agglomerative Hierarchical clustering.
please note:Only comparing on characteristics that are a direct requirement to solve our problem (not comparing many other cluster algorithm attributes).
Please do let me know if you think we overlooked an essential clustering algorithm that would be a perfect fit for our problem.
1 Comparing our requirements against clustering algorithmsLet’s breakdown and explain our requirements one by one:Number of clusters: Some clustering algorithms require the number of clusters to be known beforehand and will then allocate all data points to the given number of clusters.
However, at this point in our analysis, we have no valuable estimate of the number of clusters that could be derived from our activities.
Therefore we are looking for an algorithm that will derive the number of clusters from the data.
Matching algorithms: Mean-Shift, DBSCAN, Agglomerative HierarchicalNoise Handling: Not necessary all data points are to be considered part of a cluster.
For some algorithms its better to remove noise from the data-set before applying the clustering, while others have a way of defining noise in the process.
In our case, we do not want to delete noise but want it flagged instead.
We can have a great stand alone activity in the middle of nowhere that will not be part of any holiday destination, but still want it flagged as an outlier destination.
For this we need an algorithm with noise handling.
Matching algorithms: Mean-Shift, DBSCANTime Complexity: Of lesser importance at this stage, as we will only execute the clustering in a very low frequency and our data-set is not estimated to grow in a very fast pace.
However still we want to prevent having to refactor everything in a later stage if our data-set of activities does grow and an O(n²) algorithm has become unmanageable.
In our case we would aim for the best possible time complexity possible.
From an O(n²) algorithm (or worse) we deem the clustering unfit for our problem.
Matching algorithms: K-Means, Expectation-Maximization, DBSCANThresholds: The nature of some clustering algorithms provide for parameters with which to steer the clustering process and to set some boundaries for the cluster edges.
Very important in our case, as we would like to set some thresholds on how large a cluster can get to be manageable in an holiday period.
For example, looking at the scatter plot on coordinates in chapter 1, it is easy to see that most clustering algorithms will cluster the whole of Europe in a single cluster.
Quite hard to do in just a single holiday.
So we will need to be able to set some guidelines as in what thresholds we consider for a single cluster to be formed.
Matching algorithms: Mean-Shift, DBSCAN, Expectation-MaximizationOut of these five options, only the DBSCAN clustering algorithm ticks all the boxes for our problem (although still scoring only medium on time complexity) and on top of that, its required parameters are an excellent fit to our desired clustering.
Therefore, we are going to continue with DBSCAN and if we cannot get to the desired clustering results, we can always proceed with exploring the runner up algorithm: the Mean-Shift.
Setting Constraints & VariablesFirst, we’re gonna set the boundaries of what detail we are going to cluster.
At this stage I want to keep countries separated and only cluster activities within a single country.
Therefore, by the nature of clustering, small countries will probably become a single cluster.
And although there could be cross-border holiday destinations in the “real world”, we will still be able to form these at a later step in the process by combining neighboring clusters.
Further, specifically for DBSCAN we have to define two variables.
As DBSCAN is a density-based clustering algorithm, it requires an epsilon and the minimum population of a cluster, to define exactly what is the density that it should be looking for.
Epsilon: In our case we can use this epsilon variable to define the range in travel distance that we are willing to travel between activities within one full holiday.
However, since the distance will be extracted from coordinates, the actual input value required will be the linear distance instead of the travel distance.
Let’s be very pragmatic on this one and say that we do not want to waste a full day driving to the next holiday activities and that the maximum we are willing to cover is roughly half a day.
Assuming that it translates into an maximum amount of 500 km; and considering a 1.
4 benchmark, this will result in 357 km of linear distance that we will enter as our initial epsilon variable.
Minimum population of activities: The minimum number of activities that we wish in a single cluster is directly related to the actual duration of the holiday (which will be a required parameter at a later stage).
But the number of activities we can filter later at the moment of executing the algorithm.
Therefore for the clustering process we should go for the minimum number of activities that would fill the minimum possible duration of a holiday.
This minimum possible duration that we’ll consider, is a weekend get-away; three days and an equal minimum amount of activities.
Automatically meaning that if there are only two or less activities, within the epsilon distance, these will be considered an outlier and not part of any holiday destination.
Applying clustering algorithmFinally time to get to the fun stuff and apply the DBSCAN clustering to our data-set.
We will use python and the scikit learn clustering package for the actual implementation.
As an example of clustering we will pick the USA.
A large country with major activities scattered all around:Interactive Chart #1 All major activities in the USA in our data-setNow let’s apply the DBSCAN clustering algorithm, with the 500 km travel distance as an epsilon requirement and three activities as the minimum required population within a cluster.
The cluster name starts with the country ISO code, then the number of the cluster within that country and last the number of activities in the cluster.
Please note that ‘-1’ refers to an outlier.
Interactive Chart #2 All major activities clustered into holiday destinationsThat’s looking a lot more structured into actual destinations that we can travel to!.The clustering resulted in this case in 7 destination clusters:Mid-East with 27 major activities (from Chicago to New Orleans(!))North-East with 26 major activities (from Boston to Washington DC)Florida with 22 major activitiesSouth-West with 17 major activities (from LA to Bruce Canyon NP)Hawaii Islands with 5 major activitiesSan Francisco with 4 major activitiesColorado with 3 major activitiesAnother 7 activities did not make it into any cluster and are considered outliers.
Some great and exciting looking road-trips!.The size of the clusters is of lesser importance at the moment as this will still be fine-tuned at a later stage, when holiday duration comes into play.
Curious to how the variables influence the clustering?.Please see below a matrix with three different travel distances as epsilon variables (250, 500 and 700) and three minimum population requirements (1, 3, 5):In general the trend is clear: less travel distance means more and smaller clusters as well as more outliers.
While more travel distance means larger clusters and less of them.
Also, less minimum population means less outliers and thus more (small) clusters, while a higher minimum population automatically leads to many more outliers and less clusters.
Our choices of variables (the middle one) seems a very reasonable outcome for exploring holiday destinations in the US; Well balanced between number of clusters, cluster sizes and the number of outliers.
But what about the rest of the world?.Let’s explore it in the interactive chart below.
Be sure to zoom in to explore clusters and find a gem within these 248 different holiday destinations!As per our logic, much of the smaller countries (so basically everything in western and southern Europe) will be one cluster per country (except for island destinations), but larger countries have been separated into multiple destinations nicely.
Overall it is exactly the result that we aimed for at this stage; giving us the ability to start analyzing destinations instead of single activities.
Adding the score and visualizeNow that we have our holiday destinations, we can add back our scoring measure that we created in chapter 1 based on its activities’ reviews.
Let’s start with diving into the top 10 holiday destinations and their main city (in number of activities):Chart #1 Top 10 cluster with their main city based on weighted scoring measureWell, it seems we definitely have to put Colorado on the top of our bucket list!.Also interesting to see the neighboring clusters in Brazil and Argentina, around the Iguazu Falls, both in the top 5.
Lastly, also Easter Island is scoring very high as a holiday destination.
Probably not the first place that came to mind?Finally, we would like to see the weighted average score per holiday destination mapped out on the globe in a similar way that we mapped out all activities in chapter 1.
Please note that the size of the cluster refers to the number of activities included and the darkness of the color fill to the scoring measure (the darker, the better):Going forward we will consider these holiday destinations as our analysis objects and no longer talk in detail about single activities and / or countries as a destination.
Thank you for reading today’s article, next time we are going to tackle the TSP problem and create optimized road trips for each destination!For more information about myself feel free to checkout my LinkedIn.