Overall, clustering is a very useful tool to add to your data science tool kit.
However, clustering is not always appropriate for your data set.
If you are interested in venturing into the world of unsupervised machine learning with clustering, follow these five simple guidelines to see if clustering is really an appropriate solution for your data.

1. Does your data already have a potential class label?

Using the existing class label in your data is often better than trying to create a new label from clustering.
If you have the option, supervised machine learning almost always outperforms unsupervised learning in classification tasks.
For this Olympic athlete data, the Medal attribute is an obvious choice for a class label.
If you have data but have no way to organize the data into meaningful groups, then clustering makes sense.
But if you already have an intuitive class label in your data set, then the labels created by a clustering analysis may not perform as well as the original class label.
2. Is your data categorical or continuous?

Many clustering algorithms (like DBSCAN or K-Means) use a distance measure to calculate the similarity between observations.
Because of this, certain clustering algorithms will perform better with continuous attributes.
However, if you have categorical data, you can one-hot encode the attributes or use a clustering algorithm built for categorical data, such as K-Modes.
Note, however, that it rarely makes sense to calculate distances between binary variables.
Knowing how different clustering algorithms perform on different data types is essential for deciding if clustering makes sense for your data.
Height and weight are continuous attributes while Season is a categorical attribute.
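As a minimal sketch of the encoding step described above, the snippet below one-hot encodes a categorical Season column with pandas so that every attribute becomes numeric. The tiny DataFrame is invented for illustration; only the column names (Height, Weight, Season) come from the article.

```python
import pandas as pd

# Hypothetical slice of athlete data: Height and Weight are continuous,
# Season is categorical and cannot feed a distance-based algorithm directly.
athletes = pd.DataFrame({
    "Height": [180, 165, 172, 190],
    "Weight": [80, 55, 63, 95],
    "Season": ["Summer", "Winter", "Summer", "Summer"],
})

# One-hot encode the categorical attribute so every column is numeric.
encoded = pd.get_dummies(athletes, columns=["Season"])
print(encoded.columns.tolist())
# ['Height', 'Weight', 'Season_Summer', 'Season_Winter']
```

The resulting Season_Summer and Season_Winter columns are binary, which is exactly where the caveat about distances between binary variables applies; an algorithm designed for categorical data, such as K-Modes, avoids the issue entirely.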
3. What does your data look like?

A simple visualization of your data with a scatter plot can provide insight into whether it is well suited for clustering.
For example, below is a scatter plot of Olympic athlete height and weight.
Clearly the two attributes have a strong positive correlation and form a dense central grouping, apart from a few outliers.
Scatter plot for height and weight

After running several clustering algorithms on this data, no distinct or meaningful groups formed, and it was determined that these attributes were not well suited for clustering. Simply visualizing the data early in the analysis could have led to this conclusion sooner.
If visualization reveals that your data has no separation or distinct groups, then clustering may not be appropriate.
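A quick look like the one described above takes only a few lines. This sketch uses synthetic height/weight data (one dense, correlated blob plus noise, mimicking the pattern the article describes) rather than the real Olympic data set:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when working interactively
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in for the athlete data: a single dense, positively correlated blob.
height = rng.normal(175, 8, 500)
weight = 0.9 * height - 80 + rng.normal(0, 5, 500)

fig, ax = plt.subplots()
ax.scatter(height, weight, s=8, alpha=0.4)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title("Quick look before clustering")
fig.savefig("height_weight.png")
```

One dense blob with no gaps, like this one, is a warning sign: there are no natural group boundaries for a clustering algorithm to find.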
4. Do you have a way to validate your clustering algorithm?

In order to trust the clustering algorithm's results, you must have a method for measuring its performance.
Clustering algorithm performance can be validated with either internal or external validation metrics.
An example of internal validation is the silhouette score, a way to measure how well each observation is clustered.
The silhouette plot shows the relative size of the clusters, the average silhouette score, and whether observations may have been clustered incorrectly.
The red line in the graph below indicates the average silhouette score for six clusters: about 0.45 (with 1 being perfect, 0.45 is not a great score).
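Computing an average silhouette score takes one call in scikit-learn. The sketch below runs K-Means on synthetic, well-separated 2-D blobs (not the athlete data) so the score comes out high; on a single dense blob it would be much lower:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic 2-D data: three well-separated blobs of 100 points each.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # ranges from -1 to 1; higher is better
print(f"silhouette score: {score:.2f}")
```

Because the silhouette score is internal, it needs no class labels: it only compares each point's distance to its own cluster against its distance to the next-nearest cluster.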
Silhouette plot using K-Means for Olympic athlete data

An example of external validation is when the class label for a data set is already known, but you want to test how well a particular clustering algorithm performs on predicting the existing classes.
A noteworthy caveat to the external validation approach: if the data already has class labels, there is not much of a use case for clustering in the first place!

To have confidence in your machine learning model, you must have a consistent metric for measuring model performance.
Clustering is no different.
You must have a way to quantitatively assess how well the model is clustering the data.
Before conducting a clustering analysis, consider which type of validation and which metric makes the most sense for your data.
Some algorithms may perform deceptively well with certain validation metrics, so you may need to use a combination of performance metrics to negate this issue.
If you consistently achieve poor model performance, then clustering is not a good fit for your data.
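For the external validation case described above, one common choice is the adjusted Rand index, which compares predicted clusters against known labels. This sketch uses scikit-learn's `make_blobs` to generate labeled toy data standing in for a data set with an existing class label:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Labeled toy data: three clearly separated blobs with known class labels.
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.8,
    random_state=1,
)

pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 is a perfect match with the known labels,
# while random cluster assignments score near 0.0.
ari = adjusted_rand_score(true_labels, pred)
print(f"adjusted Rand index: {ari:.2f}")
```

The adjusted Rand index is invariant to how the cluster IDs are numbered, so it works even though K-Means labels its clusters arbitrarily.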
5. Does clustering provide any new insight into the data?

Let’s say that you meet all the above considerations: you have continuous data with no class label, you visualize the data and there is some separation, and you choose a validation metric that makes sense for your analysis.
You run a clustering algorithm on the data and obtain a reasonably high silhouette score.
Exciting! Unfortunately, your work is not done.
After performing a clustering analysis, it is crucial to examine the observations in the individual clusters.
This step allows you to assess whether or not the clusters provide any new insight into the data.
Did the algorithm really find groups of similar observations, maximizing within-cluster similarity while minimizing between-cluster similarity?

An easy way to examine clusters is to calculate simple statistics for the observations in each cluster, such as the mean.
Below is the mean Olympic athlete height and weight for three clusters as a result of K-Means clustering.
Individual Cluster Means

Notice anything strange? The mean heights and weights are almost identical.
This demonstrates that, while the algorithm did cluster the data, the clusters are not substantially different from each other! If clustering fails to produce any new or useful insights into your data, then your data is not well suited for clustering.
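The per-cluster mean check is a one-liner with pandas `groupby`. This sketch clusters a synthetic single blob of height/weight values (invented data mimicking the article's scenario) and then prints the cluster means for inspection:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# One dense blob, like the athlete data: K-Means will still carve it into
# k pieces, but the pieces may not differ in any meaningful way.
df = pd.DataFrame({
    "Height": rng.normal(175, 8, 300),
    "Weight": rng.normal(72, 10, 300),
})

df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    df[["Height", "Weight"]]
)

# Simple per-cluster statistics: if the means are nearly identical,
# the clusters add little insight.
print(df.groupby("cluster")[["Height", "Weight"]].mean())
```

Beyond the mean, comparing per-cluster standard deviations or plotting the clusters in color is an equally quick sanity check.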
Conclusion

As with any data science task, you can’t just throw an algorithm at the data.
You must understand your data and understand the original intentions of the algorithm.
Even if your data is not well suited for clustering, you can still try it.
It never hurts to explore your data, and you never know: you may learn something new!

Thanks for reading.
The data set used as an example can be found here.
Please feel free to leave any constructive feedback or connect with me on Medium!