There is no universal answer for this, and although the optimal number of centroids or clusters is not known a priori, different approaches exist to try to estimate it.
One commonly used approach is testing different numbers of clusters and measure the resulting sum of squared errors, choosing the K value at which an increase will cause a very small decrease in the error sum, while a decrease will sharply increase the error sum.
This point that defines the optimal number of clusters is known as the “elbow point”, and can be used as a visual measure to find the best pick for the value of K.
In this example, the elbow point is located in 3 clustersK-means is a must-have in your data science toolkit, and there are several reasons for this.
First of all it’s easy to implement and brings an efficient performance.
After all, you need to define just one parameter (the value of K) to see the results.
It is also fast and works really well with large datasets, making it capable of dealing with the current huge volumes of data.
It’s so flexible that it can be used with pretty much any datatype and its results are easy to interpret and more explainable than other algorithms.
Furthermore, the algorithm is so popular that you may find use cases and implementations in almost any discipline.
But everything has a downsideNevertheless, K-means presents some disadvantages.
The first one is that you need to define the number of clusters, and this decision can seriously affect the results.
Also, as the location of the initial centroids is random, results may not be comparable and show lack of consistency.
K-means produces clusters with uniform sizes (each cluster with roughly an equal quantity of observations), even though the data might behave in a different way, and it’s very sensitive to outliers and noisy data.
Additionally, it assumes that data points in each cluster are modeled as located within a sphere around that cluster centroid (spherical limitation), but when this condition (or any of the previous ones) is violated, the algorithm can behave in non-intuitive ways.
Example 1Example 1: On the left-hand side the intuitive clustering of the data, with a clear separation between two groups of data points (in the shape of one small ring surrounded by a larger one).
On the right-hand side, the same data points clustered by K-means algorithm (with a K value of 2), where each centroid is represented with a diamond shape.
As you see, the algorithm fails to identify the intuitive clustering.
Example 2Example 2: On the left-hand side the clustering of two recognizable data groups.
On the right-hand side, the result of K-means clustering over the same data points does not fit the intuitive clustering.
As in the case of example 1, K-means created partitions that don’t reflect what we visually identify due to the algorithm’s spherical limitation.
It tries to find centroids with neat spheres of data around them, and performs badly as the cluster’s geometric shape deviates from a sphere.
Example 3Example 3: Once again, on the left-hand side there are two clear clusters (one small and tight data group and another larger and dispersed one) which K-means fails to identify (right-hand side).
Here, in an attempt to balance the intra-cluster distances between both data groups and generate clusters with uniform sizes, the algorithm blends both data groups and creates 2 artificial clusters that don’t represent the dataset.
It’s interesting to see that K-means doesn’t allow data points that are far away from each other to share the same cluster, no matter how obvious the relation between these data points might be.
What to do now?The thing is real life data is almost always complex, disorganized and noisy.
Situations in the real world rarely reflect clear conditions in which to apply these type of algorithms right out of the shelf.
In the case of K-means algorithm it will be expected that at least one of its assumptions gets violated, so we need not only to identify this, but to know what to do in such case.
The good news is that there are alternatives, and deficiencies can be corrected.
For example, converting data to polar coordinates can solve the spherical limitation we described in example 1.
You may also consider using other types of clustering algorithms if you find serious limitations.
Possible approaches would be to use density-based or hierarchical-based algorithms, which fix some of K-means limitations (but have their own limitations).
In summary, K-means is a wonderful algorithm with lots of potential uses, so versatile it can be used for almost any kind of data grouping.
But there is never a free lunch: you need to be aware of its assumptions and the way it operates if you don’t want to get guided to wrong results.
Thanks Sabrina Steinert for your valuable inputsInterested in these topics?.Follow me on Linkedin or Twitter.