# Machine learning: clustering

I will present two algorithms in this order: K-means and hierarchical clustering.

## K-means

K-means’ goal is to reveal patterns within the data.

Let’s imagine that you have a database with millions of rows representing your customers’ orders.

You might want to use the K-means algorithm to gather your customers into different groups based on key characteristics.

K-means is easy to implement; you just need to specify how many clusters you want.

Here you should look at two measures.

First, relevant clusters are compact: observations within the same cluster should be similar on the key features. Second, relevant clusters are well separated from each other: observations that belong to different clusters should be dissimilar.

The compactness is measured with the within-cluster sum of squares (WSS).

Your goal here is to minimize the WSS.

The smaller the WSS, the more compact your clusters are.

On the other hand, you want to make sure that different observations belong to different clusters.

The measure of this separation is called the between-cluster sum of squares (BSS).

Here you want to maximize the BSS.
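As a sketch of these two measures (using made-up toy data and Python/NumPy, neither of which the article specifies):

```python
import numpy as np

# Toy 2-D data with two obvious groups (illustrative values, not from the article)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

def wss(X, labels):
    """Within-cluster sum of squares: squared distance of every point
    to its own cluster's centroid, summed over all points."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

def bss(X, labels):
    """Between-cluster sum of squares: squared distance of each cluster's
    centroid to the overall mean, weighted by cluster size."""
    overall = X.mean(axis=0)
    return sum(len(X[labels == k]) *
               ((X[labels == k].mean(axis=0) - overall) ** 2).sum()
               for k in np.unique(labels))

# Handy identity: WSS + BSS equals the total sum of squares of the data,
# so for a fixed dataset, lowering WSS and raising BSS are two views of
# the same objective.
tss = ((X - X.mean(axis=0)) ** 2).sum()
```

The identity in the last comment is why the two goals are usually discussed together: any compactness you gain shows up as separation, and vice versa.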

Here is the trade-off: the more clusters you draw, the better your separation measure (BSS) will be.

If you have one million customers and you lay out one million clusters, then your measure of separation will be at its maximum.

But your clustering will be useless: each “cluster” is a single customer, so your measure of compactness (WSS) drops to zero trivially and tells you nothing.

So, to optimize your clustering algorithm you need to minimize the WSS and maximize the BSS.

You start to understand the trade off.

Here is a practical way to resolve it: plot this trade-off on a simple graph (for example, WSS against the number of clusters) and take your decision from there.

The more clusters you have, the more compact each one is, but the less each cluster summarizes.

The fewer clusters you have, the worse your separation measure will be.
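This graph-based decision is often called the elbow method. A minimal sketch, assuming scikit-learn is available and using synthetic data (the article does not specify a library or dataset):

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Synthetic data with three groups (illustrative, not from the article)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# WSS (scikit-learn calls it inertia_) for each candidate number of clusters
wss_by_k = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)}

# Plot k against WSS and look for the "elbow": WSS always shrinks as k grows,
# but the drop flattens sharply past the true number of clusters (here, 3).
```

Plotting `wss_by_k` makes the trade-off visible: the curve falls steeply up to the elbow and nearly flattens after it.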

You can then summarize your results using the cluster’s centroid.

The cluster’s centroid is the middle of a cluster.

It is the average of all the points in that cluster.

Therefore, it is useful to compare your different clusters.
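A small sketch of that comparison, with hypothetical customer features (the numbers are made up for illustration):

```python
import numpy as np

# Two hypothetical clusters of customers; columns = (yearly spend, orders/year)
cluster_a = np.array([[520., 3.], [480., 4.], [505., 2.]])
cluster_b = np.array([[90., 30.], [110., 35.]])

# A centroid is just the per-feature mean of a cluster's points; comparing
# centroids gives a compact profile of how the clusters differ.
centroid_a = cluster_a.mean(axis=0)  # high spend, few orders
centroid_b = cluster_b.mean(axis=0)  # low spend, many orders
```

Reading the centroids side by side is often enough to label the clusters in business terms (here, roughly “big spenders” vs. “frequent small buyers”).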

To conclude, the main limits of the K-means algorithm come from the assumptions it makes: Does all your data have the same variance? Do the clusters share a similar, spherical distribution? Of course not.

Some signals are left behind and some noise is selected.

The more dimensions you have, the more severely these limits will manifest themselves in your model.

In addition, your K-means algorithm results depend on where the centroid starts.

As the centroids start in different places, they might also end up in different centres.

That means the same K-means algorithm, re-run, might give a different answer! To solve this issue, you should run it several times and keep the best result.
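One way to sketch this restart strategy, again assuming scikit-learn and synthetic data: force each run to use a single random start, then keep the run with the lowest WSS.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(40, 2))
               for c in [(0, 0), (6, 0), (3, 5)]])

# Each run uses a single random start (n_init=1), so different seeds can
# converge to different local optima. Keep the run with the lowest WSS.
runs = [KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
        for seed in range(10)]
best = min(runs, key=lambda km: km.inertia_)
```

In practice scikit-learn's `n_init` parameter does exactly this internally, so a single `KMeans(n_clusters=3, n_init=10)` call achieves the same effect.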

Also, be careful of your scale.

Most of the time your features have to be transformed before running your clustering algorithm.
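For example, standardization is a common transformation; the feature values below are hypothetical:

```python
import numpy as np

# Hypothetical customer features on wildly different scales:
# yearly spend in dollars vs. number of orders.
X = np.array([[52000., 3.], [48000., 40.], [1500., 4.], [2000., 38.]])

# Left unscaled, Euclidean distances are dominated by the dollar column,
# so K-means would cluster on spend alone. Standardizing first gives every
# feature zero mean and unit variance, so each feature counts equally.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, a customer's order count influences the distance computation as much as their spend does.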

## Hierarchical clustering

If the limitations of the K-means algorithm are too important for the specific issue at hand, you can try hierarchical clustering.

With a good linkage method, it provides a more in-depth analysis.

However, this algorithm comes with high computational cost and can never undo merges.

To run this algorithm correctly, you need to ask two questions that differ from the K-means ones.

What to cluster first? And which clusters to merge? You can take either a bottom-up or a top-down approach.

The more commonly used is bottom-up (also called agglomerative): it starts from each of your data points and builds a hierarchy of clusters.

It works by calculating the distances between all your data points and grouping the closest together.

With the use of a dendrogram you are able to see when and what gets merged; so you can choose where to stop the merging of clusters and keep the most relevant clusters.

At one extreme the dendrogram shows one observation per cluster, at the other extreme there is only one big cluster that is covering all your data points.
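A minimal sketch of the bottom-up approach, assuming SciPy is available and using synthetic data (the article's own worked example is in R):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage  # assumes SciPy

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(15, 2))
               for c in [(0, 0), (4, 4)]])

# Bottom-up (agglomerative) clustering: start from single points and
# repeatedly merge the closest clusters. "ward" is one common linkage method.
Z = linkage(X, method="ward")

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree; instead
# of reading it, you can also cut the hierarchy at a chosen cluster count:
labels = fcluster(Z, t=2, criterion="maxclust")
```

The linkage matrix `Z` records every merge and its distance, which is exactly what the dendrogram visualizes, so choosing where to stop merging amounts to choosing a height at which to cut the tree.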

Find an example done in R here.