# The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need

We will be working on the big mart sales dataset that you can download here.

I encourage you to read more about the dataset and the problem statement here.

This will help you visualize what we are working on (and why we are doing this).

Two pretty important questions in any data science project.

First, import all the required libraries: View the code on Gist.

Now, we will read the CSV file and look at the first five rows of the data: View the code on Gist.

For this article, we will be taking only two variables from the data – “LoanAmount” and “ApplicantIncome”.

This will make it easy to visualize the steps as well.

Let’s pick these two variables and visualize the data points: View the code on Gist.

Steps 1 and 2 of K-Means were about choosing the number of clusters (k) and selecting random centroids for each cluster.

We will pick 3 clusters and then select random observations from the data as the centroids: View the code on Gist.

Here, the red dots represent the 3 centroids for each cluster.

Note that we have chosen these points randomly and hence every time you run this code, you might get different centroids.

Next, we will define some conditions to implement the K-Means Clustering algorithm.

Let’s first look at the code: View the code on Gist.

These values might vary every time we run this.

Here, we are stopping the training when the centroids are not changing after two iterations.

We have initially defined the diff as 1 and inside the while loop, we are calculating this diff as the difference between the centroids in the previous iteration and the current iteration.

When this difference is 0, we are stopping the training.

Let’s now visualize the clusters we have got: View the code on Gist.

Awesome!.Here, we can clearly visualize three clusters.

The red dots represent the centroid of each cluster.

I hope you now have a clear understanding of how K-Means work.

Here is a LIVE CODING window for you to play around with the code and see the results for yourself – without leaving this article!.Go ahead and start working on it: However, there are certain situations where this algorithm might not perform as well.

Let’s look at some challenges which you can face while working with k-means.

Challenges with the K-Means Clustering Algorithm One of the common challenges we face while working with K-Means is that the size of clusters is different.

Let’s say we have the below points: The left and the rightmost clusters are of smaller size compared to the central cluster.

Now, if we apply k-means clustering on these points, the results will be something like this: Another challenge with k-means is when the densities of the original points are different.

Let’s say these are the original points: Here, the points in the red cluster are spread out whereas the points in the remaining clusters are closely packed together.

Now, if we apply k-means on these points, we will get clusters like this: We can see that the compact points have been assigned to a single cluster.

Whereas the points that are spread loosely but were in the same cluster, have been assigned to different clusters.

Not ideal so what can we do about this?.One of the solutions is to use a higher number of clusters.

So, in all the above scenarios, instead of using 3 clusters, we can have a bigger number.

Perhaps setting k=10 might lead to more meaningful clusters.

Remember how we randomly initialize the centroids in k-means clustering?.Well, this is also potentially problematic because we might get different clusters every time.

So, to solve this problem of random initialization, there is an algorithm called K-Means++ that can be used to choose the initial values, or the initial cluster centroids, for K-Means.

K-Means++ to Choose Initial Cluster Centroids for K-Means Clustering In some cases, if the initialization of clusters is not appropriate, K-Means can result in arbitrarily bad clusters.

This is where K-Means++ helps.

It specifies a procedure to initialize the cluster centers before moving forward with the standard k-means clustering algorithm.

Using the K-Means++ algorithm, we optimize the step where we randomly pick the cluster centroid.

We are more likely to find a solution that is competitive to the optimal K-Means solution while using the K-Means++ initialization.

The steps to initialize the centroids using K-Means++ are: The first cluster is chosen uniformly at random from the data points that we want to cluster.

This is similar to what we do in K-Means, but instead of randomly picking all the centroids, we just pick one centroid here Next, we compute the distance (D(x)) of each data point (x) from the cluster center that has already been chosen Then, choose the new cluster center from the data points with the probability of x being proportional to (D(x))2 We then repeat steps 2 and 3 until k clusters have been chosen Let’s take an example to understand this more clearly.

Let’s say we have the following points and we want to make 3 clusters here: Now, the first step is to randomly pick a data point as a cluster centroid: Let’s say we pick the green point as the initial centroid.

Now, we will calculate the distance (D(x)) of each data point with this centroid: The next centroid will be the one whose squared distance (D(x)2) is the farthest from the current centroid: In this case, the red point will be selected as the next centroid.

Now, to select the last centroid, we will take the distance of each point from its closest centroid and the point having the largest squared distance will be selected as the next centroid: We will select the last centroid as: We can continue with the K-Means algorithm after initializing the centroids.

Using K-Means++ to initialize the centroids tends to improve the clusters.

Although it is computationally costly relative to random initialization, subsequent K-Means often converge more rapidly.

I’m sure there’s one question which you’ve been wondering about since the start of this article – how many clusters should we make?.Aka, what should be the optimum number of clusters to have while performing K-Means?.  How to Choose the Right Number of Clusters in K-Means Clustering?.One of the most common doubts everyone has while working with K-Means is selecting the right number of clusters.

So, let’s look at a technique that will help us choose the right value of clusters for the K-Means algorithm.

Let’s take the customer segmentation example which we saw earlier.

To recap, the bank wants to segment its customers based on their income and amount of debt: Here, we can have two clusters which will separate the customers as shown below: All the customers with low income are in one cluster whereas the customers with high income are in the second cluster.

We can also have 4 clusters: Here, one cluster might represent customers who have low income and low debt, other cluster is where customers have high income and high debt, and so on.

There can be 8 clusters as well: Honestly, we can have any number of clusters.

Can you guess what would be the maximum number of possible clusters?.One thing which we can do is to assign each point to a separate cluster.

Hence, in this case, the number of clusters will be equal to the number of points or observations.

So, The maximum possible number of clusters will be equal to the number of observations in the dataset.

But then how can we decide the optimum number of clusters?.One thing we can do is plot a graph, also known as an elbow curve, where the x-axis will represent the number of clusters and the y-axis will be an evaluation metric.

Let’s say inertia for now.

You can choose any other evaluation metric like the Dunn index as well: Next, we will start with a small cluster value, let’s say 2.

Train the model using 2 clusters, calculate the inertia for that model, and finally plot it in the above graph.

Let’s say we got an inertia value of around 1000: Now, we will increase the number of clusters, train the model again, and plot the inertia value.

This is the plot we get: When we changed the cluster value from 2 to 4, the inertia value reduced very sharply.

This decrease in the inertia value reduces and eventually becomes constant as we increase the number of clusters further.

So, the cluster value where this decrease in inertia value becomes constant can be chosen as the right cluster value for our data.

Here, we can choose any number of clusters between 6 and 10.

We can have 7, 8, or even 9 clusters.

You must also look at the computation cost while deciding the number of clusters.

If we increase the number of clusters, the computation cost will also increase.

So, if you do not have high computational resources, my advice is to choose a lesser number of clusters.

Let’s now implement the K-Means Clustering algorithm in Python.

We will also see how to use K-Means++ to initialize the centroids and will also plot this elbow curve to decide what should be the right number of clusters for our dataset.

Implementing K-Means Clustering in Python We will be working on a wholesale customer segmentation problem.

The data is hosted on the UCI Machine Learning repository.

The aim of this problem is to segment the clients of a wholesale distributor based on their annual spending on diverse product categories, like milk, grocery, region, etc.

So, let’s start coding!.We will first import the required libraries: View the code on Gist.

Next, let’s read the data and look at the first five rows: View the code on Gist.

We have the spending details of customers on different products like Milk, Grocery, Frozen, Detergents, etc.

Now, we have to segment the customers based on the provided details.

Before doing that, let’s pull out some statistics related to the data: View the code on Gist.

Here, we see that there is a lot of variation in the magnitude of the data.

Variables like Channel and Region have low magnitude whereas variables like Fresh, Milk, Grocery, etc.

have a higher magnitude.

Since K-Means is a distance-based algorithm, this difference of magnitude can create a problem.

So let’s first bring all the variables to the same magnitude: View the code on Gist.

The magnitude looks similar now.

Next, let’s create a kmeans function and fit it on the data: View the code on Gist.

We have initialized two clusters and pay attention – the initialization is not random here.

We have used the k-means++ initialization which generally produces better results as we have discussed in the previous section as well.

Let’s evaluate how well the formed clusters are.

To do that, we will calculate the inertia of the clusters: View the code on Gist.

Output: 2599.

38555935614 We got an inertia value of almost 2600.

Now, let’s see how we can use the elbow curve to determine the optimum number of clusters in Python.

We will first fit multiple k-means models and in each successive model, we will increase the number of clusters.

We will store the inertia value of each model and then plot it to visualize the result: View the code on Gist.

Can you tell the optimum cluster value from this plot?.Looking at the above elbow curve, we can choose any number of clusters between 5 to 8.

Let’s set the number of clusters as 6 and fit the model: View the code on Gist.

Finally, let’s look at the value count of points in each of the above-formed clusters: View the code on Gist.

So, there are 234 data points belonging to cluster 4 (index 3), then 125 points in cluster 2 (index 1), and so on.

This is how we can implement K-Means Clustering in Python.

End Notes In this article, we discussed one of the most famous clustering algorithms – K-Means.

We implemented it from scratch and looked at its step-by-step implementation.

We looked at the challenges which we might face while working with K-Means and also saw how K-Means++ can be helpful when initializing the cluster centroids.

Finally, we implemented k-means and looked at the elbow curve which helps to find the optimum number of clusters in the K-Means algorithm.

If you have any doubts or feedback, feel free to share them in the comments section below.

And make sure you check out the comprehensive ‘Applied Machine Learning‘ course that takes you from the basics of machine learning to advanced algorithms (including an entire module on deploying your machine learning models!).

You can also read this article on Analytics Vidhyas Android APP Share this:Click to share on LinkedIn (Opens in new window)Click to share on Facebook (Opens in new window)Click to share on Twitter (Opens in new window)Click to share on Pocket (Opens in new window)Click to share on Reddit (Opens in new window) Related Articles (adsbygoogle = window.