You were using Euclidean distance.
It is the square root of the sum of squared differences between corresponding elements of the two vectors.
The formula of Euclidean distance is as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
where x and y are the two vectors and n is their number of elements.
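To make the definition concrete, here is a minimal sketch in Python with NumPy. The function name `euclidean_distance` is our own choice for illustration, not something taken from a library.

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the sum of squared element-wise differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Classic 3-4-5 right triangle: the distance between (0, 0) and (3, 4) is 5.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

NumPy also ships `np.linalg.norm(x - y)`, which computes the same value for two vectors.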
K-means typically uses Euclidean distance to determine how similar (or dissimilar) two points are.
The procedure of K-means is simple and straightforward.
Before we fit the model, we need to decide how many clusters we want.
A different number of clusters can lead to completely different results.
There are several direct methods for choosing this number.
Among them are the elbow and silhouette methods.
We are going to use the elbow method.
Remember that clustering aims to define clusters whose points are more similar to each other than to points in other clusters.
For this, we’ll consider the total intra-cluster variation, also called the total within-cluster sum of squares (WSS).
And we’ll try to minimize it.
The elbow method looks at how the total WSS varies with the number of clusters.
For that, we’ll compute k-means for a range of different values of k.
For each k, we calculate the total WSS.
We plot the curve of WSS vs. number of clusters.
Finally, we locate the elbow or bend of the plot.
This point is considered to be the appropriate number of clusters.
Why? Because past that point, adding more clusters barely improves WSS, while our ability to group similar points together decreases.
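The elbow procedure can be sketched with scikit-learn, where the total WSS of a fitted model is exposed as `inertia_`. The data below is a synthetic stand-in with three blobs (an assumption for illustration, not the customer dataset), so the bend is expected at k = 3 rather than at the 5 we get for the real data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# Fit k-means for k = 1..8 and record the total WSS (inertia_ in scikit-learn).
ks = range(1, 9)
wss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(model.inertia_)

for k, value in zip(ks, wss):
    print(f"k={k}: WSS={value:.1f}")
```

Plotting `wss` against `ks` (for example with matplotlib) gives the elbow curve. On this toy data, WSS should drop sharply up to k = 3 and flatten afterwards.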
Looking at the plot, we’ll select 5 clusters.
We forgot to mention! The K in K-means refers to the number of clusters.
In our case, K = 5.
How does k-means clustering work? The main idea is to select k centers, one for each cluster.
There are several ways to initialize those centers.
We can do it randomly, pass certain points that we believe are the centers, or place them in a smart way (e.g., as far away from each other as possible).
Is it important where we put the centroids the first time? Yes, very important.
Different initial centroids can lead to different results.
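One way to place the centers "as far away from each other as possible" is farthest-first traversal: start from a random point, then repeatedly pick the point farthest from all centers chosen so far. This is only one illustrative strategy, and the helper `farthest_first_centers` is a hypothetical name of ours. scikit-learn's default, `init="k-means++"`, is a randomized relative of this idea.

```python
import numpy as np

def farthest_first_centers(X, k, seed=0):
    """Pick k rows of X as initial centers, each as far as possible from those already chosen."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]  # first center: a random data point
    # Distance from every point to its nearest chosen center so far.
    min_dist = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(1, k):
        next_idx = int(np.argmax(min_dist))  # point farthest from all chosen centers
        centers.append(X[next_idx])
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[next_idx], axis=1))
    return np.array(centers)

# Toy data (hypothetical): two pairs of nearby points and one isolated point.
X = np.array([[0, 0], [0, 1], [10, 0], [10, 1], [5, 8]], dtype=float)
centers = farthest_first_centers(X, k=3)
print(centers)
```

Changing `seed` changes the first center and can change the whole selection, which is exactly why different initializations can end in different clusterings.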
Then, we calculate the Euclidean distance between each point and the cluster centers.
We assign each point to the cluster center at the minimum distance.
After that, we recalculate each cluster center.
The new center is the mean of all the points assigned to that cluster, that is, the point in the middle of the cluster.
And we start again: calculate distances, assign to clusters, calculate new centers.
When do we stop? When the centers do not move anymore.
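The loop described above can be sketched from scratch with NumPy. This is a minimal illustration under simple assumptions (random initialization from the data points, a hypothetical `kmeans` helper, toy data), not the code from the repository.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: assign points to the nearest center, then move each center to its cluster mean."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Random initialization: k distinct data points serve as the first centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Euclidean distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # New center = mean of its assigned points (an empty cluster keeps its old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return centers, labels

# Toy data (hypothetical): two tight groups of three points each.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])
centers, labels = kmeans(X, k=2)
print(labels)
print(centers)
```

The label numbers themselves are arbitrary and depend on the initialization. What matters is which points end up sharing a label.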
Once the model is implemented in Python, we can check what our clusters look like.
Coming back to our initial situation: we wanted to know our customers, so that we can offer them suitable products.
The center point of each cluster matches the average customer of that segment.
Male or Female does not seem to have any influence (the values are around 0; remember that this was a binary variable).
The most important features appear to be Annual Income and Spending score.
We have people whose income is low and who spend in the same range (segment 0).
People whose earnings are high and who spend a lot (segment 1).
Customers whose income is in the middle range and who spend at the same level (segment 2).
Then we have customers whose income is very high but they have the most spending (segment 4).
And last, people whose earnings are low but who spend a lot (segment 5).
Imagine that tomorrow we have a new member.
And we want to know which segment that person belongs to.
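Assigning a new point comes down to finding the closest cluster center. Here is a minimal sketch with scikit-learn's `KMeans.predict`. The customers, the two features, and the choice of two clusters are hypothetical values for illustration, not the article's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual income, spending score].
customers = np.array([
    [15, 20], [18, 15], [20, 25],   # low income, low spending
    [85, 90], [90, 85], [95, 95],   # high income, high spending
], dtype=float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)

# A new member is assigned to the segment whose center is closest.
new_member = np.array([[88.0, 80.0]])
segment = model.predict(new_member)[0]
print("segment:", segment)
print("segment center:", model.cluster_centers_[segment])
```

If the features were scaled before fitting, the new member's values need the same transformation before calling `predict`.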
No problem! We can predict this with the fitted model.
Now you are ready to launch your marketing campaign!
→ Check my GitHub repository to see the complete code.