Well, there are several ways like cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm but the most popular or common criteria used is the elbow diagram.

To find ‘K’, we run the k-means algorithm for different K values and compare the results.

Although there are several metrics that can be used, one of them is the mean distance of the data points from their centers.

As this metric will always decrease with increasing K, we search for the sharp shift in the rate of decrease which is the desired value of K.

The python implementation using the scikit-learn library is as follows:Hierarchical and Density based clusteringSingle Link ClusteringIn this method, we start by clustering points that are closest to each other one by one based on the distance between them and later cluster points with existing clusters.

The criteria for clustering a point with an existing cluster or 2 existing clusters is based on the minimum distance between all possible pairs.

Single Link ClusteringNow, lets see how well single link is able to separate different types of data sets compared to k-means :k-means vs single link clusteringAs we can see, the single link clustering algorithm does a better job than k-means on the 2nd and 3rd data sets whereas k-means performs better on others although neither is able to do a very good job on the 5th data set.

Complete link, average link, ward clusteringComplete link clustering is similar to single link but it looks at the farthest distance between points while merging clusters.

Complete link clusteringWhile complete link gives more compact clusters and might be better than single link in this regard, it disregards certain points which might be a better contender to merge because of its distance criteria.

If you see here, the set of points in the right most cluster might be better suited to be merged with the cluster next to it but the largest distance criteria doesn’t allow it.

Average ClusteringAverage Clustering is again similar to single and complete link clustering but looks at the average distance between all points while merging clusters.

Average linkWard’s MethodWards method is the default method used in scikit-learn library in python for agglomerate clustering.

This tries to minimize the variance in the data sets by using a certain formula to calculate distance between clusters.

It takes the center of all points in both clusters and calculates the distance of all points from that center.

Next, it adds the square of all those distances.

Then it takes the center of the individual clusters and calculates the distance between all points and that center in each cluster and subtracts the sum of square of those distances from the formula.

Take a look :DBSCAN — Density based clusteringDBSCAN uses 2 parameters to find clusters in the data sets — cluster radius(epsilon) and minimum number of points needed to form a cluster(n).

So, we scan each point in the data and check whether it has n or more number of points around it within the radius epsilon.

And if a point meets the criteria, it becomes a core point and the points surrounding it become border points and a cluster is created and so on.

DBSCANLets look at a comparison with k-means clustering on different data sets :DBSCAN vs K-meansTake a look at the python implementation using sklearn and scipy :Gaussian Mixture Model ClusteringGMM follows the rule that each cluster is like a gaussian distribution of its own.

We all know about gaussian or normal distributions which look like a bell curve and have 68%, 95% and 99% of the data within 1, 2 and 3 standard deviations from the mean.

So if we have a data distribution in 1-D and ask scikit-learn to find 2 clusters in them, it will look somewhat like this:Here we can roughly see 2 gaussian distributions with different means and standard distributions.

We can have gaussian distribution in mutli dimentional datasets too, where there are 2 or more variable with different data distributions and GMM can be used for the same as well.

So, the GMM clustering works somewhat like this :Step-1: First we initialise the centres and standard deviations of the number of clusters we want, say 2.

This can be done either randomly or using a smarter logic like applying k-means on the dataset first and using the centers provided by k-means as initial centres for GMM.

Step-2: Next we calculate the membership of each point to each cluster by doing a soft clustering.

We use the probability density function of each cluster to determine the membership.

Take a look:Here, E[Z1A] is the membership of the 1st point to cluster A which is given by the probability density of point 1 in A divided by the sum of probability densities to cluster A and B.

Step-3: Re-estimate parameters of Gaussians Maximisation step: Now, we have to recalculate the mean and standard deviation of the clusters based on the results of the previous step.

The new mean and standard deviations are calculated using a weighted average approach as follows :Recalculating the cluster meansRecalculating the cluster standard deviationsStep-4: Evaluate Log-likelihood.

At the end of each re-clustering step, we calculate the log-likelihood of the clusters.

Here, π denotes the mixing coefficient, μ the cluster mean and σ the cluster standard deviation.

We calculate this log-likelihood after every iteration and our goal is to maximise this term.

The expectation maximisation algorithm converges when either this term is maximised or there is a negligible change in it after every iteration.

This is the point where we stop iterating and use the parameters obtained to get the GMM clusters.

Note: It is not necessary that the EM algorithm finds the global minimum every time and it depends a lot on how we initialise the clusters and the covariance criteria.

Lets look at two examples :Case -1: With random cluster initialisation and spherical covariance.

Case-2: With k-means initialisation and full covariance.

Case 1Case 2Clearly, the 2nd case is one we are more interested in and expect GMM to find for us but we need to help it find them.

In this case, even if we initialise the centres using k-means but use spherical covariance, we might not arrive where we want to.

Finally, lets look at the implementation in python:Random Projection and ICARandom projection much like PCA, is a dimensionality reduction technique where you project your dataset have d features or dimensions to a much lower k features or dimensions.

Whereas PCA projects the data onto an axis having maximum varies with respect to the data, Random projection projects the data to any random set of axis.

Random ProjectsRandom projection can be thought be as simple as a multiplication of the input data set to a random matrix like this:ICA or Independent Component Analysis on the other hand assumes that the data set is composed of different independent sources or axes and tries to separate them out.

For example, assume a scenario in which three friends go to an opera and listen to three different sources of music coming from a piano, a violin and a TV set.

Now all three of them record what they hear on their mobile phones.

Note that, all the three friends are at different distances from all three sources which is crucial point for ICA.

Independent Component AnalysisThis figure shows the recording of each individual for the different combination of the three sources and ICA tries to separate them out.

Remember, for ICA to work we need the number of samples to be equal to greater than the number of individual components.

Independent Component AnalysisLet’s take a look at the python implementation:These are some of the unsupervised learning techniques used for data which is not labelled and we want to find trends or do prediction modelling on it.

The choice of algorithm should be based on the type of data, problem statement and intuition.

Cheers !.