For this purpose we use the loss metric:Here (1) represents the base weights (all 1's), and ρ represents the resulting fuzzy partition matrix that is a product of the weights used in the euclidean distance function between points p and q.We can then attempt to use Gradient Descent on this loss function to try and minimize it with respect to the similarity matrix..Gradient Descent is one of the most common optimization algorithms in machine learning that is used to find best parameters of a given function by using the function gradient, a combination of the partial derivatives..By taking steps proportional to the negative of the gradient, we can try to find the local minimum of the function..We will continually update the weights until either our maximum number of iterations has been met, or the function converges..So the gradient descent will be of our loss function with a partial derivative in respect to the weights.Where n is the learning rate defined..n is a very important parameter, as something too small will require too much computation, while too big and the function may never converge.If you can think of it in terms of a 3D graph, it would be like stretching or shrinking each axis, in a way that would put our points into tighter groups, that are further away from each other..We are not actually changing the locations of the data, we are solely transforming how we measure the distances that drive our similarity metrics.Here is a created example where I introduce 3 clusters with separate centroids on the first two variables, but introduce a third noise variable that would make the clustering more difficult..These are colored by the actual cluster labels given when the data is created..When eliminating the third noise variable, we can see it would be much easier to identify clusters.Although it mostly is difficult to see the differences because of the 3D perspective, you can see how much more defined the clusters are with the learned feature weights..By stretching out the main feature that can easily separate them, it was able to better identify clusters.Measuring ImprovementA good representation of its effectiveness is fuzzy c-means, a relative of the commonly used k-means algorithm..It works in a very similar fashion to k-means, but rather results in something called the fuzzy partition matrix instead of just a cluster label.The fuzzy partition matrix is a set of weights that measure how similar a single point is to a given cluster center, close to how our similarity matrix is used previously..It can also be calculated using a weighted distance metric which we can feed our new found optimal weights.. More details