Importance of Distance Metrics in Machine Learning Modelling

Well yes, we just saw this formula above in this article while discussing “Pythagorean Theorem”.

The Euclidean distance formula can be used to calculate the distance between two data points in a plane.
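
As a quick illustration, the Euclidean distance can be computed with NumPy as in the minimal sketch below. The helper name `euclidean_distance` and the two sample points are made up for this example and are not part of the original article.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square the coordinate differences, sum them, take the square root.
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two illustrative points in a plane (hypothetical values).
p = [1.0, 2.0]
q = [4.0, 6.0]
d = euclidean_distance(p, q)  # sqrt(3^2 + 4^2) = 5.0
print(d)
```

The same function works for points with more than two coordinates, since the sum simply runs over every dimension.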

Cosine Distance:

The cosine distance metric is mostly used to find similarities between different documents.

With the cosine metric we measure the angle between two documents/vectors (the term frequencies of the different documents collected as vectors).

This metric is used when the magnitude of the vectors does not matter, only their orientation.

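The cosine similarity formula can be derived from the equation of the dot product:

$$A \cdot B = \lVert A \rVert \, \lVert B \rVert \cos\theta \quad\Longrightarrow\quad \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

Now, you must be wondering which values of the cosine will be helpful in finding similarities.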

Now that we have the values that will be considered in order to measure similarity, we need to know what 1, 0 and -1 signify.

Cosine ranges from 1 for vectors pointing in the same direction (i.e. the documents/data points are similar), through 0 for orthogonal vectors (i.e. unrelated), to -1 for vectors pointing in opposite directions.
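
The three reference values can be checked with a small NumPy sketch. The helper `cosine_sim` and the sample vectors are hypothetical and chosen only to show the same-direction, orthogonal and opposite cases.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_sim([1, 2], [2, 4])        # same direction, different magnitude
orthogonal = cosine_sim([1, 0], [0, 3])  # perpendicular vectors
opposite = cosine_sim([1, 2], [-1, -2])  # opposite direction

print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the first pair differs in magnitude but still scores (approximately) 1, which is why the metric suits cases where only orientation matters.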

There is more to how cosine similarity is further used to find matching documents.

Mahalanobis Distance:

Mahalanobis distance is used for calculating the distance between two data points in a multivariate space.

According to the Wikipedia definition, the Mahalanobis distance is a measure of the distance between a point P and a distribution D.

The idea is to measure how many standard deviations away P is from the mean of D.

The benefit of using Mahalanobis distance is that it takes covariance into account, which helps in measuring the strength/similarity between two different data objects.

The distance between an observation $x$ and the mean $\mu$ can be calculated as below:

$$D_M(x) = \sqrt{(x - \mu)^{T} S^{-1} (x - \mu)}$$

Here, S is the covariance matrix.

We use the inverse of the covariance matrix here to get a variance-normalized distance equation.
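
A minimal sketch of this calculation with NumPy follows. The function name `mahalanobis_distance`, the random seed and the synthetic correlated sample are assumptions made for illustration; the covariance matrix S is estimated from the sample itself.

```python
import numpy as np

def mahalanobis_distance(x, data):
    """Distance of observation x from the mean of `data` (rows are observations)."""
    data = np.asarray(data, dtype=float)
    mu = data.mean(axis=0)               # mean of the distribution
    S = np.cov(data, rowvar=False)       # covariance matrix estimated from the sample
    S_inv = np.linalg.inv(S)             # inverse covariance normalizes by variance
    diff = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(diff @ S_inv @ diff))

# Synthetic 2-D sample with strongly correlated features (hypothetical data).
rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=500)

d_along = mahalanobis_distance([2, 2], data)     # follows the correlation
d_against = mahalanobis_distance([2, -2], data)  # goes against the correlation
print(d_along, d_against)
```

Both points are equally far from the origin in Euclidean terms, yet the second one should come out farther in Mahalanobis terms because it lies against the direction in which the data varies together.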

Now that we have a basic idea about different distance metrics, we can move to the next step, i.e. machine learning techniques/modelling that use these distance metrics.

Machine Learning Modelling and distance metrics

In this section, we will be working on some basic classification and clustering use cases.

This will help us in understanding the usage of distance metrics in machine learning modelling.

We will start with a quick introduction to supervised and unsupervised algorithms and slowly move on to the examples.

1. Classification

K-Nearest Neighbors (KNN)

KNN is a non-probabilistic supervised learning algorithm, i.e. it doesn’t produce the probability of membership of any data point; rather, KNN classifies the data by hard assignment, e.g. the data point will belong to either 0 or 1.

Now, you must be wondering how KNN works if there is no probability equation involved.

KNN uses distance metrics in order to find similarities or dissimilarities.

Let’s take the iris dataset, which has three classes, and see how KNN will identify the classes for test data.

In the #2 image above the black square is a test data point.

Now, we need to find which class this test data point belongs to, with the help of the KNN algorithm.

We will now prepare the dataset to create a machine learning model to predict the class for our test data.

    #Import required libraries
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    #Load the dataset
    url = "https://raw.githubusercontent.com/SharmaNatasha/Machine-Learning-using-Python/master/Datasets/IRIS.csv"
    df = pd.read_csv(url)
    df.head(5)

    #Separate data and label
    x = df.iloc[:,1:4]
    y = df.iloc[:,4]

    #Prepare data for classification process
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

In the KNN classification algorithm, we use a user-defined constant “K”.

K is the number of nearest neighbours of a test data point, and these nearest data points will then be used to decide the class of the test data point.

Are you wondering how we would find the nearest neighbours?

Well, that’s where the distance metric comes into the picture.

First, we calculate the distance between each train and test data point and then select the top nearest according to the value of k.

We won’t be creating KNN from scratch but will be using the scikit-learn KNN classifier.

    #Create a model
    KNN_Classifier = KNeighborsClassifier(n_neighbors = 6, p = 2, metric='minkowski')

You can see in the above code that we are using the Minkowski distance metric with p set to 2, i.e. the KNN classifier is going to use the Euclidean distance metric formula.
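
To see why p = 2 corresponds to Euclidean distance, here is a small NumPy sketch of the Minkowski distance. The helper `minkowski_distance` and the sample points are hypothetical; with p = 1 the same formula gives the Manhattan distance.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the p-th powers of the absolute differences, then take the p-th root.
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [1.0, 2.0], [4.0, 6.0]
manhattan = minkowski_distance(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(a, b, p=2)  # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```

Changing `p` in `KNeighborsClassifier` switches between these variants without any other change to the model code.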

As we move forward with machine learning modelling we can now train our model and start predicting the class for test data.

    #Train the model
    KNN_Classifier.fit(x_train, y_train)

    #Let's predict the classes for test data
    pred_test = KNN_Classifier.predict(x_test)

Once the top nearest neighbours are selected, we check the most voted class among the neighbours.

From the above image, can you guess the class for the test point?

It’s class 1, as it is the most voted class.

Through this small example we saw how important the distance metric is for the KNN classifier.

It helped us to get the closest train data points for which classes were known.
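
Although the article relies on scikit-learn, the neighbour search and majority vote described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not how scikit-learn implements KNN internally; the function `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest neighbours."""
    x_train = np.asarray(x_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Euclidean distance from the query to every training point.
    distances = np.sqrt(np.sum((x_train - x_query) ** 2, axis=1))
    # Indices of the k training points with the smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes (hypothetical values).
x_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_train = [0, 0, 0, 1, 1, 1]

label = knn_predict(x_train, y_train, [1.1, 1.0], k=3)
print(label)
```

Swapping the distance line for another metric is the only change needed to try a different notion of "nearest".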

It is possible that using a different distance metric would give better results.

So, in a non-probabilistic algorithm like KNN, distance metrics play an important role.
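
One way to explore this is to train the same classifier with several metrics and compare the test accuracy. The sketch below makes a few assumptions that differ from the article's setup: it loads scikit-learn's bundled iris data with `load_iris` instead of the CSV file, uses all four features, and picks three metric names supported by `KNeighborsClassifier` purely as examples.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled iris data (assumption: used here in place of the article's CSV).
X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    # Same K as in the article, only the distance metric changes.
    model = KNeighborsClassifier(n_neighbors=6, metric=metric)
    model.fit(x_train, y_train)
    scores[metric] = accuracy_score(y_test, model.predict(x_test))

for metric, score in scores.items():
    print(metric, round(score, 3))
```

On a small, clean dataset such as iris the differences may be minor; the comparison tends to matter more when features have very different scales or distributions.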

2. Clustering

K-means

In classification algorithms, probabilistic or non-probabilistic, we are provided with labeled data, so it gets easier to predict the classes.

In clustering algorithms, however, we have no information about which data point belongs to which class.

Distance metrics are an important part of this kind of algorithm.

In K-means, we select the number of centroids, which defines the number of clusters.

Each data point will then be assigned to its nearest centroid using a distance metric (Euclidean).

We will be using the iris data to understand the underlying process of K-means.

In image #1 above, as you can see, we randomly placed the centroids, and in image #2 we used the distance metric to find the closest centroid for each data point.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    #Load the dataset
    url = "https://raw.githubusercontent.com/SharmaNatasha/Machine-Learning-using-Python/master/Datasets/IRIS.csv"
    df = pd.read_csv(url)
    df.head(5)

    #Separate data and label
    x = df.iloc[:,1:4].values

    #Creating the kmeans classifier
    KMeans_Cluster = KMeans(n_clusters = 3)
    y_class = KMeans_Cluster.fit_predict(x)

We need to keep repeating the centroid assignment until we have a clear cluster structure.

As we saw in the above example, without any knowledge of the labels, K-means clustered the data into 3 classes with the help of a distance metric.
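
The repeated assign-and-update loop that K-means performs can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions (random initial centroids drawn from the data, Euclidean distance, a fixed iteration cap), not the scikit-learn implementation; the function `kmeans` and the toy data are hypothetical.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Simplified K-means: assign points to the nearest centroid, then update centroids."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance of every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data with two well-separated groups (hypothetical values).
x = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centroids = kmeans(x, k=2)
print(labels)
print(centroids)
```

The cluster numbering itself is arbitrary: which group ends up as 0 or 1 depends on the initial centroids, only the grouping is meaningful.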

3. Natural Language Processing

Information Retrieval

Unlike classification and clustering, in information retrieval we work with unstructured data.

The data can be an article, a website, emails, text messages, a social media post, etc.

With the help of techniques used in NLP, we can create vector data in a manner that can be used to retrieve information when queried.

Once the unstructured data is transformed into vector form, we can use the cosine similarity metric to filter out the irrelevant documents from the corpus.

Let’s take an example and understand the usage of cosine similarity.

1. Create vector form for corpus and query -

    import math
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as pyplot
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    corpus = [
        'the brown fox jumped over the brown dog',
        'the quick brown fox',
        'the brown brown dog',
        'the fox ate the dog'
    ]
    query = ["brown"]
    X = vectorizer.fit_transform(corpus)
    Y = vectorizer.transform(query)

2. Check the similarities, i.e. find which documents in the corpus are relevant to our query -

    cosine_similarity(Y, X.toarray())

Results:

    array([[0.54267123, 0.44181486, 0.84003859, 0.        ]])

As you can see from the above example, we queried for the word “brown”, and in the corpus there are only three documents that contain the word “brown”.

When checked with the cosine similarity metric, it gave the same result, with values > 0 for three documents and 0 for the fourth one.
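
To turn these scores into a retrieval result, the documents can be sorted by similarity and the zero-scoring ones dropped. The sketch below reuses the corpus and query from the example; the variable names `scores`, `ranking` and `relevant` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'the brown fox jumped over the brown dog',
    'the quick brown fox',
    'the brown brown dog',
    'the fox ate the dog',
]
query = ["brown"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # TF-IDF vectors for the corpus
Y = vectorizer.transform(query)       # query vector in the same vocabulary

# One similarity score per document, flattened to a 1-D array.
scores = cosine_similarity(Y, X).ravel()
# Document indices ordered from most to least similar.
ranking = np.argsort(-scores)
# Keep only documents that share at least one term with the query.
relevant = [corpus[i] for i in ranking if scores[i] > 0]

for doc in relevant:
    print(doc)
```

With the scores reported above, the document 'the brown brown dog' ranks first, since "brown" carries the largest share of its TF-IDF weight.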

Conclusion

Throughout this article, we got to know a few popular distance/similarity metrics and how these can be used to solve complicated machine learning problems.

Hope this will be helpful for people who are in their first stage of getting into Machine Learning/Data Science.

References

Cosine Similarity - Sklearn, TDS article, Wikipedia, Example

Github code

Distance Metrics - Math.net, Wiki

Minkowski Distance Metric - Wiki, Blog, Famous Metrics