Dimension ReductionMulti-Dimension Scalingin PythonDiogo RibeiroBlockedUnblockFollowFollowingMay 29Photo by Lukasz Szmigiel on UnsplashMulti-Dimension Scaling is a distance-preserving manifold learning method.
All manifold learning algorithms assume the dataset lies on a smooth, non linear manifold of low dimension and that a mapping f: RD -> Rd (D>>d) can be found by preserving one or more properties of the higher dimension space.
Distance preserving methods assume that a manifold can be defined by the pairwise distances of its points.
In distance preserving methods, a low dimensional embedding is obtained from the higher dimension in such a way that pairwise distances between the points remain the same.
Some distance preserving methods preserve spatial distances (MDS) while some preserve graph distances.
MDS is not a single method but a family of methods.
MDS takes a dissimilarity matrix D where Dij represents the dissimilarity between points i and j and produces a mapping on a lower dimension, preserving the dissimilarities as closely as possible.
The dissimilarity matrix could be observed or calculated from the given dataset.
MDS has been widely popular and developed in the field of human sciences like sociology, anthropology and especially in psychometrics.
Let’s understand it better with an example from the MDS book.
The table below represents correlations between rates of different types of crimes from the US in 1970 and the image shows it MDS embedding.
As the number of variables increases, it becomes harder for a human mind to identify the relationships among the variables.
The relative position of points on the plot depends on the dissimilarity between them in the table of correlations, i.
crime rates which share high correlation are mapped close to each other, while crime rates which do not share high correlation are mapped farther apart.
From the figure we can see that along horizontal dimension crime distribution can be interpreted as “violent crime vs property crime” whereas on vertical dimension distribution of crime can be interpreted as the “street crime vs hidden crime”.
MDS can be divided into two categories:Metric MDS — Metric MDS is used for quantitative data and tries to preserve the original dissimilarity metrics.
Given a dissimilarity matrix D , a monotone function f, and p(number of dimension in subspace) metric MDS tries to find an optimal configuration X ⊂ Rp s.
f(Dij) ≈ d ij = (xi − xj)2 .
Another version of metric MDS is classical MDS(original MDS) formulation which provides closed form solution.
Intead of trying to approximate the dissimilarity metrics in lower dimension, it uses eigenvalue decomposition for in the solution.
Non-Metric MDS — Non-metric MDS is used for ordinal data.
It tries to keep the order of dissimilarity metrics intact.
For example, if Pij is dissimilarity between ith & jth and P32 > P89, then non-metric mds creates a mapping s.
We will implement metric MDS using SMACOF( scaling by majorization of complicated function) algorithm.
Before diving into the implementation of metric MDS, we need to learn a bit about MM( Majorization- Minorization) algorithm for optimization.
MM for finding an optimum of a functionMM algorithm is an iterative algorithm for finding optimum of a complex function.
Suppose we have a function f(x) for which we need to find a minimum.
Instead of directly optimizing f(x) MM uses an approximating function g(x,xm) to find an optimum.
If problem is to find a minimum of f(x) then, g(x,xm) is called majorizing function else minorizing function and xm is called support point.
If g(x,xm) is majorizing function of f(x) then, it has to satisfy following conditions:Optimizing g(x,xm) should be easier than f(x) .
For any x , f(x) ≤ g(x,xm)f(xm)=g(xm,xm)Steps of MM algorithm:choose a random support point xmfind xmin = argminx g(x,xm)if f(xmin)−f(xm)≈ e break (where e is a very small number) else go to step 4set xm=xmin and go to step 2Let’s understand the MM algorithm with an example.
The plot in green is the function for which we need to find the minimum.
Each image represents an iteration of the algorithm with the first image as the initial set up.
Our initial pivot point is xm=9.
00 , the minimum of g(x,9) is at xmin=6.
Now if we move to the next image, we see that xm is now 6.
49 and we have a new xmin=6.
20 based on g(x,6.
If we move to next iteration xm becomes 6.
20 and xmin value changes to 5.
Continuing to subsequent iterations, we see the minima moves towards the minimum of the green plot(as shown in the image below).
The most important part of MM algorithm is to find a good approximating function.
Now, let’s move to SMACOF algorithm for metric MDS.
As it was said earlier that metric MDS tries to approximate the dissimilarity matrix and minimizes the stress function which is given byσ(X)=Σij wij(δij−dij(X))2 , wherewij is weight assigned to dissimilarity between i and jδij is element of the given dissimilarity matrixdij(X) is the dissimilarity from the X which we need to findWe aren’t going to delve into the derivation of the majorizing function for stress function.
Steps of MDSCreate a dissimilarity matrix.
choose a random point Xm as a pivot.
Find the minimum of stress majorizing functions.
if σ(Xm) − σ(Xmin) < e break else set Xm=Xmin and go to step 2Step One — Dissimilarity matrix:We need a distance metric to calculate the distance between two points in the dataset.
Let’s go through a few popular distance metrics quickly.
Euclidean Metric: This is the most popular metric used in distance measurement.
It is equal to the straight line distance between two points.
Manhattan Metric: The distance between two points is measured along right angles to the axes.
It is also known as rectilinear distance, taxi-cab metric or city block distance.
Cosine Metric: The cosine distance measures the cosine of the angle between two vectors.
Smaller the value of cosine closer will the points.
Mahalanobis Metric: Mahalanobis metric is used to measure a point p’s distance to a distribution D.
It is particularly useful in detecting outliers.
Hamming Metric: Hamming metric counts the number of places at which entries in two vectors differ.
It is an important metric in information theory.
We will be using Euclidean metric as a dissimilarity measure.
from sklearn import datasetsimport math as maimport numpy as npfrom pyspark.
sql import types as tfrom pyspark.
sql import functions as fdigits = datasets.
load_digits(n_class=6)data = digits.
data# repartitioning the dataframe by id column will speed up the join operation df = spark.
cache()euclidean = lambda x,y:ma.
array(y))**2))data_bc = sc.
collect())# create the distance metricdef pairwise_metric1(y): dist =  for x in data_bc.
value: dist += [ma.
array(y))**2))] return(dist)udf_dist1 = f.
DoubleType()))df = df.
withColumn("D", udf_dist1("features"))Step Two: SCAMOF algorithmn,p = data.
shapedim = 2X = np.
rand(n,dim)# randomly initialize a solution for the pivot point.
dfrand = spark.
repartition("id2")df = df.
drop("id1")def pairwise_metric2(y): dist =  for x in X_bc.
value: dist += [ma.
array(y))**2))] return(dist)# create the matrix Bdef B(id,x,y): y,x = np.
0] = np.
inf z = -x/y z[id] = -(np.
tolist())# function for matrix multiplication using outer multiplicationdef df_mult(df, col1, col2, n1, n2, matrix=True): udf_mult = f.
DoubleType())) df = df.
withColumn("mult", udf_mult(col1, col2)) df = df.
col("mult")[i]) for i in range(n1*n2)])).
toDF("mult") if not matrix: return(df) st = t.
DoubleType()))])) udf_arange = (f.
tolist()) for i,j in enumerate(np.
reshape(n1,n2)/n1)], st)) df = (df.
alias("mult"))) df = (df.
repartition("id2")) return(df)udf_B = f.
DoubleType()))udf_sigma = (f.
udf(lambda x,y: float(np.
DoubleType()))sigma_old = np.
inftol = 1e-4max_iter = 1000for i in range(max_iter): X_bc = sc.
collect()) def pairwise_metric2(y): dist =  for x in X_bc.
value: dist += [ma.
array(y))**2))] return(dist) udf_dist2 = f.
DoubleType())) df = df.
withColumn("di", udf_dist2("X")) df = df.
withColumn("sigma", udf_sigma("D","di")) sigma_new = df.
collect() print(sigma_old, sigma_new) sigma_old = sigma_new df = df.
drop("di") X_min = df_mult(df, "B", "X", n, dim) df = df.
select("id", "D", f.
alias("X")) # cache action will prevent recreation of dataframe from base df.
cache()The plot of MDS embedding.
Though the clusters are not clearly distinct they can be identified easily.
Drawbacks of MDS:MDS requires large computing power for calculating the dissimilarity matrix at every iteration.
It is hard to embed the new data in MDS.
Conclusion: Like PCA, MDS is an old method.
It has been well researched.
It has a few extensions like Sammon mapping.
With this post, we tried to develop an understanding of MDS and its working.
We brushed up a few areas related to MDS and implemented a basic MDS in pyspark.
A Jupyter Notebook with math and code (python and pyspark) is available on github.