Understanding SVD (Singular Value Decomposition)Truly Understanding SVD — The Intuitive Core IdeaHussein AbdullatifBlockedUnblockFollowFollowingFeb 21Back in elementary mechanics, you learned that any force vector can be decomposed into its components along the x and y axes:Congratulations.
Now you know what singular value decomposition is.
For it’s disappointing that almost every tutorial of SVD makes it more complicated than necessary, while the core idea is most simple.
Since mathematics is just the art of assigning different names to the same concept, SVD is nothing more than decomposing vectors onto orthogonal axes — we just decided it may need a more deluxe name.
Let’s see how this is the case.
When the vector (a) is decomposed, we get 3 pieces of information:The directions of projection — the unit vectors (v₁ and v₂) representing the directions onto which we project (decompose).
In the above they’re the x and y axes, but can be any other orthogonal axes.
The lengths of projection (the line segments sₐ₁ and sₐ₂) — which tell us how much of the vector is contained in each direction of projection (more of vector a is leaning on the direction v₁ than it is on v₂, hence sₐ₁>sₐ₂).
The vectors of projection (pₐ₁ and pₐ₂) — which are used to reconstruct the original vector a by adding them together (as a vector sum), and for which it’s easy to verify that pₐ₁=sₐ₁*v₁ and pₐ₁=sₐ₁*v₂ — So they’re redundant, as they can be deduced from the former 2 pieces.
Critical Conclusion:Any vector can be expressed in terms of:1.
Projection directions unit vectors (v₁, v₂, …).
The lengths of projections onto them (sₐ₁, sₐ₂, …).
All what SVD does is extend this conclusion to more than one vector (or point) and to all dimensions :An example of a dataset (a point can be considered a vector through the origin).
Now it becomes a matter of knowing how to handle this mess.
How To Handle This MessWe can’t handle that mess without first handling a single vector!If you look at many generalizations done in mathematics, you’ll find they primarily utilize matrices.
So we have to find a way to express the operation of vector decomposition using matrices.
It turns out to be a natural thing to do:Same figure as before, but tilting the axes of projection to convince you they aren’t confined to x and y.
(aₓ and aᵧ are the coordinates of vector a, put into a column matrix (aka column vector), as per convention.
Same for v₁ and v₂).
We want to decompose (project) the vector a along unit vectors v₁ and v₂.
You may already know (especially if you’ve watched this) that projection is done by the dot product — it gives us the lengths of projection (sₐ₁ and sₐ₂):Projecting (a) onto v1 and v2.
But that’s redundant.
We can utilize the efficiency of matrices……to write both equations in one go, by adding an extra column for each unit vector.
We can even add more points……by adding an extra row for each point.
S is the matrix containing the lengths of projections.
Here’s how it looks like, after adding that point b:It’s now easy to generalize to any number of points and dimensions:n = no.
of points, d = no.
of dimensions, A = matrix containing points, V = matrix containing the decomposition axes, S = matrix containing lengths of projection.
Mathematical elegance at its best.
Recap:-The dot product in this case is just ordinary matrix multiplication.
That’s exactly the same as saying:Because V contains orthogonal columns, its inverse = its transpose (property of orthogonal matrices).
Which is all what SVD says (remember the Critical Conclusion):Any set of vectors (A) can be expressed in terms of its lengths of projections (S) on some set of orthogonal axes (V).
However, we are not quite there yet.
The conventional SVD formula says:But that just means we want to see how:And that’s what we’re going to do.
If you look carefully at the matrix S, you’ll discover it consists of:It turns out (for reasons to be seen later) that it’s best if we could normalize these column vectors, i.
make them of unit length.
This is done by doing the equivalent of dividing each column vector by its magnitude, but in matrix form.
But first, a numerical example to see how this “division” thing is done.
Let’s say we want to divide the 1st column of M by 2.
We will surely have to multiply by another matrix to preserve the equality:It’s straightforward to verify that the unknown matrix is nothing more than the identity matrix, with the 1st element replaced by the divisor = 2:Dividing the 2nd column by 3 now becomes a direct matter — just replace the 2nd element of the identity matrix by 3:It should be obvious how this operation can be generalized to any matrix of any size.
We now want to apply the above “division” concept to the matrix S.
To normalize the columns of S, we divide them by their magnitude……by doing with S what we did with M in the example above:Finally…Singular Value DecompositionOf course, some fine details and rigorous mathematics were, justifiably, swept under the rug, in order to not distract from the core concept.
InterpretationLet’s talk about this U and Σ …What about the sigmas?.Why did we burden ourselves with normalizing S to find them?We’ve already seen that (σᵢ) is the square root of the sum of squared projection lengths, of all points, onto the ith unit vector vᵢ.
What does that mean?Red segments = projections on v1.
Blue segments = projections on v2.
The closer the points to a specific axis of projection, the larger the value of the corresponding σ.
Since the sigmas contain, in their definition, the sum of projection lengths onto a specific axis, they represent how close all the points are to that axis.
if σ₁ > σ₂, then most points are closer to v₁ than v₂, and vice versa.
That’s of immense utility in the myriad applications of SVD.
The Main ApplicationThe algorithms of finding the SVD of a matrix don’t choose the projection directions (columns of matrix V) randomly.
They choose them to be the Principal Components of the dataset (matrix A).
If you’ve read my first article, you know very well what the principal components are……they’re the lines of largest variation (largest variance).
You also know from the same that the goal of dimensionality reduction is to project the dataset on the line (or plane) of largest variance:Magenta: points before projection.
Violet: points after projection (reduced dimensionality).
Now the act of projecting the dataset using SVD becomes a snap, since all the points are already projected (decomposed) on all the principal components (the vᵢ unit vectors):So, for example, to project the dataset on the 1st principal component……all we have to do is remove all columns not related to the 1st principal component.
The projected dataset in now A’.
Multiplying the two matrices (S and Vᵀ above) results in the matrix A′ containing the projected points (violet) in the last graph.
And that’s it…for now.