A Gentle Introduction to Deep Learning, Part 3: PCA & Linear Algebra (Advanced)

Akshat Jain · Jan 27

Photo by Antoine Dautry

"You can't build a great building on a weak foundation."
This quote captures what I am trying to do here: you cannot truly learn machine learning or deep learning until you have a grasp of some important mathematical concepts like linear algebra and statistics. Keeping that goal in mind, I started this series.
In Part 2 of this series we went over some basics of linear algebra. In this part I will cover some more advanced topics of linear algebra, and we will also learn about one of the important techniques of machine learning, Principal Component Analysis (PCA), from scratch. If you are unfamiliar with the previous parts (Part 1 & Part 2) of this series, I suggest you read them first.
Now let’s get started.
Introduction

The sheer size of data is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. To cope with this, many techniques have been created, and Principal Component Analysis, aka PCA, is one of them.
It is a simple and useful linear transformation technique, but it also has been a “black box” for many of us.
In this article I will try to open that black box and unravel its internals in greater detail and simplicity.
I will try to cover every detail of PCA, both the mathematics and the theory, so at first it may all seem like an alien language to you. Don't give up: stay with me, read it again if needed, and I am sure you will come away with a complete understanding of PCA and never have to worry about it again.
PCA — What, Why, How?

What?

PCA is a technique used to reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d), in order to increase computational efficiency while retaining most of the information.
Why?

Let's say you are working on a 'house price prediction' project, and at first you have only two features: the area of the house and its price.
Based on that data you can draw a plot to understand it more easily.
Plot in 2-D representing 2 variables

Here you can easily deduce the trend because of the visualization.
Now let's say you have three features: the area of the house, its location, and its price. To visualize this data, you have to add one more axis to your plot.
3-D plot representing 3 variables

But with higher dimensions (>3), it becomes difficult to understand the relationships between the variables, which makes it hard to choose a particular tactic to solve the problem and also invites issues like overfitting.
How?

PCA constructs new characteristics (features) from the old ones, as combinations of them.
It finds the best possible characteristics that summarise the variables; this is why it is useful.
So in our house price prediction example, if you have length, breadth, and location as features, then a new feature might be area, which is derived from length and breadth; we can then discard those old features.
You might ask, "what does summarise mean here?" There are two points you should know to understand how PCA works:

1. You are looking for a new feature that differs strongly across examples (creates a pattern). If you come up with some random new feature that makes all the examples look the same, it is a bad summary.
2. You are looking for a feature that would allow you to reconstruct the original data. If you come up with a new feature that has no relation to the original data, there is no way to reconstruct the original data from it.
To elaborate, let's take two variables whose scatter plot looks like this:

Plot of some examples showing two correlated features

As you can see, these are correlated, so we can make a new variable from their linear combination. Here is how the projection of all these points looks on different lines:

Different lines representing different projections of these points

As I said before, PCA will find the "best" possible line to create the new variable.
Now, what is the definition of "best"? There are two criteria:

1. Maximum variance: the variance (spread) of the projected data should be as large as possible, which you can see in the animation at the point where the spread is maximal.
2. Minimum reconstruction error: if we reconstruct the original variable from the new one, the length of the line connecting each blue dot to the red line (its projection) is the reconstruction error, and the total length should be as small as possible.

If you look carefully, you will see that both conditions are fulfilled at the same time (at the magenta ticks).
This is how PCA works, and now it's time for some mathematical study of it. But before that, you need to know a few background topics, so I will cover them first.
Some advanced topics you should know

Singular Value Decomposition

In Part 2 of this series we saw how to decompose a square matrix into eigenvectors and eigenvalues.
But not every matrix can be decomposed using eigendecomposition.
Non-square matrices cannot be decomposed this way, and even some square matrices do not have a full set of eigenvectors and eigenvalues. This is where SVD, or Singular Value Decomposition, comes into the picture: it is used to factorise a matrix into singular vectors and singular values. The interesting thing about SVD is that any matrix can be decomposed with it.
According to SVD, a matrix A can be decomposed into the product of three matrices:

A = UDVᵗ

where A is an m×n matrix, U is an orthogonal m×m matrix, D is an m×n diagonal matrix, and V is an orthogonal n×n matrix. The columns of U and V are called the singular vectors, and the diagonal entries of D are the singular values.
Now let's dig in and find out more about U, V and D. First, let's see what we get from AᵗA:

AᵗA = (VDᵗUᵗ)(UDVᵗ)

The only thing we can simplify here is UᵗU, which equals I (the identity matrix) because U is orthogonal. So the final result is:

AᵗA = VDᵗDVᵗ = VD²Vᵗ

We can conclude a few things from this. AᵗA is a square matrix and DᵗD is a square diagonal matrix (so DᵗD = D²), which lets us relate this to eigendecomposition: D² plays the role of the eigenvalues λ, and the columns of V are the eigenvectors of AᵗA. This tells us about V and D, but we still don't know anything about U. So for U, let's find AAᵗ:

AAᵗ = (UDVᵗ)(VDᵗUᵗ)

As above, VᵗV equals I, and then:

AAᵗ = UDDᵗUᵗ = UD²Uᵗ

So now we know about U too: its columns are the eigenvectors of AAᵗ.
So by this we can easily compare SVD with eigendecomposition.
SVD is mostly used in connection with PCA, and you will see why later.
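To make the decomposition concrete, here is a small NumPy sketch (my own illustration, not from the article) that factorises a non-square matrix and checks the connection to eigendecomposition described above:

```python
import numpy as np

# A 3x2 (non-square) matrix: eigendecomposition does not apply, but SVD does.
A = np.array([[2.0, 4.0],
              [1.0, 3.0],
              [0.0, 5.0]])

# full_matrices=True returns U (3x3), the singular values, and Vᵗ (2x2).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular "diagonal" matrix D from the singular values.
D = np.zeros_like(A)
D[:len(s), :len(s)] = np.diag(s)

# A should equal U D Vᵗ up to floating-point error.
assert np.allclose(A, U @ D @ Vt)

# The squared singular values are the eigenvalues of AᵗA (and of AAᵗ).
eigvals = np.linalg.eigvalsh(A.T @ A)          # ascending order
assert np.allclose(np.sort(s**2), eigvals)
```

If the assertions pass silently, the identities A = UDVᵗ and D² = λ(AᵗA) hold for this matrix.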
The Moore-Penrose Pseudoinverse

Suppose that you want to solve the linear equation Ax = b, where A is an n×n square matrix, b is an n×1 vector, and x is an unknown n×1 vector.
To solve for x, we multiply both sides by A⁻¹:

Ax = b
A⁻¹Ax = A⁻¹b

and since A⁻¹A = I,

x = A⁻¹b

But this is a special case in which A is square. Now let's consider the more general case

Ax = b

where A is an m×n matrix, x is an unknown vector of size n×1, and b is a vector of size m×1.
But now we cannot solve for x like we did earlier because A is not invertible here.
Another way to look at this is the matrix form, which looks something like this:

I am not good in drawing!!

The problem is that we have too few unknowns (x) and too many equations (rows of A). Let me elaborate: say you have the following equations:

2x₁ + 3x₂ = 5
x₁ + 9x₂ = 7
8x₁ + 9x₂ = 19

There is no single pair (x₁, x₂) that satisfies all these equations simultaneously, and this is exactly the situation we are facing above.
Seen graphically, it looks something like this:

There is no line that can cover all these points

We have several points, but it is impossible to find a perfect solution that fits them all. This introduces the least squares problem: instead of finding an exact solution, we look for the best approximation. In other words, we need to find x such that the norm of the difference Ax − b is as small as possible (in the least squares sense).
Let's do this. We have

Ax' ≈ b

where I write x' instead of x because we are now computing an approximate value. Multiplying both sides by Aᵗ,

AᵗAx' ≈ Aᵗb

We know that AᵗA is a square matrix, and our goal is to isolate x' on the left-hand side, so multiply both sides by (AᵗA)⁻¹:

x' ≈ (AᵗA)⁻¹Aᵗb

This equation solves the least squares problem, and the term (AᵗA)⁻¹Aᵗ is known as the pseudoinverse or generalized inverse, denoted A⁺.
Trace Operator

The trace operator gives the sum of the diagonal entries of a matrix:

Tr(A) = ∑ᵢ Aᵢ,ᵢ

It is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be written using matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

||A|| = √Tr(AAᵗ)

Some properties:

- Tr(A) = Tr(Aᵗ)
- Tr(ABC) = Tr(CAB) = Tr(BCA), where A, B and C are square matrices
- Tr(AB) = Tr(BA), where A ∈ ℝᵐˣⁿ and B ∈ ℝⁿˣᵐ
- a = Tr(a), where a is a scalar
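These properties are easy to verify numerically; here is a quick NumPy check (my own illustration) on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

# Tr(A) = Tr(Aᵗ)
assert np.isclose(np.trace(A), np.trace(A.T))

# Cyclic property: Tr(ABC) = Tr(CAB) = Tr(BCA)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))

# Frobenius norm via the trace: ||A|| = sqrt(Tr(A Aᵗ))
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T)))
```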
Determinant

The determinant of a square matrix, denoted det(A), is a function that maps matrices to real scalars; it can be viewed as the volume scaling factor of the linear transformation described by the matrix. For a 2×2 matrix with rows (a, b) and (c, d), the determinant is:

|A| = ad − bc

and for a 3×3 matrix with rows (a, b, c), (d, e, f) and (g, h, i), the determinant is:

|A| = a(ei − fh) − b(di − fg) + c(dh − eg)
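Both formulas can be checked against NumPy's determinant routine; the matrices below are made up for illustration:

```python
import numpy as np

# 2x2 case: |A| = ad - bc
A2 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
a, b, c, d = A2.ravel()
assert np.isclose(np.linalg.det(A2), a * d - b * c)

# 3x3 case: cofactor expansion along the first row
A3 = np.array([[2.0, 0.0, 1.0],
               [1.0, 3.0, 2.0],
               [0.0, 1.0, 4.0]])
(a, b, c), (d, e, f), (g, h, i) = A3
manual = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
assert np.isclose(np.linalg.det(A3), manual)
```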
Mean

The mean is the average used to describe the central tendency of the data. For example, if you have the marks obtained by 5 students in a class,

X = [10, 5, 8, 4, 3]

then the mean of the data is defined as the sum of the values divided by their count:

mean = (1/n) ∑ᵢ xᵢ

Applying this formula to X, we get

Mean = 30/5 = 6
Standard Deviation

The mean alone doesn't actually tell us a lot about the data; it just gives us an average value.
To elaborate, take two data sets that have exactly the same mean (10) but are obviously quite different:

[0, 8, 12, 20] and [8, 9, 11, 12]

The mean is the same in both cases, so what is the difference? It is the spread of the data.
The Standard deviation(SD) is the measure of how spread out the data is.
The formula for SD is:

s = √( ∑ᵢ (xᵢ − x̄)² / (n − 1) )

In other words, it tells you how tightly the examples in a data set cluster around the mean.
I hear you asking, "why (n − 1) instead of n?" The full answer is a bit involved, but in short: if you are working with a sample of some larger real-world population (say, surveying 500 people about an election), you should use (n − 1), because it gives an answer closer to the standard deviation you would get with n if you could measure the entire population.
Variance

Variance is another measure of the spread of data in a data set; it measures how far the data is spread out and is almost identical to the standard deviation. The formula for variance is:

Var = ∑ᵢ (xᵢ − x̄)² / (n − 1)

As you can see, it is simply the square of the standard deviation.
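Putting the mean, standard deviation, and variance together on the marks example from above (a NumPy sketch of my own):

```python
import numpy as np

X = np.array([10, 5, 8, 4, 3], dtype=float)

# Mean: sum of the values divided by their count.
mean = X.sum() / len(X)
assert mean == 6.0

# Sample standard deviation uses (n - 1) in the denominator (ddof=1 in NumPy).
sd = np.sqrt(((X - mean) ** 2).sum() / (len(X) - 1))
assert np.isclose(sd, np.std(X, ddof=1))

# Variance is just the square of the standard deviation.
assert np.isclose(sd ** 2, np.var(X, ddof=1))

# The two sets with the same mean but very different spread:
for data in ([0, 8, 12, 20], [8, 9, 11, 12]):
    arr = np.asarray(data, dtype=float)
    print(arr.mean(), np.std(arr, ddof=1))
```

Both printed lines show a mean of 10.0, but the standard deviations differ sharply, which is exactly the point made above.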
Covariance

The last two measures we looked at are purely 1-dimensional: you can apply them to data like the heights of people in a room or the marks obtained by a class.
But data sets in real life have more than one dimension, and the aim of statistical analysis is to find the relationship between these dimensions.
For example, we might have two-dimensional data, such as the number of hours students study and the marks they obtain; we want to analyse it to find a relation, and that's where covariance is used.
Covariance is always measured between 2 dimensions. If you calculate the covariance between a dimension and itself, you get the variance. So if you had a 3-dimensional data set (x, y, z), you could measure the covariance between the x and y dimensions, the y and z dimensions, the x and z dimensions, and each dimension with itself.
The formula for the covariance between two dimensions X and Y is:

cov(X, Y) = ∑ᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

So what does it tell us? The exact number is not as important as its sign. A positive value indicates that both dimensions increase together, while a negative value indicates that one dimension increases as the other decreases. In our example, a positive value indicates that students' marks increase as the number of study hours increases.
Covariance Matrix

As mentioned, covariance is measured between two dimensions, so for higher-dimensional data you get a different covariance value for each pair of dimensions. A useful way to collect all the possible covariance values between the different dimensions is to put them in a matrix. The covariance matrix for a data set with n dimensions is:

Cⁿˣⁿ = (cᵢ,ⱼ), where cᵢ,ⱼ = cov(Dimᵢ, Dimⱼ)

So the covariance matrix for a 3-dimensional data set (x, y, z) is a 3×3 matrix whose entry (i, j) is the covariance between dimensions i and j. Some points to note: down the main diagonal, the covariance is between a dimension and itself, which equals the variance of that dimension; and since cov(a, b) = cov(b, a), the matrix is symmetric about the main diagonal.
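Both points can be verified directly; here is a NumPy sketch on a small, made-up 3-dimensional data set (the numbers are my own, chosen for illustration):

```python
import numpy as np

# Hypothetical 3-dimensional data: each row is one example (x, y, z).
data = np.array([[2.0, 4.0, 1.0],
                 [3.0, 6.0, 0.0],
                 [5.0, 9.0, 2.0],
                 [6.0, 12.0, 1.0]])

n = data.shape[0]
centered = data - data.mean(axis=0)

# Manual covariance matrix: C = XᵗX / (n - 1) on mean-centered data.
C_manual = centered.T @ centered / (n - 1)

# np.cov expects variables in columns when rowvar=False, matching our layout.
C_numpy = np.cov(data, rowvar=False)
assert np.allclose(C_manual, C_numpy)

# Diagonal entries are variances; the matrix is symmetric.
assert np.allclose(np.diag(C_manual), np.var(data, axis=0, ddof=1))
assert np.allclose(C_manual, C_manual.T)
```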
Mathematical Explanation of PCA

Finally, it's time for the final showdown. For simplicity I am going to use only 2 variables here.
Step 1: Get some data

I am using a made-up data set here.

Step 2: Subtract the mean

For PCA to work properly, you have to subtract the mean from each data dimension: subtract the mean of the x values from every x value, and similarly the mean of the y values from every y value. This produces a data set whose mean is zero.
Adjusted data / Plot of original data

Step 3: Calculate the covariance matrix

As you already learnt above, this is calculated in the same way. The data is 2-dimensional, so the matrix will be 2×2. For the above data, the non-diagonal elements of the resulting matrix are positive, so we should expect x and y to increase together.
Step 4: Perform eigendecomposition of the covariance matrix

Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues. It is important to notice that both eigenvectors have unit length.
Eigenvectors on the plot; one of them represents the best fit

So what do they mean? If you look at the plot above, you can see that the data has quite a strong pattern. The two dotted lines are the eigenvectors, and as I told you in the previous part, they are orthonormal. They also carry information: one of the eigenvectors goes through the middle of the points like a line of best fit, showing how strongly the data is related along that direction. So, by taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterise the data.
Step 5: Choose components and form a feature vector

This step provides the notion of dimensionality reduction, made possible by the eigenvalues and eigenvectors obtained above. The eigenvector with the highest eigenvalue is the principal component of the data set; in simpler words, it carries the most information. In our example, the eigenvector that provided the best fit has the largest eigenvalue. Generally, in this step you order the eigenvalues from highest to lowest. You can then decide whether to ignore the components of lesser significance: you do lose some information, but if the eigenvalues are small, you don't lose much.
Now you form a feature vector, a matrix whose columns are the eigenvectors you want to keep:

FeatureVector = (eig₁ eig₂ … eigₙ)

From the eigenvectors above we have two choices: keep both eigenvectors, or leave out the smaller, less significant component.

Step 6: Derive the new data set

This is the final and easiest step in PCA.
The final data looks like this:

FinalData = RowFeatureVector × RowDataAdjust

We simply take the transpose of the feature vector obtained above (RowFeatureVector, with the eigenvectors in rows) and left-multiply it by the transpose of the mean-adjusted data (RowDataAdjust) calculated earlier.
The transformed data and plot for our example are:

Plot after conversion
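The six steps above can be sketched end to end in NumPy. The data set below is assumed for illustration (the article's table is an image), so treat the numbers as stand-ins:

```python
import numpy as np

# Step 1: some 2-D data (assumed for illustration).
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
                 [1.5, 1.6], [1.1, 0.9]])

# Step 2: subtract the mean of each dimension.
mean = data.mean(axis=0)
adjusted = data - mean

# Step 3: covariance matrix of the centered data.
cov = np.cov(adjusted, rowvar=False)

# Step 4: eigendecomposition (eigh suits symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 5: order components by eigenvalue, highest first, and keep the top one.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
feature_vector = eigvecs[:, :1]

# Step 6: FinalData = RowFeatureVector x RowDataAdjust
final_data = feature_vector.T @ adjusted.T
print(final_data.shape)   # (1, 10): ten examples, each reduced to one dimension
```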
Recover the old data

You now know how to transform data with PCA, but now it's time to learn how to get it back in its original form. Before we proceed, remember that if you reduced the number of eigenvectors in the final transformation, the retrieved data will have lost some information.
Recall the final transformation:

FinalData = RowFeatureVector × RowDataAdjust

To retrieve the data we can turn it around:

RowDataAdjust = RowFeatureVector⁻¹ × FinalData

where RowFeatureVector⁻¹ is the inverse of RowFeatureVector. However, we know that its rows are orthonormal eigenvectors, which means

RowFeatureVector⁻¹ = RowFeatureVectorᵗ

so we can write

RowDataAdjust = RowFeatureVectorᵗ × FinalData

But this is the data after subtracting the mean, so to get the original data back we need to add that mean:

RowOriginalData = (RowFeatureVectorᵗ × FinalData) + OriginalMean

The plot below shows the retrieved data; it differs from the original because of the information lost:

Plot after data recovery
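The recovery step can be sketched the same way. The centered data and mean below are illustrative values consistent with the sketch above; note the reconstruction error is small but nonzero because one component was dropped:

```python
import numpy as np

# Hypothetical centered data and the mean that was subtracted earlier.
adjusted = np.array([[0.69, 0.49], [-1.31, -1.21], [0.39, 0.99],
                     [0.09, 0.29], [1.29, 1.09], [0.49, 0.79],
                     [0.19, -0.31], [-0.81, -0.81], [-0.31, -0.31],
                     [-0.71, -1.01]])
mean = np.array([1.81, 1.91])

cov = np.cov(adjusted, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, [np.argmax(eigvals)]]   # keep only the top component

final_data = top.T @ adjusted.T          # forward transform (Step 6)

# RowDataAdjust = RowFeatureVectorᵗ x FinalData (orthonormal, so ᵗ = ⁻¹),
# then add the mean back to undo Step 2.
recovered = (top @ final_data).T + mean

# With a component dropped, recovery is approximate, not exact.
original = adjusted + mean
err = np.abs(recovered - original).max()
print(err)
```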
Relation between PCA & SVD

While the eigendecomposition of the covariance matrix may be more intuitive, most PCA implementations perform a Singular Value Decomposition (SVD) to improve computational efficiency. Let X be the (mean-centered) data matrix of shape n × p, and let C be the covariance matrix of shape p × p, given by:

C = XᵗX / (n − 1)

Performing the singular value decomposition of X, we get:

X = UDVᵗ

Substituting this into C:

C = (VDᵗUᵗ)(UDVᵗ) / (n − 1)

Since UᵗU = I and DᵗD = D²,

C = VD²Vᵗ / (n − 1)

Comparing this with the eigendecomposition of C, we get

λ = D² / (n − 1)

So the principal components are given by

XV = UDVᵗV = UD

and we have established the connection between SVD and PCA.
Note:- I also implemented PCA from scratch by Python that you can fork from here.
This concludes the third part of this series.
Here I covered some advanced topics of linear algebra and a detailed explanation of Principal Component Analysis.
I am currently working on the next part of this series and will update the link once it is completed. I hope you enjoyed this part; we will learn more in the upcoming one.
References

Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.

PCA
- A Tutorial on Principal Component Analysis by Jonathon Shlens.
- Principal Component Analysis by Lindsay I Smith.
- PCA by Kristjan Korjus.
- PCA in 3 simple steps by Sebastian Raschka.
- Stack Overflow answer on the explanation of PCA.
- Relation between PCA and SVD by Andre Perunicic.
- Stack Overflow answer on the relation between PCA and SVD.

SVD
- Singular Value Decomposition by MIT OpenCourseWare.
- SVD lecture by Stanford University.

Moore-Penrose Pseudoinverse
- Tutorial on the Moore-Penrose Pseudoinverse.
- The Moore-Penrose Pseudoinverse: A Tutorial Review of the Theory.
- Derivation of the Moore-Penrose Pseudoinverse.

Statistics
- Standard Deviation by Robert Niles.
- Finding Eigenvalues and Eigenvectors.