Principal Component Analysis from Statistical and Machine Learning Perspectives (Part 1)
Suhyun Kim
Jan 5
One of the common problems in analysis of complex data comes from a large number of variables, which requires a large amount of memory and computation power.
This is where Principal Component Analysis (PCA) comes in.
It is a technique to reduce the dimension of the feature space by feature extraction.
For example, if we have 10 variables, in feature extraction, we create new independent variables by combining the old ten variables.
By creating new variables it might seem as if more dimensions are introduced, but we select only a few variables from the newly created variables in the order of importance.
Then the number of those selected variables is less than what we started with and that’s how we reduce the dimensionality.
There are multiple ways of looking at PCA.
In this article, I will discuss PCA from the statistical perspective; Part 2 will cover the machine learning perspective (minimum-error formulation and maximization of eigenvalues).
It will be pretty technical.
Familiarity with some of the following will make this article easier to understand: linear algebra (matrices, eigenvectors/eigenvalues, diagonalization, orthogonality) and statistics/machine learning (standardization, variance, covariance, independence, linear regression).
PCA the Statistical Perspective
In order to understand PCA, we can start with an example from this linear algebra textbook in which the author collected a set of test data from his Honors Calculus class of 14 students.
The four variables below are ACT (a score from a national test, with range 1 to 36), FinalExam (the final exam score, with range 0 to 200), QuizAvg (the mean of eight quiz scores, each with range 0 to 100), and TestAvg (the mean of three test scores, each with range 0 to 100).
We have all these test scores, but some of them are redundant, and having four of them makes the data hard to visualize.
However, we can’t just drop some test scores randomly.
That’s where dimensionality reduction plays a role to reduce the number of variables without losing too much information.
Another problem here is that it’s hard to interpret those test scores, because they are based on different ranges and scales.
Variables such as the test scores above, which are measured on different scales or on a common scale with widely differing ranges, are often standardized so that they can be compared on the same footing.
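Standardization
Z-scoring is a common method used to standardize/normalize data: the mean of the data is subtracted from each value, and the result is divided by the standard deviation.
For a data vector $x = (x_1, \dots, x_n)$, the mean is defined to be
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$
The variance of the data set x (in its sample form, with the usual $n - 1$ denominator) is
$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 ,$$
and the standard deviation is the square root of the variance, $s_x = \sqrt{s_x^2}$.
After calculating the mean and the standard deviation, the data can finally be z-scored/normalized by subtracting the mean and dividing by the standard deviation:
$$z_i = \frac{x_i - \bar{x}}{s_x} .$$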
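As a minimal sketch of the z-scoring described above, assuming NumPy: the 14 x 4 score matrix below is randomly generated for illustration only (it is not the textbook's data), and the names `X`, `Z` and `z_score` are hypothetical.

```python
import numpy as np

# HYPOTHETICAL data: 14 students x 4 scores, randomly generated.
# These are NOT the textbook's numbers; they only mimic the four scales.
rng = np.random.default_rng(0)
ability = rng.normal(size=(14, 1))                  # shared factor -> correlated columns
noise = rng.normal(size=(14, 4))
X = (np.array([25.0, 150.0, 80.0, 75.0])            # rough centers of ACT, FinalExam, QuizAvg, TestAvg
     + ability * np.array([3.0, 20.0, 8.0, 9.0])
     + noise * np.array([2.0, 15.0, 6.0, 5.0]))

def z_score(X):
    """Standardize each column: subtract its mean, divide by its sample std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)   # ddof=1 gives the n - 1 (sample) denominator
    return (X - mean) / std

Z = z_score(X)
print(Z.mean(axis=0).round(6))         # every column mean is (numerically) zero
print(Z.std(axis=0, ddof=1).round(6))  # every column sample std is one
```

After this step every column has mean 0 and standard deviation 1, so the four scores can be compared regardless of their original ranges.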
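Covariance/Correlation
In order to see the relation between all those variables, covariance is used.
Covariance tells you about the linear relationship between two variables: how closely the two variables are related.
The sample covariance of the vectors x and y is defined to be
$$\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) ,$$
where $\bar{x}$ and $\bar{y}$ are the means of x and y.
However, one problem with using covariance is that it's affected by the units of measurement.
For instance, if the vector x is measured in kilograms and y is measured in feet, the covariance is in kilogram-feet.
To make the units go away, the sample correlation is used, which is defined to be the covariance divided by the product of the two standard deviations $s_x$ and $s_y$:
$$r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} .$$
Another way to define the sample correlation, this time as a matrix, is to use the z-scored data, which is already divided by the standard deviation: if Z is the matrix whose columns are the z-scored variables, the correlation matrix is
$$\frac{1}{n-1} Z^{T} Z .$$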
And here’s the correlation matrix of the data set:
We can see that the third test score (Quiz Average) and the first test score (ACT) have the highest correlation of 0.
The second highest correlation value is shown between the first and fourth: ACT and Test Average.
So we confirm what we saw by eye earlier that these test scores are related to one another and some of them are redundant.
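The covariance and correlation definitions in this section can be checked numerically. The sketch below assumes NumPy and reuses a randomly generated 14 x 4 matrix (hypothetical data, not the textbook's scores); `np.corrcoef` is only used as a cross-check.

```python
import numpy as np

# HYPOTHETICAL data: randomly generated, NOT the textbook's scores.
rng = np.random.default_rng(0)
ability = rng.normal(size=(14, 1))
noise = rng.normal(size=(14, 4))
X = (np.array([25.0, 150.0, 80.0, 75.0])
     + ability * np.array([3.0, 20.0, 8.0, 9.0])
     + noise * np.array([2.0, 15.0, 6.0, 5.0]))
n = X.shape[0]

# Sample covariance of the first two columns, straight from the definition.
x, y = X[:, 0], X[:, 1]
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# Correlation: covariance divided by the product of the standard deviations.
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Correlation matrix from z-scored data: Z^T Z / (n - 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = Z.T @ Z / (n - 1)

print(np.isclose(R[0, 1], corr_xy))                   # both routes agree
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # matches NumPy's own result
```

Both routes give the same number, which is why the correlation matrix can be computed directly from the z-scored data.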
Now we will commence our principal component analysis.
In order to come up with new dimensions, we will go through two processes:
1. Transformation of the data set: define new variables to replace the existing ones.
2. Selection of the data set: measure how well the new variables represent the original variables.
The details of each step follow.
The Transformation Step
We will start from the correlation matrix computed above.
It is important to see that this matrix is necessarily symmetric, because it is obtained from multiplying the transpose of Z and Z.
With that being said, a theorem of symmetric matrices will be used here.
A real n x n matrix A is symmetric if and only if there is an orthonormal basis of R^n consisting of eigenvectors of A.
In this case, there exists an orthogonal matrix P and a diagonal matrix D such that $A = P D P^{T}$.
Here the columns of P are the eigenvectors of A (the correlation matrix), and the diagonal entries of D are the corresponding eigenvalues.
From here, we will transform our original z-scored data by performing matrix multiplication with P: ZP.
The shape of this new set of data doesn’t look any different from the shape of the original dataset.
Now we have successfully transformed our original data into the new one and we will reduce the dimension of the original dataset in the next step.
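The transformation step can be sketched in a few lines, assuming NumPy. `np.linalg.eigh` is the routine for symmetric matrices; it returns eigenvalues in ascending order, so they are re-sorted here. The data is again randomly generated (hypothetical), and the variable names are illustrative.

```python
import numpy as np

# HYPOTHETICAL data: randomly generated, NOT the textbook's scores.
rng = np.random.default_rng(0)
ability = rng.normal(size=(14, 1))
noise = rng.normal(size=(14, 4))
X = (np.array([25.0, 150.0, 80.0, 75.0])
     + ability * np.array([3.0, 20.0, 8.0, 9.0])
     + noise * np.array([2.0, 15.0, 6.0, 5.0]))
n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
A = Z.T @ Z / (n - 1)                    # correlation matrix (symmetric)

# Diagonalize A = P D P^T; eigh is meant for symmetric matrices.
eigvals, P = np.linalg.eigh(A)
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, P = eigvals[order], P[:, order]
D = np.diag(eigvals)

print(np.allclose(P @ D @ P.T, A))       # A is recovered from P and D
print(np.allclose(P.T @ P, np.eye(4)))   # P is orthogonal

ZP = Z @ P                               # transformed data
print(ZP.shape)                          # (14, 4), same shape as Z
```

The transformed matrix has the same shape as the z-scored data; no dimension has been dropped yet.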
The Selection Step
Now we will choose vectors from ZP to actually reduce the dimension, but how do we know which ones to keep and which ones to drop?
The decision will be made based on the eigenvalues.
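We will re-write ZP in a readable format where each column vector of ZP is denoted as y_i: that is, $ZP = [\, y_1 \;\; y_2 \;\; y_3 \;\; y_4 \,]$ with $y_i = Z p_i$, where $p_i$ is the i-th column of P.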
y1 is the first principal component and is defined to be the new vector whose coefficients are the eigenvector of the correlation matrix with the largest eigenvalue of 2.
Then the second principal component is y2, because its coefficients correspond to the second largest eigenvalue of 0.
In this case, the last two eigenvalues are insignificant because they are not large enough compared to the first two.
Therefore, we will choose the columns of ZP whose coefficients are the eigenvectors corresponding to the two largest eigenvalues, namely y1 and y2.
And those two vectors are our new vectors that we will use as the result of dimensionality reduction.
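Keeping y1 and y2 amounts to keeping the first two columns of ZP once the eigenvalues are sorted in decreasing order. A minimal sketch, assuming NumPy, randomly generated (hypothetical) data, and k = 2 as in the text:

```python
import numpy as np

# HYPOTHETICAL data: randomly generated, NOT the textbook's scores.
rng = np.random.default_rng(0)
ability = rng.normal(size=(14, 1))
noise = rng.normal(size=(14, 4))
X = (np.array([25.0, 150.0, 80.0, 75.0])
     + ability * np.array([3.0, 20.0, 8.0, 9.0])
     + noise * np.array([2.0, 15.0, 6.0, 5.0]))
n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
A = Z.T @ Z / (n - 1)

eigvals, P = np.linalg.eigh(A)
order = np.argsort(eigvals)[::-1]        # sort eigenvalues from largest to smallest
eigvals, P = eigvals[order], P[:, order]

k = 2                                    # number of components kept, as in the text
ZP = Z @ P
Y_reduced = ZP[:, :k]                    # columns y1 and y2

# Equivalent shortcut: project directly onto the first k eigenvectors.
print(np.allclose(Y_reduced, Z @ P[:, :k]))
print(Y_reduced.shape)                   # (14, 2)
```

In practice the shortcut `Z @ P[:, :k]` avoids computing the columns that will be discarded.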
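Now arises the question: why do we compute ZP to choose a new set of vectors?
Why is ZP so important?
That’s because the covariance matrix of ZP is the diagonal matrix D.
Here’s the proof of why (this is the most significant part of the first half), which answers the question of why we multiply Z by P.
Let $A = \frac{1}{n-1} Z^{T} Z$ be the correlation matrix, where n is the number of rows of Z, and let $A = P D P^{T}$ with $P^{T} P = I$.
The columns of Z have mean zero, so the columns of ZP do as well, and therefore
$$\operatorname{Cov}(ZP) = \frac{1}{n-1} (ZP)^{T} (ZP) = P^{T} \left( \frac{1}{n-1} Z^{T} Z \right) P = P^{T} A P = P^{T} P D P^{T} P = D .$$
This explains the reason why we multiply the normalized data Z with P.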
The fact that the covariance matrix of ZP is diagonal indicates that there is no linear relationship between any two different new variables; each one co-varies only with itself.
In other words, the new variables we created are uncorrelated with one another (this is weaker than statistical independence in general, but it is exactly what a diagonal covariance matrix guarantees).
Therefore, we have successfully come up with a new, transformed dataset whose variables are mutually uncorrelated, and we can select variables based on their variance, which is given by the eigenvalues.
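The claim that the covariance matrix of ZP is the diagonal matrix of eigenvalues is easy to check numerically. A minimal sketch, assuming NumPy and randomly generated (hypothetical) data:

```python
import numpy as np

# HYPOTHETICAL data: randomly generated, NOT the textbook's scores.
rng = np.random.default_rng(0)
ability = rng.normal(size=(14, 1))
noise = rng.normal(size=(14, 4))
X = (np.array([25.0, 150.0, 80.0, 75.0])
     + ability * np.array([3.0, 20.0, 8.0, 9.0])
     + noise * np.array([2.0, 15.0, 6.0, 5.0]))
n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
A = Z.T @ Z / (n - 1)

eigvals, P = np.linalg.eigh(A)
ZP = Z @ P

# Sample covariance matrix of the transformed data (columns are variables).
C = np.cov(ZP, rowvar=False)

# Off-diagonal entries vanish; the diagonal holds the eigenvalues of A.
print(np.allclose(C, np.diag(eigvals)))
```

The off-diagonal entries are zero up to floating-point error, so the new variables are uncorrelated, and each variance equals one eigenvalue.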
Now let’s talk more about the eigenvalues/variance.
The important thing here is that the total variance of the original dataset is the same as that of the transformed dataset.
That is to say, the sum of the entries of the diagonal matrix D is the sum of the variances of the z-scored variables; this holds because D and the correlation matrix are related by an orthogonal change of basis, which preserves the trace, and since each z-scored variable has variance 1, the total here is 4.
With the covariance matrix of ZP being D, each diagonal entry of D is the variance of the corresponding column of ZP, so choosing the columns that correspond to the largest eigenvalues keeps as much variance as possible.
Thus, by selecting the columns that correspond to the largest eigenvalues, we are selecting the new variables that account for the largest fraction of the variance, where the fraction is the variance of each selected variable divided by the total variance of the original dataset.
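The fraction of variance carried by each new variable can be computed directly from the eigenvalues. The sketch below assumes NumPy and randomly generated (hypothetical) data, so the printed fractions are illustrative and are not the article's values.

```python
import numpy as np

# HYPOTHETICAL data: randomly generated, NOT the textbook's scores.
rng = np.random.default_rng(0)
ability = rng.normal(size=(14, 1))
noise = rng.normal(size=(14, 4))
X = (np.array([25.0, 150.0, 80.0, 75.0])
     + ability * np.array([3.0, 20.0, 8.0, 9.0])
     + noise * np.array([2.0, 15.0, 6.0, 5.0]))
n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
A = Z.T @ Z / (n - 1)

eigvals = np.linalg.eigvalsh(A)[::-1]          # eigenvalues, largest first

total_variance = eigvals.sum()                 # equals trace(A) = number of variables
explained = eigvals / total_variance           # fraction of variance per component
cumulative = np.cumsum(explained)              # fraction kept by the first k components

print(np.isclose(total_variance, 4.0))         # four z-scored variables, variance 1 each
print(explained.round(3))
print(cumulative.round(3))
```

Reading off `cumulative[k - 1]` tells how much of the total variance survives when only the first k components are kept.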
Conclusion of PCA the Statistical Perspective
Trying to capture as much variance as possible is common practice in statistics when replacing original variables with fewer new variables “to account for a high percentage of the variance in the original dataset.”
It makes sense intuitively, because we want to discard similar features and keep only the features with maximum dissimilarity when reducing the dimension of the dataset.
In this case, we can see that some of the test scores are highly correlated with each other, so some of the test scores are redundant.
But we’d like to see mathematically why it’s beneficial to maintain as much variance as possible.
The answer to this question is easier to see in PCA from the machine learning perspective, which will be Part 2 of this article.