This, again, would be a bad feature.
So PCA looks for properties that allow us to reconstruct the original characteristics as well as possible.
We should be able to get as much variation as possible in our new properties, and we should be able to reconstruct our original properties from them.
These might seem like two different goals, but let's see how they amount to the same thing.
Look at the figure below. Let's say each point represents a wine, where the x-axis shows the acidity level and the y-axis shows the tannin level.
We don’t know whether they are related or not, but let's assume that they are.
You can see from the plot above that there seems to be some relation between the two.
There is a general trend with which all the points are spread.
Totally uncorrelated points do not show any trend.
Let us draw an arbitrary black line through the set of points and drop a perpendicular red line from each point onto this black line.
The point where a red line touches the black line is called the projection of that point on the line.
This projection is the new property or characteristic which the PCA will construct.
This projection will be a linear combination of the original variables x and y and will be of the form ax+by.
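To make the form ax+by concrete, here is a minimal sketch in Python with NumPy. The three wines, the 30 degree angle, and the helper name project_onto_line are made up for illustration, and the sketch assumes the data is centered first so that the black line passes through the middle of the points.

```python
import numpy as np

def project_onto_line(points, angle_rad):
    """Return a*x + b*y for each point, where (a, b) is the unit
    direction of a line through the origin at the given angle."""
    a, b = np.cos(angle_rad), np.sin(angle_rad)  # a**2 + b**2 == 1
    return points @ np.array([a, b])

# Hypothetical wines: column 0 is acidity (x), column 1 is tannin (y).
wines = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 3.0]])

# Assumption: center the data so the line goes through the mean point.
centered = wines - wines.mean(axis=0)

# One number per wine: its position along the black line.
scores = project_onto_line(centered, np.deg2rad(30))
print(scores)
```

Each entry of scores is the new single property for one wine. Changing the angle changes a and b, and with them the spread of the scores.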
How do we find the black line for which these projections have the maximum spread or variance, and which also allows us to reconstruct our original points from the projections easily?
Look at the makeshift animation above for a while.
As the black line rotates, the spread of the projections on the black line also changes.
If you observe closely, the maximum spread or variation of the red dots occurs when the black line is roughly at 2 o’clock.
The red lines perpendicular to the black line show the distance between each blue point and its projection, which is also the error in reconstructing the original point from the projection.
Stare at the animation for a few minutes and you will notice that the sum of these distances is smallest when the black line is also at 2 o’clock.
This sum of distances is the total reconstruction error. (Strictly speaking, PCA minimizes the sum of squared distances, but the picture is the same.)
PCA finds this black line and the projections for which the variance of the projections is maximum and the reconstruction error is minimum.
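If you would rather check this than stare at an animation, the rotation can be simulated. The sketch below is only an illustration: the wine data is randomly generated, the variable names are my own, and it measures the reconstruction error with squared distances. It turns the black line one degree at a time and records both the variance of the projections and the reconstruction error.

```python
import numpy as np

# Synthetic, made-up wine data with a positive trend between the two axes.
rng = np.random.default_rng(0)
acidity = rng.normal(size=200)
tannin = 0.6 * acidity + rng.normal(scale=0.4, size=200)
points = np.column_stack([acidity, tannin])
points = points - points.mean(axis=0)  # center the cloud

angles_deg = np.arange(0, 180)  # rotate the black line in 1 degree steps
variances, errors = [], []
for theta in np.deg2rad(angles_deg):
    direction = np.array([np.cos(theta), np.sin(theta)])
    scores = points @ direction                 # projections (red dots)
    reconstructed = np.outer(scores, direction) # back in 2-D, on the line
    variances.append(np.mean(scores ** 2))
    errors.append(np.mean(np.sum((points - reconstructed) ** 2, axis=1)))

variances = np.array(variances)
errors = np.array(errors)

best_by_variance = angles_deg[np.argmax(variances)]
best_by_error = angles_deg[np.argmin(errors)]
total_spread = np.mean(np.sum(points ** 2, axis=1))

print(best_by_variance, best_by_error)
```

The two angles come out the same, and the reason is Pythagoras: for every angle, the variance along the line plus the squared reconstruction error adds up to the same total spread of the points. Making one as large as possible makes the other as small as possible.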
Two birds with one stone!
You can now use these red projections as the new features for your machine learning algorithm.
For our wine example, PCA has reduced two-dimensional data to a single dimension.
This technique can be used to project higher-dimensional data onto a lower-dimensional space, which makes visualization much easier.
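The same idea carries over to more dimensions. As a rough sketch, not a production implementation, the function below reduces data with any number of columns to a chosen number of new features. It uses NumPy's singular value decomposition, which is one common way to compute PCA; the function name pca_reduce and the random 5-column data are placeholders of mine.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Project data (rows = samples) onto its first n_components
    principal directions. Returns the new features and the directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered from the largest
    # spread of the projections to the smallest.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components

# Made-up data: 100 samples with 5 properties each.
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 5))

reduced, components = pca_reduce(data, 2)
print(reduced.shape)
```

The two columns of reduced can be drawn as an ordinary scatter plot, which is what makes this useful for visualization.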
Ending remarks
The above introduction to PCA merely scratches the surface of a rather powerful dimensionality reduction technique.
PCA is widely used in data science and machine learning to extract orthogonal features out of high-dimensional data.
I have deliberately tried to avoid the mathematics and rigor of the PCA methodology.
I would encourage you to look it up, go through the mathematics, and understand eigenvectors and eigenvalues.
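As a small taste of where eigenvectors and eigenvalues come in, here is a hedged sketch on made-up two-dimensional data. The eigenvectors of the covariance matrix give the directions of the lines, and each eigenvalue equals the variance of the projections onto its direction. The variable names are my own.

```python
import numpy as np

# Made-up 2-D data with a trend, as in the wine example.
rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(scale=0.3, size=300)
points = np.column_stack([x, y])
centered = points - points.mean(axis=0)

# 2x2 covariance matrix of the two properties.
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back ascending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

top_direction = eigenvectors[:, -1]       # direction with largest variance
scores = centered @ top_direction         # projections onto that direction
projected_variance = scores.var(ddof=1)   # ddof=1 matches np.cov

# Projecting onto both eigenvectors gives uncorrelated new features.
new_features = centered @ eigenvectors
new_cov = np.cov(new_features, rowvar=False)

print(eigenvalues[-1], projected_variance)
```

The two printed numbers agree, and the off-diagonal entries of new_cov are zero up to rounding, which is what "orthogonal features" means in practice.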
Below are some great resources to understand PCA and related topics.
References
I am a big fan of 3blue1brown, and this video series about linear algebra takes your understanding of matrices and its operations to the next level.
This website has a succinct and nice visual description about PCA.
The same website with explanation about eigenvalues and eigenvectors.
True to its title, Matt Brems in his blog A One-Stop Shop for Principal Component Analysis has beautifully explained everything related to PCA.

X8 aims to organize and build a community for AI that not only is open source but also looks at the ethical and political aspects of it.
More such simplified AI concepts will follow.
If you liked this, or have some feedback or follow-up questions, please comment below.