I hope you chose the black line as it is closer to most of the data points (see figure below).

The point where a red tick intersects the black/orange line is the projection of the corresponding blue data point.

Pay attention to the projection distances or the size of red ticks that connect each data point to the black and orange lines respectively.

The black line minimizes the cumulative projection distances of the data points.

In other words, the black vector minimizes the projection error, or information loss, as we move from a 2D to a 1D representation of our data.

It should be noted that the variance or the ‘spread’ of data is maximum along the black line.

If this interpretation is not very obvious, the following animation may help.

It shows how the ‘maximum variance’ and ‘minimum projection error’ are reached at the same time, that is, along the magenta ticks on either side of the data cloud.

Courtesy: Making sense of principal component analysis, eigenvectors & eigenvalues

Thus, a desirable property of matrix Y is that its first principal component (the new feature) should lie along the line that minimizes projection error while simultaneously maximizing the variance of the projected data.
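The claim that maximum variance and minimum projection error are reached along the same line can be checked numerically. The sketch below is only an illustration under stated assumptions: the data is made up, the cloud is centred at the origin, and the projection error is measured as the mean squared perpendicular distance (the helper `line_scores` is a hypothetical name, not something from the original post).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, correlated 2D data standing in for Room area vs Room price.
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
data = np.column_stack([x, y])
data = data - data.mean(axis=0)          # centre the cloud at the origin


def line_scores(points, angle):
    """Return (variance of projections, mean squared projection error)
    for the line through the origin at the given angle."""
    u = np.array([np.cos(angle), np.sin(angle)])   # unit vector along the line
    coords = points @ u                             # 1D coordinate of each projection
    projected = np.outer(coords, u)                 # the projections, back in 2D
    variance = np.mean(coords ** 2)                 # data is centred, so this is the variance
    error = np.mean(np.sum((points - projected) ** 2, axis=1))
    return variance, error


# Try one candidate line per degree between 0 and 180 degrees.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
scores = np.array([line_scores(data, a) for a in angles])
variances, errors = scores[:, 0], scores[:, 1]

best_by_variance = np.degrees(angles[np.argmax(variances)])
best_by_error = np.degrees(angles[np.argmin(errors)])
print(f"max variance at {best_by_variance:.0f} deg, min error at {best_by_error:.0f} deg")
```

For every candidate line, the variance and the squared projection error add up to the same total (Pythagoras applied to each point), which is why increasing one necessarily decreases the other.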

Redundancy

PCA exploits the inherent redundancy in a dataset to reduce dimensions.

Consider the following plots of 2D feature spaces covering the possible spectrum of data redundancies.

Figure A plots Staff salaries vs Room area (sq ft), which are uncorrelated with each other.

The two dimensions in Figure A do not exhibit any redundancy, so PCA cannot reduce them to one dimension without discarding information.

At the other extreme, Figure C is a plot of Room area in square metres vs Room area in square feet.

There is complete correlation between the two features, therefore in this scenario, it is safe to eliminate one of the two as they both are essentially giving the same information.

The scenario where PCA is most useful is Figure B, which is the same plot as in our previous example: Room price ($) vs Room area (sq ft).

Figure B shows some correlation between the two features, indicating that the data can be re-expressed by a new feature that is a linear combination of the old features.
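The three scenarios can be imitated with synthetic data to see how the correlation coefficient separates them. Everything in this sketch is an assumption made for illustration: the numbers are invented, and the helper `corr` is a hypothetical convenience wrapper around `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Invented data for the three scenarios.
area_sqft = rng.uniform(100, 500, size=n)

salary = rng.normal(50_000, 8_000, size=n)                 # Figure A: unrelated to area
price = 200 * area_sqft + rng.normal(0, 20_000, size=n)   # Figure B: related, with noise
area_sqm = area_sqft * 0.09290304                          # Figure C: exact unit conversion


def corr(a, b):
    """Pearson correlation coefficient between two 1D arrays."""
    return np.corrcoef(a, b)[0, 1]


corr_a = corr(area_sqft, salary)    # close to 0: no redundancy
corr_b = corr(area_sqft, price)     # between 0 and 1: partial redundancy
corr_c = corr(area_sqft, area_sqm)  # essentially 1: complete redundancy

print(f"A: {corr_a:.2f}  B: {corr_b:.2f}  C: {corr_c:.2f}")
```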

Thus, when we change our basis from 2D to 1D by projecting each data point onto the black line as shown previously, we are also eliminating feature redundancy.

As you can observe in Figure B, the data points become uncorrelated precisely along the black line passing through the data cloud.

The same is demonstrated in the reoriented figure below.

Variance is the spread of data for one variable, whereas covariance is a measure of how two variables vary together.

If we denote the features Room area (sq ft) and Room price ($) as variables x and y respectively, then the variance of each variable, their covariance, and the resulting covariance matrix can all be computed with the standard formulas.

Note that I haven’t gone into the details of the mathematical calculations of variance and covariance, as they are straightforward.
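For reference, the standard sample definitions for the two variables x and y can be written as follows. The symbol m for the number of observations and the 1/(m - 1) normalisation are assumptions here; some texts divide by m instead, which does not change any of the conclusions.

```latex
\sigma_x^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})^2,
\qquad
\sigma_y^2 = \frac{1}{m-1}\sum_{i=1}^{m}(y_i-\bar{y})^2

\sigma_{xy} = \frac{1}{m-1}\sum_{i=1}^{m}(x_i-\bar{x})(y_i-\bar{y})

S = \begin{pmatrix}
\sigma_x^2 & \sigma_{xy} \\
\sigma_{xy} & \sigma_y^2
\end{pmatrix}
```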

The take home message is that a covariance matrix is always symmetric with all the entries in the main diagonal being the variances of each variable.

All the other entries are covariances of each pair of variables.
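These properties are easy to confirm with NumPy. The sketch below uses invented area and price values and assumes the features are stored as rows, which is the layout `np.cov` expects by default.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented observations of Room area (sq ft) and Room price ($).
area = rng.uniform(100, 500, size=200)
price = 200 * area + rng.normal(0, 20_000, size=200)

X = np.vstack([area, price])   # rows are features, columns are observations
S = np.cov(X)                  # 2x2 sample covariance matrix (divides by m - 1)

# Covariance of the pair, computed by hand for comparison.
manual_cov = np.sum((area - area.mean()) * (price - price.mean())) / (len(area) - 1)

print(S)
print("symmetric:", np.allclose(S, S.T))
print("diagonal holds the variances:",
      np.isclose(S[0, 0], area.var(ddof=1)) and np.isclose(S[1, 1], price.var(ddof=1)))
print("off-diagonal holds the covariance:", np.isclose(S[0, 1], manual_cov))
```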

Coming back to our example of Room area (sq ft) vs Room price ($): once we change the basis to the new axes, before dropping one of them to go from 2D to 1D, the new features become uncorrelated with each other; in other words, their covariance is 0.

The covariance matrix is therefore a diagonal matrix.

Summarizing, the properties that we want the features/principal components of matrix Y to exhibit are:

The principal component should be along the direction that maximizes the variance of the projected data.

Features of matrix Y should be uncorrelated with each other, i.e. its covariance matrix should be a diagonal matrix.

Let us revisit the mathematical representation of the goal for PCA that we derived earlier:

PX = Y

X is the original dataset with n features.

P is a transformation matrix that is applied to matrix X.

Matrix Y is the new dataset with n new features/principal components.

We have established the properties of features in matrix Y.

The goal is to reduce redundancy; more precisely, we want the covariance matrix of matrix Y (let’s call it Sy) to be diagonal.

Therefore, in our equation PX = Y, the choice of matrix P should be such that Sy is diagonalized.

We know that a symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors.

Recall the Spectral theorem of Linear Algebra we learnt in Post 2:

If A is symmetric, then A is orthogonally diagonalizable and has only real eigenvalues.
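A quick numerical check of the theorem, as a sketch: the matrix `A` below is an arbitrary symmetric example chosen for illustration, and `np.linalg.eigh` is NumPy's eigen-solver for symmetric (Hermitian) matrices.

```python
import numpy as np

# An arbitrary symmetric matrix, chosen only for illustration.
A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])

# eigh returns real eigenvalues (ascending) and orthonormal eigenvectors as columns of Q.
eigenvalues, Q = np.linalg.eigh(A)

D = Q.T @ A @ Q   # change of basis to the eigenvectors

print("eigenvalues:", eigenvalues)
print("Q is orthogonal:", np.allclose(Q.T @ Q, np.eye(3)))
print("Q^T A Q is diagonal:", np.allclose(D, np.diag(eigenvalues)))
```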

This was indeed the last piece in our puzzle.

In theory, PCA assumes that all the basis vectors, i.e. the rows of matrix P, are orthonormal eigenvectors of the covariance matrix of X.

Applying a transformation P to X results in a matrix Y such that Sy is diagonalized.
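The reason this choice of P works can be spelled out in a few lines. The derivation below is a sketch under assumptions not stated in the post: X is mean-centred, its columns are the m observations, Sx denotes the covariance matrix of X, and E is the matrix whose columns are the orthonormal eigenvectors of Sx, with D the diagonal matrix of eigenvalues.

```latex
S_X = \frac{1}{m-1} X X^{T}, \qquad S_X = E D E^{T}, \qquad E^{T} E = I

S_Y = \frac{1}{m-1} Y Y^{T}
    = \frac{1}{m-1} (PX)(PX)^{T}
    = P \left( \frac{1}{m-1} X X^{T} \right) P^{T}
    = P S_X P^{T}

\text{Choosing } P = E^{T}: \quad
S_Y = E^{T} (E D E^{T}) E = (E^{T} E)\, D\, (E^{T} E) = D
```

So Sy is diagonal, and its diagonal entries (the variances of the new features) are exactly the eigenvalues of Sx.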

Secondly, it assumes that the directions with largest variances are the most important or ‘most principal’.

The rows of matrix P are rank-ordered by their corresponding variances, which in this case are the eigenvalues.

By eliminating the rows in matrix P with low eigenvalues, the directions in which variance is low are ignored.

This makes it possible to effectively reduce the number of dimensions without significant loss in information.
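Putting the pieces together, the whole procedure can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the data is invented, features are assumed to be rows of X, and the number of components kept (`k = 2`) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data: 3 features x 500 observations; the second feature is
# almost a multiple of the first, so one direction is nearly redundant.
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.vstack([a, 2 * a + 0.1 * rng.normal(size=500), b])
X = X - X.mean(axis=1, keepdims=True)     # centre each feature

S_x = np.cov(X)                            # covariance matrix of X
eigenvalues, E = np.linalg.eigh(S_x)       # orthonormal eigenvectors as columns

# Rank-order by variance (eigenvalue), largest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
P = E[:, order].T                          # rows of P are the eigenvectors

Y = P @ X                                  # new features / principal components
S_y = np.cov(Y)                            # should be diagonal

# Drop the rows of P with the lowest eigenvalues.
k = 2
P_k = P[:k]
Y_k = P_k @ X                              # reduced dataset: k x 500
retained = eigenvalues[:k].sum() / eigenvalues.sum()

print("S_y is diagonal:", np.allclose(S_y, np.diag(eigenvalues)))
print(f"fraction of variance retained with k={k}: {retained:.4f}")
```

The retained fraction gives a simple way to judge how many rows of P can be dropped before the loss of information becomes significant.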

References

A Tutorial on Principal Component Analysis, an excellent tutorial for anyone interested in further reading on the topic.

No references list is complete without this undeniably awesome answer on Stack Exchange: Making sense of principal component analysis, eigenvectors & eigenvalues.

Principal Component Analysis (PCA).
