Remember that component 1 is the principal component with the highest variance (since highest variance equates to highest potential signal).

The linear regression connection is useful because it helps us realize that each principal component is a linear combination of the individual features.

So much like how a linear regression model is the weighted sum of our features that adheres most closely to our target variable, the principal components are also weighted sums of our features.

Except in this case, they are the weighted sums that best express the underlying trends in our feature set.
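To make the "weighted sum" idea concrete, here is a minimal sketch using scikit-learn. The feature matrix `X` is made-up random data, not the data from this post. The sketch rebuilds component 1 by hand from the weights that PCA learned.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up feature matrix (200 observations, 4 features), for illustration only
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]  # make two of the features move together

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# Each row of components_ holds one weight per feature
weights = pca.components_[0]

# Component 1 is the weighted sum of the mean-centered features
manual_pc1 = (X - pca.mean_) @ weights

print(np.allclose(manual_pc1, scores[:, 0]))
```

The weights in `pca.components_` play a role similar to regression coefficients, except they are chosen to capture variance in the features rather than to fit a target.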

Going back to our example, we can visually see that the blue line captures more variance than the red line because the distance between the blue ticked lines is longer than the distance between the red ticked lines.

The distance between the ticked lines is an approximation of the variance captured by our principal component — the more the black dots, our data, vary along the principal component’s axis, the more variance it captures.
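Here is a small sketch of what "variance captured along an axis" means in numbers. The data and the two directions (`blue_dir` and `red_dir`) are invented stand-ins for the blue and red lines in the plot, not the actual data behind the figure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two perpendicular unit directions (hypothetical stand-ins for the blue and red lines)
blue_dir = np.array([1.0, 1.0]) / np.sqrt(2)
red_dir = np.array([1.0, -1.0]) / np.sqrt(2)

# Made-up 2-D data: lots of spread along blue, little along red
along_blue = rng.normal(scale=3.0, size=500)
along_red = rng.normal(scale=0.5, size=500)
data = np.outer(along_blue, blue_dir) + np.outer(along_red, red_dir)

# Project the centered data onto each direction and measure the spread
centered = data - data.mean(axis=0)
var_blue = np.var(centered @ blue_dir)
var_red = np.var(centered @ red_dir)
total_var = centered.var(axis=0).sum()

print(var_blue > var_red)
print(np.isclose(var_blue + var_red, total_var))
```

Because the two directions are perpendicular and span the whole plane, the variances along them add up to the total variance of the data.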

Now for component 2, we want to find the second strongest underlying trend with the added condition that it is uncorrelated with component 1.

In statistics, trends and data that are orthogonal (a.k.a. perpendicular) to each other are uncorrelated.

Orthogonal Data

Check out the plot to the left.

I have plotted two features, one in blue and the second in red.

As you can see, they are orthogonal to each other.

All of the variation in the blue feature is horizontal and all the variation in the red one is vertical.

Thus, as the blue feature changes (horizontally), the red feature stays completely constant as it can only change vertically.

Cool, so in order to find component 2, we just need to look for a component with as much variance as possible that is also orthogonal to component 1.
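As a quick sanity check of the orthogonality condition, the sketch below fits scikit-learn's PCA on made-up correlated data (an assumption for illustration, not the data from this post). It confirms that the first two component directions are perpendicular and that the resulting component values are uncorrelated on the data the PCA was fit on.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Made-up data: mix 5 random features together so they are correlated
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=2).fit(X)
pc1_dir, pc2_dir = pca.components_
scores = pca.transform(X)

# Perpendicular directions have a dot product of zero
dot = pc1_dir @ pc2_dir

# Correlation between the two component series
corr = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]

print(abs(dot) < 1e-8, abs(corr) < 1e-8)
```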

Since our earlier PCA example was a very simple one with just two dimensions, we have only one option for component 2, the red line.

In reality, we probably have tons of features so we would need to consider many dimensions when we search for our components but even then, the process is the same.

An Example to Tie it All Together

Let’s go back to our earlier example with stocks.

Instead of just Apple stock, I’ve downloaded data for 30 different stocks representing multiple industries.

If we plot all their daily returns (for 100 days, same as above), we get the following mess of a chart:

100 Days of Stock Returns for 30 Different Stocks

Every stock is sort of doing its own thing, and there is not much to glean from this chart besides that daily stock returns are noisy and volatile.

Let’s use scikit-learn to calculate principal component 1 and then plot it. (PCA is sensitive to the relative scale of your features. Since all my features are daily stock returns, I did not scale the data, but in practice you should consider using StandardScaler or MinMaxScaler.)
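A minimal sketch of this step might look like the following. I am not reproducing the actual stock data here, so the `returns` matrix is synthetic: a shared "market" move plus stock-specific noise, with made-up scales and betas. The scaled variant at the end is optional and only shows where StandardScaler would slot in.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_days, n_stocks = 100, 30

# Synthetic stand-in for real daily returns (an assumption, not real prices):
# every stock follows a shared market move plus its own noise
market = rng.normal(scale=0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
returns = market[:, None] * betas + rng.normal(scale=0.01, size=(n_days, n_stocks))

# Unscaled PCA: rows are days, columns are stocks
pca = PCA(n_components=1)
component_1 = pca.fit_transform(returns)[:, 0]  # one value per day

# Optional variant: standardize each stock's returns before PCA
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=1))
component_1_scaled = scaled_pca.fit_transform(returns)[:, 0]

print(component_1.shape, component_1_scaled.shape)
```

The result is a single time series with one value per day, which is what gets plotted as the black line.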

The black line in the figure below is component 1:

Stock Returns with PCA Component 1 (In Black)

So the black line represents the strongest underlying trend in our stock returns.

“What is it though?” you ask. Good question, and unfortunately, without some domain expertise, we don’t know.

This loss of interpretation is the key drawback of using something like PCA to reduce our much larger feature set into a smaller set of key underlying drivers.

Unless we are lucky or just plain experts of the data, we would not know what each of the PCA components means.

In this case, I would guess that component 1 is the S&P 500 — the strongest underlying trend in all our stock returns data is probably the overall market, whose ebbs and flows impact the prices of each individual stock.

Let’s check this by plotting the S&P 500’s daily returns against component 1 (below).

Almost a perfect fit considering how noisy the data is! The correlation between the S&P 500’s daily returns and principal component 1 is 0.92.

The S&P 500 and Component 1 are Extremely Correlated

So like we guessed, the most important underlying trend in our stock data is the stock market.
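The check itself is just a correlation between two series. Here is a rough sketch on synthetic data, where `market` is a made-up stand-in for the S&P 500's daily returns and `returns` is built from it. The numbers it produces will not match the 0.92 above. Note that the sign of a principal component is arbitrary, so it makes sense to look at the absolute value of the correlation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n_days, n_stocks = 100, 30

# Synthetic stand-ins (assumptions, not real S&P 500 or stock data)
market = rng.normal(scale=0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
returns = market[:, None] * betas + rng.normal(scale=0.01, size=(n_days, n_stocks))

component_1 = PCA(n_components=1).fit_transform(returns)[:, 0]

# Correlation between the market series and component 1
corr = np.corrcoef(market, component_1)[0, 1]

print(f"absolute correlation: {abs(corr):.2f}")
```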

The scikit-learn implementation of PCA also tells us how much variance each component explains — component 1 explains 38% of the total variance in our feature set.

Let’s take a look at another principal component.

Below, I have plotted components 1 (in black) and 3 (in green).

As expected, they have a low correlation with each other (0.08).

Unlike component 1, component 3 only explains 9% of the variance in our feature set, much lower than component 1’s 38%.
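These percentages come from the `explained_variance_ratio_` attribute of a fitted scikit-learn PCA. Here is a short sketch on synthetic returns (made-up data, so the shares will differ from the 38% and 9% above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_days, n_stocks = 100, 30

# Synthetic stand-in for daily returns: shared market move plus noise (an assumption)
market = rng.normal(scale=0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
returns = market[:, None] * betas + rng.normal(scale=0.01, size=(n_days, n_stocks))

pca = PCA(n_components=3).fit(returns)

# Share of total variance explained by each component, largest first
ratios = pca.explained_variance_ratio_

for i, ratio in enumerate(ratios, start=1):
    print(f"component {i}: {ratio:.1%} of total variance")
```

The components always come back sorted by explained variance, so component 3 can never explain more than component 1.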

And unfortunately, I have no idea what component 3 represents — this is where PCA’s lack of interpretation comes to bite us.

Principal Components 1 and 3

Conclusion

In this post we saw how PCA can help us uncover the underlying trends in our data, a super useful ability in today’s big data world.

PCA is great because:

It isolates the potential signal in our feature set so that we can use it in our model.

It reduces a large number of features into a smaller set of key underlying trends.

However, the drawback is that when we run our features through PCA, we lose a lot of interpretability.

Without domain expertise and a lot of guessing, we probably wouldn’t know what any of the components beyond the top one or two represent.

But generally this is not a deal breaker.

If you are convinced that there is ample signal in your large set of features, then PCA remains a useful algorithm that lets you extract most of that signal into a handful of components, so your model has fewer inputs and less room to overfit.
