“Interesting” Projections — Where PCA Fails.

An attractive alternative for exploratory data analysis.

Steve Driscoll · Jan 16

Most data scientists are familiar with principal components analysis (PCA) as an exploratory data analysis tool.

A recap for the uninitiated: researchers often use PCA for dimensionality reduction in hopes of revealing useful information in their data (e.g., disease vs. non-disease class separation).

PCA does this by finding orthogonal projection vectors that account for the maximum amount of variance in the data.

In practice, this is usually done using singular value decomposition (SVD) to find the principal components (eigenvectors) weighted by their contribution (eigenvalues) to the total variance in the data.
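The post offers MATLAB code on request; purely as an illustration, here is a minimal NumPy sketch of PCA via SVD (random data stands in for a real chemical data set):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # 200 samples, 5 variables (toy data)
Xc = X - X.mean(axis=0)                # column mean-centre first

# SVD of the centred data: the rows of Vt (columns of V) are the principal
# components (eigenvectors), and the squared singular values, divided by
# n - 1, are the variances (eigenvalues) each component accounts for.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt.T                      # one PC per column
explained_var = s**2 / (len(X) - 1)

scores = Xc @ components               # projections of the samples
```

The variance of each score column matches the corresponding eigenvalue, which is the sense in which PCA "operates on variance".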

Undoubtedly, PCA is the most used tool for data analysis in my field (chemistry) and many others, but what happens when it doesn’t work? Does it mean the sampling experiment was bad? No. Does it mean there isn’t useful information in the data? No.

Our group at Dalhousie University works on developing new data analysis tools for chemistry.

Today, I am going to tell you about an alternative to PCA called projection pursuit analysis (PPA).

General factor analysis model

PCA Operates on Variance

Where does PCA fail? As mentioned previously, PCA operates by finding the direction of maximum variance in your data.

What if the projection onto that direction is not useful? The graphic below shows simulated data (200 samples) that form two separated clusters with greater variance along the y-axis than along the x-axis.

If we do PCA on this 2-dimensional data, the projection vector we obtain, v, will be a 2 x 1 column vector ([0; 1]).

The original data, X (200 x 2), projected onto this vector gives us our scores T=Xv.

Visualizing these scores shows that there is no apparent separation between the two clusters.

Instead, if we project onto the x-axis (v=[1; 0]) then it is easy to see the separation in the clusters.
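This scenario can be sketched in a few lines of NumPy (the cluster centres and spreads here are illustrative choices, not taken from the post): PCA locks onto the high-variance y-axis, while projecting onto the x-axis separates the clusters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two clusters separated along x, with much larger variance along y
# (centres at -4 and +4, x spread 0.5, y spread 10 -- all assumed values).
n = 100
x = np.concatenate([rng.normal(-4, 0.5, n), rng.normal(4, 0.5, n)])
y = rng.normal(0, 10, 2 * n)
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)

# PCA: the first right singular vector is the direction of maximum variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v = Vt[0]                          # ~[0, 1]: PCA picks the y-axis
t_pca = Xc @ v                     # scores T = Xv: the clusters overlap

t_x = Xc @ np.array([1.0, 0.0])    # project onto x instead: clusters split
```

Here `t_pca` mixes the two clusters completely, while `t_x` separates them cleanly, even though `t_x` is the lower-variance projection.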

How do we find this vector in high-dimensional data?

“Interesting” projections are projections that reveal, for example, class information.

Projection pursuit

Projection pursuit, originally proposed by Friedman and Tukey (1974), attempts to find “interesting” projections of the data by maximizing or minimizing a projection index.

Viewed this way, PCA is the special case in which the projection index (variance) is maximized.

The question now is: what are good projection indexes?

Plenty of research has been done on defining new projection indexes, but the one I will focus on today, which has been proven to be useful for exploring chemical data, is kurtosis.

Kurtosis-based projection pursuit

The fourth statistical moment, kurtosis, has proved useful as a projection index (https://www.sciencedirect.com/science/article/pii/S0003267011010804).

Univariate kurtosis

When kurtosis is maximized, it tends to reveal outliers in your data.

Somewhat useful, but not really what we are looking for to reveal class or cluster information.

However, when kurtosis is minimized, it separates the data into two groups in one dimension (4 groups in 2 dimensions and 8 groups in 3 dimensions).
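A quick numerical illustration of both behaviours, using the plain moment ratio m4/m2² (for which a Gaussian gives about 3):

```python
import numpy as np

def kurtosis(t):
    """Fourth-moment ratio m4 / m2^2 (Gaussian ~ 3; no excess correction)."""
    t = t - t.mean()
    return np.mean(t**4) / np.mean(t**2) ** 2

# An outlier-dominated sample has very high kurtosis...
with_outliers = np.array([-10.0] + [0.0] * 98 + [10.0])

# ...while two tight, well-separated groups approach the minimum value of 1.
two_groups = np.array([-1.0] * 50 + [1.0] * 50)

print(kurtosis(with_outliers))   # 50.0
print(kurtosis(two_groups))      # 1.0
```

Maximizing this index chases the outlier-heavy directions; minimizing it favours the bimodal, group-revealing ones.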

Kurtosis minimization

The big question is how to search for these projection vectors using kurtosis. The answer is the quasi-power learning algorithm; see https://www.sciencedirect.com/science/article/pii/S0003267011010804.

In this paper, Hou and Wentzell show that the projection vectors that minimize kurtosis can be found using the following learning algorithm:

Finding the projection vector that minimizes kurtosis

Example simulation

Let’s simulate some data and apply both PCA and PPA.

Similar to the opening graphic, our data will have 2 classes (100 samples in each class) and only 1 dimension will be needed to reveal the class separation.

The first class will be centered at -4 on the x-axis with a standard deviation of 5 and the second class will be centered on +4 with a standard deviation of 5.

Original data

To make this more realistic, let’s rotate this 200 x 2 matrix into 600 dimensions by multiplying by a 2 x 600 random rotation matrix.

This is where we now need to utilize our exploratory tools to find some interesting projections of our data.

First, let’s column mean centre our data, apply PCA, and visualize the first component as a function of sample number.
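The pipeline so far can be sketched as follows, under stated assumptions: the post does not specify the second simulated dimension, so higher-variance noise (standard deviation 10) is assumed so that PCA latches onto it, and a random Gaussian matrix stands in for the rotation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two classes of 100 samples each; only x carries class information.
# Post gives centres -4/+4 and std 5 for x; y (std 10) is assumed here.
n = 100
labels = np.repeat([0, 1], n)
x = np.concatenate([rng.normal(-4, 5, n), rng.normal(4, 5, n)])
y = rng.normal(0, 10, 2 * n)
X2 = np.column_stack([x, y])                 # 200 x 2

R = rng.normal(size=(2, 600))                # random rotation-like map
X = X2 @ R                                   # 200 x 600

Xc = X - X.mean(axis=0)                      # column mean-centre
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]                      # first PC scores

# PC1 chases the high-variance noise direction, not the class split,
# so its correlation with the class labels stays near zero.
r = np.corrcoef(pc1_scores, labels)[0, 1]
```

Plotting `pc1_scores` against sample number reproduces the kind of figure described below: no visible class structure.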

First component from PCA

We see that projecting the data down onto the first PC reveals no class information. Let’s apply PPA now.

First scores from PPA

PPA is able to find the projection that is useful to us (i.e., the one that provides the class separation).
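How is such a projection found in practice? Hou and Wentzell’s quasi-power update is not reproduced here; as a rough stand-in, a kurtosis-minimizing direction can be located by a brute-force scan over candidate directions on a small 2-D example (feasible only in low dimensions, but enough to illustrate the idea):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes separated along the first variable; the second is noise
# (all parameter values here are illustrative assumptions).
n = 100
x = np.concatenate([rng.normal(-4, 0.5, n), rng.normal(4, 0.5, n)])
X = np.column_stack([x, rng.normal(0, 3, 2 * n)])
X = X - X.mean(axis=0)

def proj_kurtosis(w):
    t = X @ (w / np.linalg.norm(w))     # kurtosis is scale-invariant in w
    return np.mean(t**4) / np.mean(t**2) ** 2

# Scan unit vectors over half the circle (directions repeat after pi).
angles = np.linspace(0.0, np.pi, 1800, endpoint=False)
candidates = np.column_stack([np.cos(angles), np.sin(angles)])
w = min(candidates, key=proj_kurtosis)

t_ppa = X @ w    # scores along the kurtosis-minimizing direction
```

The winning direction lines up with the class-separating x-axis, and `t_ppa` splits the two groups; the quasi-power algorithm reaches an equivalent vector iteratively, which is what makes the search tractable in high dimensions.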

Where PPA has trouble

Although PPA has been found to perform better than PCA in most cases, there are some important notes to make about when PPA does not work.

PPA does not work well when the class sizes are not equal; for example, if I make a 5:1 class ratio in the example above and apply PPA we get:

PPA also struggles when the number of classes is not a power of 2, due to the geometry of the separation.

PPA also struggles with overfitting, so data compression usually needs to be performed first (roughly a 10:1 sample-to-variable ratio is needed). If not, the algorithm will artificially push samples into corners.
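One common way to reach that ratio (a sketch under assumptions; the group’s actual compression step may differ) is to replace the original variables with a truncated set of PCA scores before running the pursuit:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 600))          # stand-in for the rotated data

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep roughly 10 samples per retained variable: 200 samples -> 20 scores.
k = X.shape[0] // 10
X_compressed = U[:, :k] * s[:k]          # 200 x 20 matrix of PCA scores

# Projection pursuit would then be run on X_compressed instead of X.
```

Compression with PCA here is only a preprocessing step; the pursuit itself still chooses the projection by kurtosis, not variance.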

Current work in our group is on developing methods to alleviate these problems and (good news) we should be publishing a few papers on this in the coming months! I will be sure to keep you all updated.

If anyone would like MATLAB code to perform PPA let me know.

It is freely available.

Steve.