Why correlation might tell us nothing about outliers
Gevorg Yeghikyan, Jun 6

Introduction
We often hear claims à la “there is a high correlation between x and y.” This is especially true with alleged findings about human or social behaviour in psychology, the social sciences or economics.
A reported Pearson correlation coefficient of 0.8 indeed seems high in many cases and escapes our critical evaluation of its real meaning.
So let’s see what correlation actually means and whether it really conveys the information we often believe it does.
Inspired by the funny spurious correlation project as well as Nassim Taleb’s Medium post and Twitter rants, in which he laments the total ignorance and misuse of probability and statistics by psychologists (and not only them), I decided to reproduce his note on how much information the correlation coefficient conveys under the Gaussian distribution.
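Bivariate Normal Distribution
Let’s say we have two standard normally distributed variables X and Y with covariance structure

$$\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}.$$

Because both variables are standard normal (unit variance), the correlation equals the covariance: ρ = 0.8.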
If we hear someone reporting this correlation between, say, IQ and “success” (whatever it means), it would probably sound convincing to most of us.
In other words, given a convincing correlation of 0.8, we would be prone to believe that a “high” IQ score would in most cases predict “high success”.
This is where we would be wrong.
Let’s visualize the bivariate distribution of X and Y:

[Figure: Standardized Bivariate Normal Distribution with ρ = 0.8]

Proportion of uncertainty
In order to understand what the correlation tells us in different regions of the data distribution, let’s consider the ratio of the probability of both X and Y exceeding a certain threshold K under a correlation structure ρ, over the probability of both X and Y exceeding this threshold given ρ = 1.
Taleb calls this ratio the “proportion of uncertainty”:

$$\varphi(\rho, K) = \frac{P(X > K,\ Y > K \mid \rho)}{P(X > K,\ Y > K \mid \rho = 1)}$$

Before moving on to evaluate φ(ρ, K), let’s first take a look at what the threshold K represents:

[Figure: Standardized Bivariate Normal Distribution with ρ = 0.8 and probability threshold K = 2]

In the image above, K = 2, and the shaded region represents the subset of the sample space for which both X > K and Y > K.
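To make the shaded region concrete, here is a rough Monte Carlo sketch that estimates the probability of both X and Y exceeding K by simulation. It is not taken from the original notebook; the function name `simulate_joint_tail`, the sample size and the seed are illustrative assumptions.

```python
import numpy as np


def simulate_joint_tail(rho, k, n_samples=1_000_000, seed=0):
    """Estimate P(X > k, Y > k) by sampling a standard bivariate normal.

    Hypothetical helper for illustration only. Returns the estimated
    probability and the sample correlation of the draws.
    """
    rng = np.random.default_rng(seed)
    # Unit variances on the diagonal, so the covariance equals the correlation.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_samples)
    x, y = samples[:, 0], samples[:, 1]
    # Fraction of draws that land in the region where both coordinates exceed k.
    both_exceed = (x > k) & (y > k)
    sample_corr = np.corrcoef(x, y)[0, 1]
    return both_exceed.mean(), sample_corr


tail_prob, sample_corr = simulate_joint_tail(rho=0.8, k=2.0)
print(f"sample correlation: {sample_corr:.3f}")
print(f"estimated P(X > 2, Y > 2): {tail_prob:.5f}")
```

Simulation is fine for a threshold like K = 2, but it becomes noisy further out in the tails, where very few draws land in the region of interest.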
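In order to evaluate φ(ρ, K), we notice that the joint probability

$$P(X > K,\ Y > K \mid \rho) = \int_K^{\infty}\!\int_K^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left(-\frac{x^2 - 2\rho x y + y^2}{2(1-\rho^2)}\right) dx\, dy$$

is impossible to integrate analytically, so we have to resort to numerical computation.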
Let’s see how we can do it.
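One possible way to do this, which is not necessarily the approach used in the original notebook, is to reduce the double integral to a single one. Conditional on X = x, Y is normal with mean ρx and variance 1 − ρ², so the joint tail probability is the integral over x > K of the density of X times P(Y > K | X = x). At ρ = 1 the two variables coincide, so the denominator is simply P(X > K). The function names `joint_tail_prob` and `proportion_of_uncertainty` and the tolerance settings below are assumptions made for this sketch.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def joint_tail_prob(rho, k):
    """P(X > k, Y > k) for a standard bivariate normal with correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    cond_sd = np.sqrt(1.0 - rho**2)

    def integrand(x):
        # Density of X at x times P(Y > k | X = x),
        # where Y | X = x ~ N(rho * x, 1 - rho**2).
        return norm.pdf(x) * norm.sf((k - rho * x) / cond_sd)

    # epsabs=0 makes quad aim for relative accuracy, which matters
    # because tail probabilities can be very small.
    value, _abs_err = quad(integrand, k, np.inf, epsabs=0.0, epsrel=1e-9)
    return value


def proportion_of_uncertainty(rho, k):
    """Ratio of the joint tail probability at rho to the one at rho = 1."""
    # At rho = 1, X and Y coincide, so P(X > k, Y > k) = P(X > k).
    return joint_tail_prob(rho, k) / norm.sf(k)


# Values of the ratio on a grid of correlations for K = 2,
# ready to be handed to a plotting library.
rhos = np.linspace(0.0, 0.99, 100)
phi_k2 = np.array([proportion_of_uncertainty(r, 2.0) for r in rhos])

for k in (1.0, 2.0, 4.0):
    print(f"K = {k}: phi(0.5, K) = {proportion_of_uncertainty(0.5, k):.4f}")
```

A quick sanity check on the sketch: for K = 0 the joint probability has the closed form 1/4 + arcsin(ρ)/(2π), so at ρ = 0.5 the ratio should come out as 2/3.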
Having φ(ρ, K), we plot it against ρ, and obtain:

[Figure: Information conveyed by correlation under the Gaussian distribution]

Conclusion
What we can see from the plot is that the information conveyed by the correlation between X and Y is not proportional to ρ: it stays low over most of the range and rises sharply only as ρ approaches 1.
From a practical point of view, this means that a correlation of 0.5, for instance, carries very little information (φ is somewhere between 0.1 and 0.3) for ordinary values (up to two standard deviations away) and carries essentially no information about the tails (i.e., outliers or outperformers).
In other words, observing a rather high value for X (e.g., 4–5σ), we cannot claim that it predicts a high Y.
Returning to Taleb’s attack on the validity of psychometric tests, the result obtained above means, to quote Taleb, that you need something >.98 to “explain” genius.
The complete Jupyter Notebook with all the code for this post can be found here.