We’ll answer that question in this article.
We’ll also develop an intuitive feel for the equation for Pearson’s correlation coefficient.
Sachin Date, Jun 26

When you dive into the sea of knowledge that is data science, one of the first fish you spot is correlation and its cousin, auto-correlation.
Unless you take some time out to get to know them, it is impossible to get much done in data science.
So let’s get to know them.
In the most general sense, a correlation between two variables can be thought of as some kind of relationship between them, i.e. when one variable’s value changes, the other one’s value changes in a predictable manner, most of the time.
In practice, the word correlation is usually used to describe linear relationships (and sometimes, nonlinear relationships) between variables.
I’ll come to the linearity aspect in a minute.
Meanwhile here is an example of two possibly correlated variables.
We say ‘possibly’ because it is a hypothesis that must be tested and proven.
Relationship between City and Highway fuel economy of passenger vehicles. Source: UC Irvine ML Repository

Let’s draft a few informal definitions.
Linear Correlation: If values of two correlated variables change at a constant rate with respect to each other they are said to have a linear correlation with each other.
With Linear Correlation in mind, let’s revisit our example:

Possibly linearly correlated variables. Source: The Automobile Data Set, UC Irvine ML Repository

If the correlation in this case is linear, a Linear Regression Model (i.e. a straight line), upon being fitted to the data, ought to be able to adequately explain the linear signal in this data set.
Here is what the fitted model (black line) looks like for this data set:

A Linear Regression Model fitted to 80% of the data points in the City versus Highway MPG data set

In the above example you can now use the fitted model to predict Highway MPG values corresponding to City MPG values that the model has not seen but which are within the range of the training data set.
Here is the plot of the predictions of the fitted Linear Model on a hold-out data set which contains 20% of the original data that the model did not see during the fitting process.
Actual versus predicted Highway MPG on the 20% hold-out set

For the programmatically inclined, the following Python code produced these results.
You can get the data used in the example from here.
If you use this data in your work, be sure to give a shout-out to the folks at the UC Irvine ML Repository.
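The listing below is not the code that produced the plots above. It is a minimal, self-contained sketch of the same fit-then-hold-out procedure, assuming synthetic numbers in place of the Automobile data set: the slope of 1.1, the intercept of 4.0 and the noise level are made up for illustration, and `fit_line` is a hypothetical helper that does ordinary least squares by hand.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic stand-in for City/Highway MPG (an assumption, not the real data).
rng = random.Random(0)
city_mpg = [rng.uniform(13, 49) for _ in range(200)]
highway_mpg = [1.1 * c + 4.0 + rng.gauss(0, 1.5) for c in city_mpg]

# Fit on the first 80% of the points, hold out the remaining 20%.
split = int(0.8 * len(city_mpg))
train_x, test_x = city_mpg[:split], city_mpg[split:]
train_y, test_y = highway_mpg[:split], highway_mpg[split:]

slope, intercept = fit_line(train_x, train_y)
predicted = [intercept + slope * x for x in test_x]

# Root-mean-square error of the predictions on the hold-out set.
rmse = (sum((p - a) ** 2 for p, a in zip(predicted, test_y)) / len(test_y)) ** 0.5
print(f"slope={slope:.3f} intercept={intercept:.3f} hold-out RMSE={rmse:.3f}")
```

Because the synthetic data really is a straight line plus noise, the fitted slope and intercept should land close to the values used to generate it, and the hold-out error should be about the size of the noise.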
Now let’s look at nonlinear relationships.
Nonlinear correlation: If the values of correlated variables do not change at a constant rate with respect to each other they are said to have a nonlinear relationship or a nonlinear correlation with each other.
Here is an example of what looks like a case for nonlinear correlation.
Unless one transforms the dependent variable (in our example, Highway MPG) so as to make the relation linear, a Linear Regression Model will not be able to adequately ‘explain’ the information contained within such nonlinear relationships.
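To make the idea of such a transformation concrete, here is a small sketch that assumes a hypothetical exponential relationship, y = 3·exp(0.5·x), which is not taken from the MPG data. It compares how much y changes per unit step in x before and after taking the logarithm of y.

```python
import math

xs = list(range(10))
# Hypothetical nonlinear relationship (an assumption for illustration only).
ys = [3.0 * math.exp(0.5 * x) for x in xs]


def step_changes(values):
    """Change in value between each pair of consecutive points."""
    return [b - a for a, b in zip(values, values[1:])]


raw_steps = step_changes(ys)                          # keeps growing: not a constant rate
log_steps = step_changes([math.log(y) for y in ys])   # constant rate after the transform

print([round(s, 2) for s in raw_steps])
print([round(s, 2) for s in log_steps])
```

The raw steps grow with every unit of x, while the steps of log(y) are all the same, so the transformed variable changes at a constant rate and a straight line can describe it.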
Positive Correlation: Two correlated variables are said to be positively correlated if, most of the time, when one variable’s value increases (or decreases), the other variable’s value also increases (or decreases).
Here is an example that suggests a positive correlation between the two variables:

Two variables that appear to be positively correlated. Source: The Automobile Data Set, UC Irvine ML Repository

Negative Correlation: Two correlated variables are said to be negatively correlated if, most of the time, when one variable’s value increases (or decreases), the other variable’s value decreases (or increases).
Here is an example that suggests a negative correlation:

Two variables that appear to be negatively correlated. Source: The Automobile Data Set, UC Irvine ML Repository

Measuring the amount of correlation

Let’s look at the following two scatter plots.
Both plots seem to suggest a positive correlation between the respective variables.
But the correlation is stronger in the first plot as the points are more tightly packed together along an invisible straight line slicing through the points.
The coefficient of correlation between two variables quantifies how tightly coupled the movements of the two variables are with respect to each other.
The formula for the coefficient of correlation between two variables that have a linear relationship is:

$$ r_{XY} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y} $$

Formula for the coefficient of correlation between variables X and Y

The two sigmas in the denominator are the standard deviations of the respective variables.
When calculated in this way, this coefficient is called the Pearson’s coefficient of correlation. It is represented by the symbol ‘r’ when used for the sample and by the symbol rho when used for the entire population of values.
Note that when you want to use the Pearson’s correlation coefficient to calculate the correlation for the population be sure to use the formulae for the population while computing the covariance and the standard deviations.
The value of this coefficient ranges smoothly from -1.0 to 1.0.
When the variables are negatively correlated, r lies in [-1, 0); when they are positively correlated, r lies in (0, +1].
When they are not linearly correlated, r = 0.
Let me emphasize that last bit again: when two variables are not linearly correlated, the Pearson’s coefficient’s value is zero (in a finite sample, close to zero), and vice versa. A zero value rules out only a linear relationship; the variables may still be related nonlinearly.
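As a quick check of these ranges, here is a minimal sketch that computes the coefficient by hand for three made-up data sets. The helper name `pearson_r` and the data are assumptions for illustration. The n or (n-1) factors in the covariance and the standard deviations cancel, so the helper works directly with the sums.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
r_rising = pearson_r(xs, [2 * x + 1 for x in xs])      # rising straight line
r_falling = pearson_r(xs, [-0.5 * x + 4 for x in xs])  # falling straight line
r_parabola = pearson_r(xs, [x * x for x in xs])        # symmetric parabola

print(r_rising, r_falling, r_parabola)
```

The two straight lines give +1 and -1. The parabola gives 0 even though y is completely determined by x, which is a reminder that a zero coefficient speaks only about linear relationships.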
Intuition for the Pearson’s coefficient

To really understand what’s going on inside the Pearson’s formula one must first understand covariance.
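Just like correlation, the covariance between two variables measures how tightly coupled the values of the two variables are.
When used for measuring the tightness of a linear relationship between two variables, covariance is calculated using the following formulae:

$$ \mathrm{cov}(X, Y)_{\text{sample}} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} $$

$$ \mathrm{cov}(X, Y)_{\text{population}} = \frac{\sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y)}{n} $$

Let’s break down these formulae term by term:

As mentioned before, covariance measures how synchronously the values of variables change w.r.t. each other.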
Since we want to measure the change in value, the change must be anchored with respect to a fixed value.
That fixed value is the mean of that variable’s data series.
For the sample covariance, we use the sample mean, and for the population covariance, we use the population mean.
Using the mean as the goal post also centers each value around its mean.
This explains the subtraction of the respective means from X and Y in the numerator.
The multiplication of the centered values in the numerator ensures that the product is positive when X and Y both rise above, or both fall below, their respective means.
If X rises above its mean but Y falls below its mean (or the other way round), the product is negative.
The summation in the numerator ensures that if the positive valued products more or less balance off the negative valued products, the net sum is going to be a tiny number, implying that there is no dominant positive or negative pattern in the way the two variables are moving w.r.t. each other.
In this case the covariance value will be small.
On the other hand, if the positive products dominate the negative ones, or the other way round, then the sum will be a large positive or a large negative number respectively, signifying a net positive or a net negative pattern of movement between the two variables.
Finally, the n or the (n-1) in the denominator averages things out over the available degrees of freedom.
In the sample, one degree of freedom is used up by the sample mean so we divide by (n-1).
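The term-by-term description above can be sketched in a few lines. The function name `covariance`, its `sample` flag and the five data points are assumptions for illustration. For these points the centered products sum to 6, so the sample covariance is 6/4 = 1.5 and the population covariance is 6/5 = 1.2.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equally long series.

    Divides by (n - 1) for a sample and by n for a population.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Center each value on its mean, multiply pairwise, then sum.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sample_cov = covariance(xs, ys, sample=True)
population_cov = covariance(xs, ys, sample=False)
print(sample_cov, population_cov)
```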
Covariance is wonderful, but…

Covariance is a wonderful way to quantify the movement of variables with respect to each other but it has some problems.
Covariance is difficult to interpret when the units of the two variables are different.
For instance, if X is in dollars and Y is in pounds sterling, the unit of covariance between X and Y becomes dollar times pound sterling.
How can one possibly interpret that? Even when both X and Y have the same unit, say dollar, the unit of covariance becomes… dollar times dollar! Still not easy to understand.
Bummer!

There is also the problem of range.
When X and Y vary over a small interval, say [0,1] you will get a deceptively tiny covariance value even if X and Y move together very tightly.
Finally, because X and Y can have different units and quite possibly a different range, it is often impossible to objectively compare the covariance between one pair of variables with that of another pair of variables.
For example, say I want to compare how much stronger or weaker the linear relation between a vehicle’s fuel economy and its length is, as compared to the relation between its fuel economy and curb weight.
Using covariance for this comparison requires comparing two values that are in two different units and span two different ranges.
That can be problematic, to say the least.
Clearly there is a need to re-scale the covariance so that the range is standardized and also to solve its ‘units’ problem.
Enter Standard Deviation.
In simple terms, standard deviation measures how far, on average, the data departs from its mean (strictly speaking, it is the root-mean-square departure).
Standard deviation also has the nice property that it has the same unit as the original variable.
So let’s divide the covariance by the standard deviations of the two variables.
Doing so will re-scale the covariance so that it is now expressed in multiples of the standard deviation, and it will also cancel out the units of measurement from the numerator.
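Here is a small sketch of that re-scaling at work. It assumes two made-up series in dollars and then re-expresses the very same values in cents. The names `covariance` and `std_dev` are hypothetical helpers using the sample (n-1) formulae.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def std_dev(xs):
    """Sample standard deviation: square root of a series' covariance with itself."""
    return math.sqrt(covariance(xs, xs))


# Made-up values in dollars (an assumption for illustration).
price_usd = [10.0, 12.0, 15.0, 11.0, 18.0]
cost_usd = [7.0, 8.0, 11.0, 8.0, 13.0]

# The same quantities expressed in cents.
price_cents = [100 * p for p in price_usd]
cost_cents = [100 * c for c in cost_usd]

cov_usd = covariance(price_usd, cost_usd)
cov_cents = covariance(price_cents, cost_cents)

r_usd = cov_usd / (std_dev(price_usd) * std_dev(cost_usd))
r_cents = cov_cents / (std_dev(price_cents) * std_dev(cost_cents))

print(cov_usd, cov_cents)  # differ by a factor of 100 * 100
print(r_usd, r_cents)      # the same, unit-free number
```

Switching units multiplies the covariance by ten thousand although nothing about the relationship has changed, while the re-scaled value stays put and stays within -1 to +1.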
All troubles with covariance solved in two simple divisions! Here is the resulting formula:

$$ \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y} $$

Now where have we seen this formula before? It is of course the Pearson’s correlation coefficient!

Auto-correlation

Auto or self correlation is the correlation of a variable with a value that the variable took on k units (of time) in the past.
For example air-temperature of a place might be auto-correlated with the air temperature of the same place 12 months ago.
Auto-correlation has meaning for variables which are indexed to a scale that can be ordered, i.e. an ordinal scale.
The time scale is an example of an ordinal scale.
Just like correlation, auto-correlation can be linear or nonlinear, positive or negative, or it can be zero.
The formula for auto-correlation when used for a linearly auto-correlated relationship between a variable and a k-lagged version of itself is as follows:

$$ r_k = \frac{\sum_{t=k+1}^{n} (Y_t - \bar{Y})(Y_{t-k} - \bar{Y})}{\sum_{t=1}^{n} (Y_t - \bar{Y})^2} $$

Formula for k-lagged auto-correlation of Y

Let’s develop our understanding of auto-correlation a little further by looking at another data set:

Monthly average maximum temperature of Boston, MA from Jan 1998 to Jun 2019. Weather data source: National Centers for Environmental Information

The above plot shows the monthly average maximum temperature of the city of Boston.
It is calculated by averaging, over each month, the daily maximum temperatures recorded by a weather station in that month, for the period from January 1998 through June 2019.
Let’s plot the temperature against a time lagged version of itself for various lags.
Monthly average maximum temperature of Boston, MA plotted against a lagged version of itself. Source: National Centers for Environmental Information

The LAG 12 plot shows a strong positive linear relationship between the average maximum temperature for a month and the average maximum of the same month one year ago.
There is also a strong negative auto-correlation between data points that are six months apart, i.e. at LAG 6.
Overall there is a strong seasonal signal in this data as one might expect to find in weather data of this kind.
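The same seasonal pattern can be reproduced on a toy series. The sketch below does not use the Boston data. It assumes a noise-free sine wave with a 12 month period as a stand-in, and a hypothetical helper `autocorrelation` that implements the usual sample formula: the lagged products of the centered values, divided by the sum of squared centered values.

```python
import math


def autocorrelation(ys, k):
    """Sample auto-correlation of the series ys at lag k."""
    n = len(ys)
    mean_y = sum(ys) / n
    num = sum((ys[t] - mean_y) * (ys[t - k] - mean_y) for t in range(k, n))
    den = sum((y - mean_y) ** 2 for y in ys)
    return num / den


# 20 years of synthetic monthly values with a yearly cycle (an assumption,
# standing in for the temperature series).
temps = [15.0 + 12.0 * math.sin(2 * math.pi * m / 12) for m in range(240)]

r_lag3 = autocorrelation(temps, 3)
r_lag6 = autocorrelation(temps, 6)
r_lag12 = autocorrelation(temps, 12)
print(round(r_lag3, 3), round(r_lag6, 3), round(r_lag12, 3))
```

On this toy series the coefficient is strongly positive at lag 12, strongly negative at lag 6 and close to zero at lag 3, where a plot of the series against its lagged self would trace a loop, not a line.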
The following auto-correlation heat map shows the correlation between every combination of T and T-k.
For us the column of interest is outlined in blue.

Correlation heat map

Within the first column, the square of interest is the one at (Monthly Average Maximum, TMINUS12) and maybe the one at (Monthly Average Maximum, TMINUS6).
Now if you refer back to the scatter plot collage, you will notice that the relationship for all other combinations of lags is nonlinear.
So in any linear seasonal model we attempt to build for this data, the utility of the correlation coefficient values that were generated for these nonlinear relationships (i.e. for the remaining squares in the heat map) is severely limited, and they should not be used even if some of them have large values.
Remember that (auto)correlation coefficients, when calculated using the formulae that were mentioned earlier, are useful only when the relationship is linear.
If the relationship is nonlinear we need a different method to quantify the strength of the nonlinear relationship.
For example, the Spearman’s rank correlation coefficient can be used to quantify the strength of the relationship between variables that have a nonlinear, monotonic relationship.
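As a rough illustration of the difference, the sketch below assumes a made-up monotonic but strongly curved relationship, y = exp(2x), and compares the Pearson’s coefficient with the Spearman’s coefficient, computed here as the Pearson’s coefficient of the ranks. The helpers `pearson_r`, `ranks` and `spearman_rho` are hypothetical names, and `ranks` assumes there are no tied values.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from centered sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank of each value, 1 for the smallest. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [math.exp(2 * x) for x in xs]  # monotonic, far from a straight line

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print(round(r, 3), round(rho, 3))
```

Because y always rises when x rises, the ranks line up perfectly and the Spearman’s coefficient is 1, while the Pearson’s coefficient is noticeably lower because the points do not fall on a straight line.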
Here is the Python code for plotting the temperature time series, the scatter plot collage and the heat map:

And here is the data set.
Finally, a word of caution.
A correlation between two variables X and Y, whether linear or nonlinear, does not automatically imply a cause-effect relationship between X and Y (whereas a cause-effect relationship will usually show up as some form of correlation).
Even when there is a large correlation seen between X and Y, X may not be directly influencing Y or vice versa.
Maybe there is a hidden variable, called a confounding variable, that is simultaneously influencing both X and Y so that they rise and fall in sync with each other.
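A small simulation makes the point. It is a sketch under made-up assumptions: a hidden variable Z drives both X and Y with coefficients 2.0 and 3.0, plus independent noise, and X and Y never influence each other. `pearson_r` is a hypothetical helper.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r computed from centered sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)

# Hidden confounder Z, and two variables that each depend only on Z plus noise.
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [2.0 * zi + rng.gauss(0, 1) for zi in z]
y = [3.0 * zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)

# Remove Z's contribution (using the known, made-up coefficients) and look again.
x_without_z = [xi - 2.0 * zi for xi, zi in zip(x, z)]
y_without_z = [yi - 3.0 * zi for yi, zi in zip(y, z)]
r_without_z = pearson_r(x_without_z, y_without_z)

print(round(r_xy, 3), round(r_without_z, 3))
```

X and Y come out strongly correlated, yet once the influence of Z is taken out, what remains of X and Y is essentially uncorrelated.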
For illustration, consider the following graph that shows two data sets plotted against each other.
Source: World Bank

Here X is a time series that ranges from 1990 to 2016 and contains the fraction of the world’s population that had access to electricity in each of those years.
The variable Y is also a time series that ranges from 1990 to 2016 and contains the size of the world-wide labor force in each of those years.
The two data sets are obviously highly correlated.
You be the judge of whether there is any cause and effect!