Did You Know The Importance Of Finding Correlations In Data Science?

Correlations played a major role in the financial crises.

During the crises, correlations across the global markets were extremely positive.

As a result, assets across the world fell together.

During a recession, the correlations between assets change completely.

The correlations for equity and senior tranches increased significantly.

This meant that the losses in one tranche caused losses in the other tranche.

It was not expected at all.

It’s important to model the correlation and calculate it on a continuous basis.

Real World Use Case 4

As the Euro was devalued in 2012, US exporters experienced losses.

When US GDP was low, Asian and European exporters suffered losses due to the strong correlation between the markets.

It is apparent that knowing about macro-level correlations can help us make better investment decisions.

Real World Use Case 5

Oil prices were very high during the Middle East uprising.

As a consequence, airline travel decreased and the tourism industry in the region was hit badly.

When correlation is modelled accurately and measured frequently, it can help us plan better for unforeseen scenarios.

Real World Use Case 6

The price of commodities such as precious metals is negatively correlated with interest rates.

When interest rates increase, commodity prices tend to decrease.

The measurement of correlation can help us cut the costs and increase the profits.

Real World Use Case 7

The famous investment theory of Harry Markowitz relies on calculating correlations to model the co-movements of the assets.

A number of correlation trading strategies (such as the Quanto Strategy) have been devised by traders.

Successful investors and analysts always attempt to analyse the correlations.

A large number of financial institutions rely on the concept of correlations.

We do not want to put all of our eggs in one basket, implying that we do not want to invest in all of those assets that co-move together in the same direction.
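To make the eggs-in-one-basket point concrete, here is a minimal sketch using the standard two-asset portfolio variance formula. The weights and volatilities are made-up numbers, and the helper name two_asset_volatility is my own, purely for illustration.

```python
import math

def two_asset_volatility(w1, vol1, vol2, rho):
    """Volatility of a two-asset portfolio with weight w1 in asset 1
    and correlation rho between the two assets' returns."""
    w2 = 1.0 - w1
    variance = (w1 * vol1) ** 2 + (w2 * vol2) ** 2 + 2 * w1 * w2 * rho * vol1 * vol2
    return math.sqrt(variance)

# Hypothetical assets: both have 20% volatility, held in equal weights.
for rho in (1.0, 0.5, 0.0, -0.5):
    vol = two_asset_volatility(0.5, 0.20, 0.20, rho)
    print(f"correlation {rho:+.1f} -> portfolio volatility {vol:.4f}")
```

The lower the correlation between the two assets, the lower the volatility of the combined portfolio, which is the whole idea behind diversification.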

Real World Use Case 8

Risk management relies on the exercise of finding the covariance between the assets to model how the assets move with each other.

A large number of hedging strategies are dependent on finding the correlations between the trade and the hedged position.

Special trades have been designed that model the correlation risk, such as correlation swaps and correlation options.

Real World Use Case 9

VaR is one of the key risk management tools that helps us find the maximum loss over a holding period for a given confidence level.

VaR can be calculated using the Delta-Normal approach.

The Delta-normal approach is also known as the variance-covariance approach as it relies on finding the variance-covariance of the assets.

Usually a covariance or correlation matrix is fed into the calculation.
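As a rough sketch of how a correlation matrix feeds into a Delta-normal VaR, the snippet below assumes normally distributed returns with zero mean and square-root-of-time scaling. The function name delta_normal_var, the portfolio weights, the volatilities and the correlation values are all hypothetical and only there for illustration.

```python
import numpy as np
from statistics import NormalDist

def delta_normal_var(weights, vols, corr, portfolio_value,
                     confidence=0.99, horizon_days=1):
    """Delta-normal VaR from daily volatilities and a correlation matrix.
    Assumes zero-mean, normally distributed returns."""
    weights = np.asarray(weights, dtype=float)
    vols = np.asarray(vols, dtype=float)
    corr = np.asarray(corr, dtype=float)
    # Covariance matrix: cov_ij = vol_i * vol_j * corr_ij
    cov = np.outer(vols, vols) * corr
    # Standard deviation of the daily portfolio return
    portfolio_vol = np.sqrt(weights @ cov @ weights)
    # One-sided standard normal quantile for the confidence level
    z = NormalDist().inv_cdf(confidence)
    return portfolio_value * z * portfolio_vol * np.sqrt(horizon_days)

# Made-up two-asset portfolio worth 1,000,000
weights = [0.6, 0.4]
vols = [0.02, 0.01]  # daily return volatilities
low_corr = [[1.0, 0.2], [0.2, 1.0]]
high_corr = [[1.0, 0.9], [0.9, 1.0]]

var_low = delta_normal_var(weights, vols, low_corr, 1_000_000)
var_high = delta_normal_var(weights, vols, high_corr, 1_000_000)
print(f"1-day 99% VaR with correlation 0.2: {var_low:,.0f}")
print(f"1-day 99% VaR with correlation 0.9: {var_high:,.0f}")
```

With the same weights and volatilities, the only thing that changes between the two runs is the correlation, and the higher correlation produces the larger VaR.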

The core of capturing risk in markets is dependent on finding accurate correlations.

Real World Use Case 10

Lastly, I am going to touch upon an important use case.

Bonds, interest rates, credit spreads, stock prices and their returns are all assumed to eventually revert back to their mean value.

All of these variables are known as mean-reverting variables.

Sometimes the variables are correlated to their past values.

Here, the correlation (auto-correlation) measures how strongly correlated the current and past values are to each other.
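As a small illustration, auto-correlation at a given lag can be estimated as the Pearson correlation between a series and a shifted copy of itself. The simulated AR(1) series below is an assumption of mine (a simple mean-reverting process), not real market data, and the helper name autocorrelation is hypothetical.

```python
import numpy as np

def autocorrelation(series, lag=1):
    """Pearson correlation between a series and itself shifted by `lag` (lag >= 1)."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

# Simulated mean-reverting AR(1) series: x_t = phi * x_{t-1} + noise_t
rng = np.random.default_rng(0)
phi = 0.8
noise = rng.normal(size=5000)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = phi * x[t - 1] + noise[t]

print("lag-1 autocorrelation of the AR(1) series:", autocorrelation(x, lag=1))
print("lag-1 autocorrelation of pure noise:", autocorrelation(noise, lag=1))
```

The first number should land close to phi, while the noise series, which has no memory of its past values, should give a value close to zero.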

A number of models such as ARCH and GARCH have been developed to model autocorrelation, specifically in the variance (volatility) of a series.

These models specialise in finding auto-correlations and have been used extensively in the data science world.

If A Successful Data Science Project Is Required To Be Implemented Then One Simply Cannot Ignore Correlations

Now that we understand how important it is to measure correlation, let’s have a look at different techniques which can help us calculate correlation coefficients.

I am going to focus on three popular correlation measures:

1. Pearson correlation measure
2. Spearman rank correlation measure
3. Kendall correlation measure

I will be explaining how to calculate each of them and what their limitations are.


1. Pearson Correlation

Pearson correlation measures the linear relationship between the variables.

It assumes that the variables are normally distributed.

The Pearson correlation is calculated by dividing the covariance of the two variables by the product of their standard deviations.

Covariance measures how the two variables move with each other over time.

As we divide the covariance by the standard deviations, we make the Pearson correlation unit-less and hence it is always between the values -1 and 1.
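The calculation described above can be sketched by hand in a few lines. The data values are made up, and the helper name pearson is my own; NumPy's np.corrcoef is used only as a cross-check.

```python
import numpy as np

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Sample covariance (divides by n - 1)
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
    # Sample standard deviations, using the same n - 1 convention
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

print(pearson(x, y))
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in, for comparison
```

Both lines should print the same value, and because the units of the covariance cancel against the units of the standard deviations, the result always sits between -1 and 1.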

The biggest limitation of Pearson correlation is that it assumes that the variables have a linear relationship between them.

Many variables do not have linear relationships.

For instance, financial assets often have a non-linear relationship between them.

When the value of Pearson correlation is 0, it means that there is no linear relationship between the two variables.

However, there could be a non-linear relationship between the variables.

Hence the value 0 does not imply that the two variables are completely independent of each other.
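A quick way to convince yourself of this is a toy example where one variable is completely determined by the other, yet the Pearson correlation is essentially zero. The data below is constructed for illustration only.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # values spread symmetrically around zero
y = x ** 2                       # y is fully determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation between x and x squared: {r:.6f}")
```

Here y depends entirely on x, but because the relationship is a symmetric curve rather than a line, the Pearson correlation comes out at (numerically) zero.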

The variance of the variables is expected to be finite.

This is not always the case, for instance when the distribution is a Student-t with two or fewer degrees of freedom.

The Pearson correlation changes once we apply a non-linear transformation to the data.

Often, in data science projects, we take the log of a variable to make its relationship with other variables more linear.

The side effect is that the Pearson correlation will also change.
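Here is a small sketch of that side effect. The data is synthetic: I assume a variable y that grows exponentially with x, so taking its log makes the relationship perfectly linear.

```python
import numpy as np

x = np.linspace(1.0, 10.0, 1000)
y = np.exp(x)  # y grows exponentially with x

r_raw = np.corrcoef(x, y)[0, 1]          # Pearson on the raw values
r_log = np.corrcoef(x, np.log(y))[0, 1]  # Pearson after a log transform

print(f"Pearson on raw data:      {r_raw:.4f}")
print(f"Pearson after log(y):     {r_log:.4f}")
```

The ordering of the data points has not changed at all, but the Pearson correlation moves from well below 1 on the raw data to 1 after the transformation.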

To compute the Pearson correlation in Python:

    import scipy.stats
    scipy.stats.pearsonr(variable1, variable2)

variable1 and variable2 can be arrays.

2. Spearman Ranking Correlation

Sometimes the elements in our data sets have orders.

This is particularly common in time series data.

In those instances, we can calculate the Spearman ranking correlation measure to find the relationship of the ranked variables.

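There are three steps to calculate the Spearman rank correlation. If there are two variables X and Y:

1. Order the pairs of variables X and Y with respect to X.
2. Determine the ranks for each time period i.
3. Compute the difference of the ranks and square the difference.

The squared differences are then summed and, when there are no tied ranks, the coefficient is 1 - 6 Σ d_i² / (n (n² - 1)), where d_i is the rank difference for period i and n is the number of observations.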

The correlation will be 1 for perfectly positively correlated variables, -1 for perfectly negatively correlated variables, and 0 when there is no correlation between the variables.

The variables are not required to be normally distributed.

We can compute the Spearman ranking correlation in Python:

    import scipy.stats
    scipy.stats.spearmanr(variable1, variable2)

variable1 and variable2 can be arrays.
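To see what happens under the hood, the three steps listed earlier can be sketched by hand. This version assumes there are no tied values, uses the usual shortcut formula 1 - 6 Σ d² / (n (n² - 1)), and the helper name spearman_no_ties and the data are made up for illustration.

```python
import numpy as np

def spearman_no_ties(x, y):
    """Spearman rank correlation for data without tied values."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    # Step 1: order the (x, y) pairs with respect to x
    order = np.argsort(x)
    y_sorted = y[order]
    # Step 2: determine the ranks; after sorting, x simply has ranks 1..n
    rank_x = np.arange(1, n + 1)
    rank_y = np.argsort(np.argsort(y_sorted)) + 1
    # Step 3: difference of the ranks, squared
    d_squared = (rank_x - rank_y) ** 2
    # Sum the squared differences and apply the no-ties formula
    return 1 - 6 * d_squared.sum() / (n * (n ** 2 - 1))

x = [30, 10, 50, 20, 40]
y = [2, 1, 4, 3, 5]
print(spearman_no_ties(x, y))  # 0.8 for this data
```

On data without ties, the scipy function shown above should return the same coefficient, along with a p-value.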

3. Kendall Correlation Measure

The last important correlation measure is the Kendall correlation measure, also known as Kendall Tau.

It is a nonparametric measure that does not require any assumptions regarding the joint probability distributions of variables.

Kendall Tau measures the correspondence between the two rankings.
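One way to picture this correspondence is to count pairs of observations that are ranked in the same order by both variables (concordant) versus pairs ranked in opposite order (discordant). The sketch below implements the simplest variant, often called tau-a, and assumes there are no ties; the helper name kendall_tau_a and the data are my own.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables order this pair the same way
        elif direction < 0:
            discordant += 1   # the variables disagree on the order
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
print(kendall_tau_a(x, y))  # 8 concordant and 2 discordant pairs out of 10
```

Note that scipy's kendalltau defaults to the tau-b variant, which adjusts for ties; on tie-free data like this it should agree with the hand-rolled version.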

We can implement Kendall Tau in Python:

    import scipy.stats
    scipy.stats.kendalltau(variable1, variable2)

Pandas Is Great

If you load your data into a Pandas dataframe then you can call a ready-made function in Pandas that can calculate the correlation between every pair of variables for you.
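As a minimal sketch of what that looks like, here is a tiny dataframe with made-up columns a, b and c, and the resulting correlation matrices for two of the methods.

```python
import pandas as pd

# Hypothetical data, just for illustration
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly linear in a
    "c": [5, 3, 4, 1, 2],   # mostly decreasing as a increases
})

pearson_matrix = df.corr(method="pearson")
spearman_matrix = df.corr(method="spearman")

print(pearson_matrix)
print(spearman_matrix)
```

Each result is a square dataframe with one row and one column per variable, so the correlation between a and c can be read off with spearman_matrix.loc["a", "c"].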

    df.corr(method=method)

Here df is the dataframe holding your data. The parameter method can be one of {‘pearson’, ‘kendall’, ‘spearman’}.

If you want to explore Pandas then read my article: Did You Know Pandas Can Do So Much?

Summary

This article explained what correlations are, how important they are and the significant role they play.

Finally, it explained how we can compute them in Python.

Although correlation analysis is under-rated, we can see how important it is to measure correlation and use it wisely in our data science projects.

Hope it helps.

