Statistics is the Grammar of Data Science — Part 4/5

Statistics refresher to kick-start your Data Science journey

Semi Koen · Feb 10

This is the 4th article of the ‘Statistics is the Grammar of Data Science’ series, covering the important topics of Covariance and Correlation.
Revision

Bookmarks to the rest of the articles for easy access:

Article Series
Part 1: Data Types | Measures of Central Tendency | Measures of Variability
Part 2: Data Distributions
Part 3: Measures of Location | Moments
Part 4: Covariance | Correlation (this article)
Part 5: Conditional Probability | Bayes’ Theorem

Introduction

To lay the basis for this article, we will assume we have a scatterplot where each data point represents a person: their professional experience in years on one axis versus their income on the other.
‘Professional experience vs. Income’ scatterplots

If the scatterplot looks like the one on the right, we deduce that there is no real relationship between experience and income, i.e. for any given experience there can be a range of incomes. On the contrary, there is a clear linear relationship between these attributes in the left diagram.
Covariance and correlation give us a means of measuring just how tightly the attributes of a dataset depend on each other.
N.B.: In this example, the type of data described is bivariate — ‘bi’ for two variables.
In reality, statisticians use multivariate data, meaning many variables.
Covariance

Covariance is a measure of association between two (or more) random variables.
As the name ‘co + variance’ implies, it is like the variance, but applied to a comparison of two variables — in place of the sum of squares, we have a sum of cross-products.
While Variance tells us how a single variable varies around its mean, Covariance tells us how two variables vary with each other. As such, it’s fair to say: Covariance measures the Variance between two variables.
Covariance can be negative or positive (or zero, obviously): a positive value means that the two variables tend to vary in the same direction (i.e. if one increases, then the other one increases too), a negative value means that they vary in opposite directions (i.e. if one increases, then the other one decreases), and zero means that they don’t vary together.
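A minimal illustration of the sign, using made-up series (the exact numbers don’t matter, only the direction in which they move):

```python
import numpy as np

x = np.arange(10.0)

# x and 2x move in the same direction: positive Covariance.
print(np.cov(x, 2 * x)[0, 1] > 0)   # → True

# x and -2x move in opposite directions: negative Covariance.
print(np.cov(x, -2 * x)[0, 1] < 0)  # → True
```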
Formula

The formula might be hard to interpret, but it is more important to understand what it means. Covariance between variables X and Y:

Cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

(for a sample; use n for the full population). If we think of the dataset of a random variable as being represented as a vector, then in the previous example, we have two vectors: one for experience and one for income.
Here are the steps we need to follow:
#1. Convert these two vectors to vectors of deviations from the mean.
#2. Take the dot product of the two vectors (which is proportional to the cosine of the angle between them).
#3. Divide by the sample size (n or n − 1, as discussed before, based on whether it is the full population or a sample).
In the 2nd step, we effectively measure the angle between these two vectors, so if they point in nearly the same direction, it means that these variables are tightly coupled.
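The steps above can be sketched in a few lines of NumPy (the experience/income numbers are made up for illustration):

```python
import numpy as np

# Hypothetical bivariate data: years of experience vs. income (in thousands).
experience = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 10.0])
income = np.array([35.0, 48.0, 55.0, 70.0, 82.0, 95.0])

# Step 1: convert each vector to deviations from its mean.
dev_x = experience - experience.mean()
dev_y = income - income.mean()

# Step 2: take the dot product (the sum of cross-products).
cross = np.dot(dev_x, dev_y)

# Step 3: divide by n - 1 (sample covariance).
cov_xy = cross / (len(experience) - 1)

# Cross-check against NumPy's built-in (ddof=1 gives the sample covariance).
print(cov_xy, np.cov(experience, income, ddof=1)[0, 1])
```

Both numbers agree, and the positive sign confirms that experience and income move in the same direction in this toy dataset.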
Main Limitation

It is important to note that while the Covariance does measure the directional relationship between two variables, it does not show the strength of the relationship between them.
In practice, the biggest problem with this metric is that it depends on the units used.
For example, if we were to convert the years of experience into months of experience, then the Covariance would be 12 times larger! This is where Correlation comes in!
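A quick check of this unit dependence (same made-up numbers as before): rescaling one variable rescales the Covariance by the same factor.

```python
import numpy as np

# Hypothetical data: experience in years vs. income (in thousands).
experience_years = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 10.0])
income = np.array([35.0, 48.0, 55.0, 70.0, 82.0, 95.0])

cov_years = np.cov(experience_years, income, ddof=1)[0, 1]
cov_months = np.cov(experience_years * 12, income, ddof=1)[0, 1]

# Cov(aX, Y) = a * Cov(X, Y): the ratio is 12 (up to floating-point rounding).
print(cov_months / cov_years)
```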
Correlation

The Correlation is one of the most common metrics in Statistics that describes the degree of relationship between two random variables.
It is considered to be the normalised version of the Covariance.
Let’s see why…

Formula

The Correlation (represented by the Greek letter ρ — rho) can be expressed using this formula. Correlation between variables X and Y:

ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)

The Correlation is bounded between -1 and 1.
Like the Covariance, the sign of the Correlation indicates the direction of the relationship: positive means that random variables move together, negative means that random variables move in different directions.
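As a sketch, the normalisation can be checked against NumPy’s built-in (the experience/income numbers are, again, made up for illustration):

```python
import numpy as np

# Hypothetical data: experience in years vs. income (in thousands).
experience = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 10.0])
income = np.array([35.0, 48.0, 55.0, 70.0, 82.0, 95.0])

# rho = Cov(X, Y) / (sigma_X * sigma_Y), using sample statistics (ddof=1).
rho = np.cov(experience, income, ddof=1)[0, 1] / (
    experience.std(ddof=1) * income.std(ddof=1)
)

# np.corrcoef computes the same quantity directly.
print(rho, np.corrcoef(experience, income)[0, 1])
```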
The endpoints (i.e. 1 and -1) indicate that there is a perfect relationship between the two variables.
For instance, the relationship between metres and centimetres is always that 1m corresponds to 100cm.
If we plot this relationship it will be a perfect line, and therefore the Correlation is 1.
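A tiny sketch of this (any metre values will do):

```python
import numpy as np

metres = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
centimetres = metres * 100  # 1m is always 100cm

# A perfect linear relationship, so the Correlation is 1.
print(np.corrcoef(metres, centimetres)[0, 1])
```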
Please note that a perfect relationship is pretty rare in real life data, since two random variables don’t usually map to each other by a constant factor.
A Correlation of 0 means that there is no linear relationship between the two variables.
There might still be a non-linear relationship, e.g. y = x².
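This caveat is easy to demonstrate with a perfect but non-linear relationship on a symmetric range:

```python
import numpy as np

# y is fully determined by x, but the relationship is a parabola.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# The linear Correlation is 0, despite the perfect dependence.
print(np.corrcoef(x, y)[0, 1])  # → 0.0
```

The positive and negative halves of the parabola cancel out exactly, so the sum of cross-products, and with it the Correlation, is zero.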
Key Characteristics

The Correlation does not only indicate the direction of the relationship but also its strength (depending on how big the absolute value is), as it is unitless: since we divide the Covariance by the product of the two Standard Deviations, the units cancel out.
Finally, we need to remember that ‘Correlation does not imply Causation’: a high correlation between two random variables just means that they are associated with each other, but their relationship is not necessarily causal in nature.
The only way to prove causation is with controlled experiments, where we eliminate outside variables and isolate the effects of the two variables in question.
All done! We have learnt how to use Covariance and Correlation to measure whether two different attributes in our dataset are linearly related to each other, and why Correlation is usually preferred, since it is unitless.

Thanks for reading! Part 5 is coming soon…

I regularly write about Technology & Data on Medium — if you would like to read my future posts then please ‘Follow’ me!