Why Sample Variance is Divided by n-1

Explaining high school statistics that your teachers didn’t teach

Eden Au, Feb 20

If you are reading this article, I assume you have encountered the formula for sample variance and have a rough idea of what it represents.

But why the denominator is (n-1) rather than n often remains a mystery.

Here’s why.

Originally published at edenau.github.io.

Terminology

Population: a set that contains ALL members of a group.

Sample: a set that contains some members of a population (technically a multi-subset of a population, since a member may be drawn more than once).

Independent and identically distributed (i.i.d.) random variables: an assumption that all samples (a) are mutually independent, and (b) have the same probability distribution.

Central limit theorem: the sampling distribution of the mean of i.i.d. random variables with finite variance tends toward a normal (Gaussian) distribution when the sample size is large enough.

Expected value: the long-run average value over repetitions of the same experiment.

Unbiased estimator: an estimator whose expected value equals the true value of the parameter being estimated. In other words, the distribution of an unbiased estimator is centred at the correct value.

Settings

Given a large Gaussian population with an unknown population mean μ and population variance σ², we draw n i.i.d. samples from it, such that each sample x_i in the sample set X satisfies

E[x_i] = μ,   Var(x_i) = E[(x_i - μ)²] = σ².

While the expected value of x_i is μ, the expected value of x_i² is larger than μ²:

E[x_i²] = μ² + σ².

This is because squaring is a non-linear (convex) mapping, so the spread of the samples around μ pushes the average of the squares above the square of the average. For instance, the set (1,2,3,4,5) has mean 3 and variance 2. Squaring every element gives (1,4,9,16,25), whose mean is 11 = 3² + 2.

We need this property at a later stage.
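The (1,2,3,4,5) example can be checked in a few lines. This is only a quick sanity check using Python's standard `statistics` module; the variable names are mine, not part of the original article.

```python
from statistics import fmean, pvariance

values = [1, 2, 3, 4, 5]
squares = [v * v for v in values]      # (1, 4, 9, 16, 25)

mean = fmean(values)                   # mean of the set: 3
variance = pvariance(values)           # variance with denominator n: 2
mean_of_squares = fmean(squares)       # mean of the squared set: 11

# The property used later: E[x^2] = mu^2 + sigma^2
print(mean_of_squares, mean ** 2 + variance)
```

Note that `pvariance` divides by n, which is what "variance 2" refers to here.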

Estimators

Since we do not know the true population properties, we try our best to define estimators of those properties from the sample set using a similar construction. Let’s put a hat (^) on μ and σ² and call them the ‘pseudo-’ mean and variance, defined as follows (with i running from 1 to n):

^μ = (1/n) Σ_i x_i,   ^σ² = (1/n) Σ_i (x_i - ^μ)².

The definitions are a bit arbitrary. You can, in theory, define them in much fancier ways and test them, but let’s try the most straightforward ones.

We define the pseudo-mean ^μ as the average of all samples in X. It feels like this is the best that we can do. A quick check shows that the pseudo-mean is an unbiased estimator of the population mean:

E[^μ] = (1/n) Σ_i E[x_i] = (1/n)(nμ) = μ.

Easy.

However, the true sample variance, (1/n) Σ_i (x_i - μ)², depends on the population mean μ, which is unknown. We therefore substitute the pseudo-mean ^μ for it, as in the definition above, so that the pseudo-variance depends on the pseudo-mean instead.
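In code, the two estimators described above might look like the sketch below. The function names `pseudo_mean` and `pseudo_variance` are hypothetical, chosen only to mirror the article's wording.

```python
def pseudo_mean(samples):
    """Average of all samples: the estimator of the population mean."""
    return sum(samples) / len(samples)


def pseudo_variance(samples):
    """Mean squared deviation from the pseudo-mean (denominator n, not n-1)."""
    m = pseudo_mean(samples)
    return sum((x - m) ** 2 for x in samples) / len(samples)


print(pseudo_mean([1, 2, 3, 4, 5]), pseudo_variance([1, 2, 3, 4, 5]))
```

The population mean μ never appears in either function, since only the samples are available.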

1. Degrees of Freedom

Assume we have a fair die, but no one knows it is fair except Jason. He knows the population mean μ (3.5 pts). Poor William begs him for this statistical property, but Jason won’t budge. William has to estimate it by sampling, i.e. rolling the die as many times as he can. He gets tired after rolling it three times; he got 1 and 3 pts in the first two trials.

Given the true population mean μ (3.5 pts), you would still have no idea what the third roll was. However, if you knew the sample mean ^μ was 3.33 pts, you would be certain that the third roll was 6, since (1+3+6)/3 = 3.33. Quick maths.

In other words, the sample mean encapsulates exactly one piece of information about the sample set, while the population mean does not. Thus, once the sample mean is fixed, the sample set has one fewer degree of freedom.

This is the reason we are usually given, but it is not a robust and complete proof of why we have to replace the denominator with (n-1).
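To make the "one piece of information" point concrete, here is a tiny sketch that recovers the hidden roll from the sample mean. The helper `missing_value` is a made-up name for illustration.

```python
def missing_value(known_values, sample_mean, n):
    """Recover the one unknown sample from the other n-1 and the sample mean."""
    return n * sample_mean - sum(known_values)


# William's first two rolls were 1 and 3; the sample mean of three rolls is 10/3.
third_roll = missing_value([1, 3], 10 / 3, 3)
print(round(third_roll))  # 6
```

Knowing the population mean 3.5 instead would not pin down the third roll at all.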

2. Source of Bias

Let’s use the same die example. Jason knows the true mean μ, so he can calculate the true sample variance of the three rolls about the population mean (3.5 pts) and gets 4.25 pts². (This is not the population variance of a fair die, which is 35/12 ≈ 2.92 pts²; three rolls are too few to land close to it.) William has to use the pseudo-mean ^μ (3.33 pts in this case) to calculate the pseudo-variance (the variance estimator we defined), which is 4.22 pts².
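Both numbers are easy to reproduce. The snippet below is a sketch with variable names of my own choosing; it computes the mean squared deviation of the rolls (1, 3, 6) about the population mean and about the sample mean.

```python
rolls = [1, 3, 6]
n = len(rolls)
mu = 3.5                      # population mean, known only to Jason
sample_mean = sum(rolls) / n  # about 3.33, what William has to use

# Jason: deviations measured from the true population mean
true_sample_var = sum((x - mu) ** 2 for x in rolls) / n

# William: deviations measured from the sample mean
pseudo_var = sum((x - sample_mean) ** 2 for x in rolls) / n

print(round(true_sample_var, 2), round(pseudo_var, 2))  # 4.25 4.22
```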

In fact, the pseudo-variance never exceeds the true sample variance, and it underestimates it unless the sample mean coincides with the population mean. This is because the pseudo-mean is the minimizer of the function

f(m) = (1/n) Σ_i (x_i - m)²,

whose value at m = ^μ is the pseudo-variance and whose value at m = μ is the true sample variance. You can check this statement by the first derivative test, or by inspection based on the convexity of the function.

This suggests that using the pseudo-mean introduces bias. However, it does not tell us the size of the bias.
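For readers who want the first derivative test spelled out, here is one way to write it. The function name f and the variable m are my own notation for the mean squared deviation about an arbitrary centre; the hats denote the pseudo-mean and pseudo-variance defined earlier.

```latex
\[
f(m) = \frac{1}{n}\sum_{i=1}^{n}(x_i - m)^2, \qquad
f'(m) = -\frac{2}{n}\sum_{i=1}^{n}(x_i - m) = -2(\hat{\mu} - m), \qquad
f''(m) = 2 > 0 .
\]
% f'(m) = 0 only at m = \hat{\mu}, and f is convex, so \hat{\mu} is the unique minimizer.
\[
f(\mu) - f(\hat{\mu}) = (\hat{\mu} - \mu)^2 \ge 0 ,
\qquad \text{i.e.} \qquad
\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 = \hat{\sigma}^2 + (\hat{\mu} - \mu)^2 .
\]
```

With the rolls (1, 3, 6), the gap is (3.33 - 3.5)² ≈ 0.03 pts², which matches 4.25 - 4.22. This is the gap for one particular sample only; it is not yet the expected bias.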

3. Bessel’s Correction

Our sole goal is to investigate how biased the variance estimator ^σ² is. We expect the pseudo-variance to be a biased estimator, since it never exceeds the true sample variance, as mentioned earlier. By checking the expected value of our pseudo-variance, we discover that:

E[^σ²] = E[(1/n) Σ_i (x_i - ^μ)²] = E[(1/n) Σ_i x_i² - ^μ²] = (1/n) Σ_i E[x_i²] - E[^μ²].

One step at a time. The expected value of x_j x_k depends on whether you are looking at two different (independent) samples, where j≠k, or the same (definitely dependent in this case!) sample, where j=k:

E[x_j x_k] = E[x_j] E[x_k] = μ² if j≠k,   E[x_j x_k] = E[x_j²] = μ² + σ² if j=k.

Since we have n samples, the probability that two indices j and k picked at random coincide is 1/n: of the n² pairs (j,k), exactly n have j=k. Therefore, averaged over all pairs,

(1/n²) Σ_j Σ_k E[x_j x_k] = (1/n)(μ² + σ²) + ((n-1)/n) μ² = μ² + σ²/n.

Remember the expected value of x_i² mentioned at the start? It is E[x_i²] = μ² + σ². By expanding ^μ, we have

E[^μ²] = E[((1/n) Σ_j x_j)((1/n) Σ_k x_k)] = (1/n²) Σ_j Σ_k E[x_j x_k] = μ² + σ²/n.

Substitute these formulae back in, and we find that the expected value of the pseudo-variance is NOT the population variance, but (n-1)/n of it:

E[^σ²] = (μ² + σ²) - (μ² + σ²/n) = ((n-1)/n) σ².

Since the scaling factor is smaller than 1 for every finite positive n, this again shows that our pseudo-variance underestimates the true population variance on average.
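The (n-1)/n factor can be verified exactly for the fair die by brute force: enumerate every equally likely sequence of n rolls and average the pseudo-variance. The sketch below uses exact fractions; `expected_pseudo_variance` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product


def expected_pseudo_variance(faces, n):
    """Exact expected pseudo-variance over all equally likely n-roll outcomes."""
    total = Fraction(0)
    count = 0
    for sample in product(faces, repeat=n):
        mean = Fraction(sum(sample), n)
        total += sum((x - mean) ** 2 for x in sample) / n
        count += 1
    return total / count


faces = range(1, 7)
mu = Fraction(sum(faces), 6)                       # 7/2
sigma2 = sum((f - mu) ** 2 for f in faces) / 6     # 35/12, the population variance

n = 3
print(expected_pseudo_variance(faces, n))          # 35/18
print(Fraction(n - 1, n) * sigma2)                 # 35/18
```

For three rolls, the pseudo-variance averages exactly 2/3 of the population variance, as the derivation predicts.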

To obtain an unbiased variance estimator, we simply apply Bessel’s correction, which rescales the estimator so that its expected value matches the true population variance:

s² = (n/(n-1)) ^σ² = (1/(n-1)) Σ_i (x_i - ^μ)²,   so that E[s²] = σ².

There you have it. We define s² in such a way that it is an unbiased sample variance.
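Python's standard library exposes both versions: `statistics.pvariance` divides by n and `statistics.variance` divides by n-1. A minimal sketch on William's three rolls, showing that the two differ by exactly the n/(n-1) factor:

```python
from statistics import pvariance, variance

rolls = [1, 3, 6]
n = len(rolls)

biased = pvariance(rolls)    # denominator n: the pseudo-variance, about 4.22
unbiased = variance(rolls)   # denominator n-1: s², about 6.33

# Bessel's correction is just a rescaling by n/(n-1)
print(unbiased, biased * n / (n - 1))
```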

The (n-1) denominator arises from Bessel’s correction, which results from the 1/n probability of picking the same sample twice when two samples are drawn (with replacement) from the sample set.

As the number of samples increases to infinity (n→∞), the bias goes away, (n-1)/n→1, since the probability of picking the same sample in two trials tends to 0.
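A small simulation can illustrate the bias fading as n grows. This is a rough sketch, assuming a standard Gaussian population (μ = 0, σ² = 1) and an arbitrary seed and trial count of my choosing; `mean_pseudo_variance` is a hypothetical helper.

```python
import random


def mean_pseudo_variance(n, trials, rng, mu=0.0, sigma=1.0):
    """Average the pseudo-variance (denominator n) over many simulated samples."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(xs) / n
        total += sum((x - m) ** 2 for x in xs) / n
    return total / trials


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (2, 5, 20, 50):
    estimate = mean_pseudo_variance(n, 10_000, rng)
    print(f"n={n:3d}  simulated={estimate:.3f}  (n-1)/n={(n - 1) / n:.3f}")
```

Since σ² = 1 here, the simulated average should sit close to (n-1)/n, creeping toward 1 as n increases.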

Thank you for reading.