The Central Limit Theorem and its Implications

The central limit theorem goes something like this, phrased in statistics-speak: the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution is.

Let’s phrase it in plain English (sorry, statisticians): if you sample batches of data from any distribution and take the mean of each batch, then the distribution of those means is going to resemble a Gaussian distribution.

(The same goes for taking the sum.) I don’t know if that definition is any simpler to understand (for me it is), but let’s make it more tangible through an example.

Let's take a distribution other than the Gaussian, a gamma distribution for example.

The gamma distribution has a skewed density with a long right tail. Although it is quite obvious how the data looks when we sample from it, let’s make it concrete.

When we sample data from the gamma distribution and make a histogram plot, the histogram logically resembles the density function of the gamma distribution that we sampled from.
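A quick sketch of this in Python (NumPy); the shape and scale parameters here are assumptions for illustration, not values from the text:

```python
import numpy as np

# Draw many samples from a gamma distribution (assumed shape=2, scale=2).
rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=2.0, size=10_000)

# A histogram of these raw samples mirrors the gamma density itself:
# skewed, with a long right tail. For these parameters the population
# mean is 4.0 and the population std is sqrt(8) ≈ 2.83.
counts, bin_edges = np.histogram(samples, bins=50)
print(f"sample mean ≈ {samples.mean():.2f}, sample std ≈ {samples.std():.2f}")
```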

Now, let’s do what we stated in the central limit theorem.

Let’s sample 2000 batches of size 30, for example, from the gamma distribution, and take the mean of each batch to see what happens.

Surprise surprise, this actually looks like a normal (Gaussian) distribution.
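The experiment above can be reproduced in a few lines (again with assumed gamma parameters):

```python
import numpy as np

rng = np.random.default_rng(42)

# 2000 batches of size 30 from a gamma distribution (assumed shape=2, scale=2);
# the distribution of the batch means should look Gaussian.
batch_means = rng.gamma(shape=2.0, scale=2.0, size=(2000, 30)).mean(axis=1)

# CLT prediction: the means center on the population mean (4.0) with
# std ≈ population std / sqrt(30) = sqrt(8/30) ≈ 0.52.
print(f"mean of means ≈ {batch_means.mean():.3f}, std of means ≈ {batch_means.std():.3f}")
```

Plotting a histogram of `batch_means` gives the bell shape described in the text.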

So what if we vary the batch size of the samples? Let’s try sampling with a really small batch size, say 2. Logically, we are back to looking more like the gamma distribution.

What happens if we use a really big batch size, say 1000? We are back to looking normal! So, that is kind of awesome.
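One way to quantify "looking more like the gamma" versus "looking normal" is the skewness of the batch means, which should shrink toward 0 (the Gaussian value) as the batch size grows. A sketch, with the same assumed gamma parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare the skewness of batch means for a tiny batch (2) and a large one (1000).
skewness = {}
for n in (2, 1000):
    means = rng.gamma(shape=2.0, scale=2.0, size=(2000, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    skewness[n] = (z ** 3).mean()  # simple moment-based skewness estimate
print(skewness)
```

For batch size 2 the skewness stays clearly positive (gamma-like); for batch size 1000 it is close to zero (Gaussian-like).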

Now, just for fun, let us take the uniform distribution in order to be fully convinced.

The uniform distribution’s histogram plot is quite noisy and flat. Now let’s do the same thing we did for the gamma distribution: take the means of samples and plot the resulting distribution. Once again, it looks like a Gaussian! Hopefully, if you were not convinced until now, this was able to convince you.
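The same demo for the uniform distribution, using the same (assumed) batch count and size as before:

```python
import numpy as np

rng = np.random.default_rng(1)

# Means of 2000 batches of size 30 drawn from Uniform(0, 1).
uniform_means = rng.uniform(0.0, 1.0, size=(2000, 30)).mean(axis=1)

# CLT prediction: mean ≈ 0.5, std ≈ sqrt(1/12) / sqrt(30) ≈ 0.053.
print(f"mean ≈ {uniform_means.mean():.3f}, std ≈ {uniform_means.std():.3f}")
```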

But now that we know this cool fact, how is it actually useful? Let’s turn to machine learning for a second.

The nice thing about the normal distribution is that only two parameters are needed to model it: the mean and the standard deviation.

With these two parameters, based on the central limit theorem, we can describe the distribution of means of samples drawn from arbitrarily complex probability distributions that don’t look anything like the normal distribution.

This kind of allows us to cheat: we can use the normal distribution ubiquitously in statistical inference, even though the underlying distributions don’t look anything like it.

The normal distribution also has a couple of nice properties that make it useful.

For example, just taking the logarithm of the density (otherwise known as the log-likelihood) simplifies many things in machine learning and leads naturally to Mean Squared Error, a loss function that is often used in regression tasks.

Actually, it is relatively trivial to show that using Mean Squared Error amounts to performing Maximum Likelihood Estimation.

Furthermore, you can show that applying L2 regularization amounts to Maximum A Posteriori estimation with a Gaussian prior, which is also quite trivial.
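The MSE-to-MLE connection above can be sketched in one line. Assume the targets are generated as $y_i = f_\theta(x_i) + \epsilon_i$ with Gaussian noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$; then the log-likelihood of the data is:

```latex
\log p(\mathbf{y} \mid \mathbf{x}, \theta)
  = \sum_{i=1}^{N} \log \mathcal{N}\!\left(y_i \mid f_\theta(x_i), \sigma^2\right)
  = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(y_i - f_\theta(x_i)\bigr)^2
    - \frac{N}{2} \log\!\left(2\pi\sigma^2\right)
```

Maximizing this in $\theta$ is exactly minimizing the sum of squared errors, since the second term does not depend on $\theta$. Similarly, adding a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ contributes an extra $-\lVert\theta\rVert^2 / (2\tau^2)$ term to the log-posterior, which is precisely L2 regularization.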

Going back to the Central Limit Theorem, think about Deep Learning and Stochastic Gradient Descent for a minute.

How is Stochastic Gradient Descent performed? We mostly take a batch sampled from our training set and calculate the mean or sum of the loss over that batch.

So how is this loss going to be distributed? Well, the Central Limit Theorem tells us exactly this: if the batch size is large enough, the resulting distribution of the loss estimates is going to be Gaussian! As a next step, we are going to talk about confidence intervals and how they play a role in statistical inference.
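A toy simulation of this idea; the per-example losses here are fabricated (an exponential distribution standing in for a skewed real loss landscape), so treat it as a sketch rather than a real training run:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical per-example losses over a "training set" of 100k examples,
# simulated as heavily skewed exponential values (assumption).
per_example_loss = rng.exponential(scale=1.0, size=100_000)

# Mini-batch loss = mean loss over a sampled batch, as in SGD.
batch_size = 256
batches = rng.choice(per_example_loss, size=(2000, batch_size))
batch_losses = batches.mean(axis=1)

# CLT: batch losses concentrate around the mean per-example loss (1.0),
# with std ≈ population std / sqrt(batch_size) = 1 / 16 ≈ 0.0625.
print(f"mean ≈ {batch_losses.mean():.3f}, std ≈ {batch_losses.std():.4f}")
```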

Confidence Intervals

Looking at it from a more statistical perspective, the Gaussian distribution is used for hypothesis testing with confidence intervals, or to assess the statistical significance of experiment results.

So what exactly are confidence intervals? Let’s shoot straight for an example.

Let us imagine that we want to estimate the mean of how much we sleep on average.

To do this, we record our sleeping time over 200 days and take the mean.

This is called a point estimate of the mean.

Now we ask ourselves: what does the distribution of the mean estimate look like? We already know that it looks like a Gaussian because of the Central Limit Theorem, but we also know something else.

We can calculate the standard deviation of the mean estimate (the standard error), namely: σ_mean = s / √n. So what we are looking at is the sample standard deviation divided by the square root of the sample size.

Notice that as the sample size goes to infinity, the standard deviation of the mean estimate goes to zero. This makes complete sense, since with infinite samples we have a perfect estimate of the mean.
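This shrinking behavior is easy to verify numerically; the gamma population below is an arbitrary stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)

# Standard error of the mean: s / sqrt(n). As n grows, it shrinks toward zero.
standard_errors = {}
for n in (100, 10_000, 1_000_000):
    sample = rng.gamma(shape=2.0, scale=2.0, size=n)  # assumed population
    standard_errors[n] = sample.std(ddof=1) / np.sqrt(n)
print(standard_errors)
```

Each 100x increase in sample size cuts the standard error by about a factor of 10.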

Now, how do we construct a confidence interval? A confidence interval is always bound to a probability: when we say a 95% confidence interval, for example, we mean that the values within this range “live” within 95% of the density.

Or stated differently, with 95% probability the sampled values are going to land within this range for the given distribution.

So, basically, our 95% confidence interval covers the central 95% of the Gaussian around the mean estimate. In general, to construct an X% confidence interval, statisticians mostly use something called Z-scores.

I won’t cover the exact method of calculation here.

But it amounts to calculating the Z-scores (which are standardized deviations of a specific value from the mean) and looking up the corresponding probability in a Z-table. Note that this also works in the other direction (calculating values for given confidence levels). Now we have our mean estimate and an X% confidence interval.
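Putting the sleep example together, a 95% interval uses the Gaussian Z-score 1.96; the sleep data below is simulated, since the text gives no real numbers:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical sleep durations: 200 nights, hours of sleep (simulated).
sleep_hours = rng.normal(loc=7.0, scale=1.0, size=200)

mean = sleep_hours.mean()
standard_error = sleep_hours.std(ddof=1) / np.sqrt(len(sleep_hours))

# 95% confidence interval from the Z-score of the 97.5th percentile (1.96).
z = 1.96
ci = (mean - z * standard_error, mean + z * standard_error)
print(f"mean = {mean:.2f} h, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```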

The correct question to ask at this point is: what does this confidence interval say about the true mean? Or, in other words, how far are we from the true mean?