Well, it’s simple: we use a theorem and make an assumption based on it.
*Enter* The Central Limit Theorem
In particular, the traditional methods use the implications of the Central Limit Theorem.
In layman’s terms, the central limit theorem states that if I repeatedly take samples from a population and take the mean of each sample, then, given large enough samples, all those means would follow an approximately normal distribution, regardless of the distribution the population itself follows.
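If you want to see the theorem at work, a small simulation helps. The sketch below is only an illustration: the exponential population, the sample size of 50, the 5,000 repetitions and the helper name `sample_means` are all my own assumptions, not part of the experiment in this post.

```python
import random
import statistics

def sample_means(population_draw, sample_size, num_samples, seed=0):
    """Draw num_samples samples of sample_size each and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(population_draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

# Hypothetical skewed population: exponential with mean 1 (clearly not normal).
means = sample_means(lambda rng: rng.expovariate(1.0), sample_size=50, num_samples=5000)

centre = statistics.fmean(means)   # should sit close to the population mean, 1.0
spread = statistics.stdev(means)   # should sit close to 1 / sqrt(50), about 0.141

# Share of sample means within one standard deviation of their centre;
# for a normal distribution this share is roughly 68%.
within_one_sd = sum(abs(m - centre) <= spread for m in means) / len(means)
print(centre, spread, within_one_sd)
```

Even though the population is heavily skewed, the sample means should bunch up symmetrically around the population mean, which is the behaviour the theorem describes.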
[Figure: Visual Representation of The Central Limit Theorem]
Statistical theory also states that any linear combination of independent, normally distributed variables is also normally distributed.
Using this information, we can assume that if I were to conduct my experiment over and over again (I’m not actually going to do it) and collect my test statistic from the two groups of visitors each time, then all those test statistics would follow an approximately normal distribution.
No heavy computational power required since I’m not actually doing the experiment over and over again.
I’m making an assumption about what kind of distribution I would see if I did.
Now, with this assumption, I can go ahead and compute my standardized t-statistic from my test statistic to get a t-score for my observed value under the null hypothesis.
Note: Now I’m sure some of you might be wondering why we won’t be computing a z-statistic to get the z-score, since that’s what we get when we standardize a normal distribution.
The reason is that to get the standard normal distribution, and hence the z-scores, we need the population standard deviation, which we also don’t know.
We only know the standard deviations of the two samples, Group A and Group B, that we collected.
So, in the absence of the population standard deviation, we work with the sample standard deviations and use an approximation of the normal distribution known as the t-distribution, whose shape depends on its degrees of freedom. Hence the t-scores.
Fun fact: the higher our degrees of freedom (which depend on our sample sizes), the closer the t-distribution gets to the standard normal distribution (asymptotically, of course).
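One rough way to check that fun fact is to compare the two density curves at the same point for growing degrees of freedom. This is just a sketch: the function name `t_pdf`, the comparison point of 2.0 and the chosen degrees of freedom are my own picks for illustration.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

normal_pdf = NormalDist().pdf  # standard normal density

# Gap between the t density and the normal density at x = 2.0.
gaps = []
for df in (2, 10, 60, 1000):
    gap = abs(t_pdf(2.0, df) - normal_pdf(2.0))
    gaps.append(gap)
    print(df, gap)
```

The printed gaps should shrink as the degrees of freedom go up, which is the "closer to the standard normal" behaviour in action.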
Below is the formula for calculating our t-statistic.
This little formula is derived under the null hypothesis that there is no difference in the true population means for each sample (for our example, that means that the mean time spent on our website by visitors in each group is the same).
That is, their difference is 0, which is exactly the assumption we’re trying to get our t-statistic under (our hypothetical reality, remember!).
[Figure: T-statistic under the null hypothesis]
We simply plug in the values we have and out pops our t-statistic.
X1 is the sample mean for my treatment group and X2 is the sample mean for my control group; the same scheme applies to the S’s, which are sample standard deviations, and the n’s, which are sample sizes.
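In code, plugging in the values might look like the sketch below. It assumes the common form of the two-sample t-statistic under the null hypothesis, (X1 - X2) / sqrt(S1²/n1 + S2²/n2); the function name and the example numbers are made up for illustration and are not results from the experiment.

```python
import math

def t_statistic(mean_1, mean_2, sd_1, sd_2, n_1, n_2):
    """Two-sample t-statistic under the null hypothesis of equal population means.

    mean_1, sd_1, n_1 describe the treatment group; mean_2, sd_2, n_2 the control group.
    Assumes the unpooled standard error sqrt(sd_1^2 / n_1 + sd_2^2 / n_2).
    """
    standard_error = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return (mean_1 - mean_2) / standard_error

# Hypothetical summary statistics (seconds spent on the website).
t_stat = t_statistic(mean_1=150.0, mean_2=120.0, sd_1=50.0, sd_2=50.0, n_1=31, n_2=31)
print(t_stat)
```

When both groups have the same size, as in this made-up example, the pooled-variance version of the formula gives the same value, so the choice between the two forms does not matter here.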
Once we have our t-statistic, all we have to do is see where it lies on the t-distribution curve with the appropriate degrees of freedom and calculate the probability of seeing a result at least as extreme as that, which will be our p-value.
The degrees of freedom are calculated simply from our sample sizes, in the following way.
[Figure: Degrees of Freedom Calculation]
Let’s assume that for our e-commerce website our t-statistic came out to be about 2.39 and our degrees of freedom were equal to 60, based on our sample sizes.
Well, from here, I don’t even need a calculator to compute my p-value.
I could use a simple t-distribution look up table.
[Figure: Look up table for the t-distribution]
According to this table, the area under the right tail of the curve below, which is the probability of seeing a difference in means at least as extreme as the one we observed during our experiment under the null hypothesis (that is, by chance), is about 1%.
In other words, our p-value is 0.01.
[Figure: Area under the curve for a one-sided t-test]
Now we can use this result to either reject the null hypothesis and conclude that our new page does help in increasing the time spent by each visitor on our website, or conclude that we saw our original result by chance.
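If you don’t have a look up table handy, the same right-tail area can be approximated numerically. The sketch below integrates the t-distribution’s density with Simpson’s rule using only the standard library; the function names and the number of integration steps are my own choices, and the inputs 2.39 and 60 are the values from the example above.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def right_tail_p_value(t_stat, df, steps=2000):
    """Approximate P(T >= t_stat) for t_stat >= 0.

    Integrates the density over [0, t_stat] with Simpson's rule (steps must be even)
    and subtracts the result from 0.5, the area to the right of zero.
    """
    h = t_stat / steps
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

p_value = right_tail_p_value(2.39, 60)
print(p_value)  # about 0.01, in line with the table
```

If SciPy is installed, `scipy.stats.t.sf(2.39, 60)` computes the same one-sided tail area directly.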
Usually in most research/academic areas you’ll see that getting a p-value under 5% means we can reject the null hypothesis, which is the case for our example.
Here’s a little stats meme to go with it, for ya.
Computational/Contemporary Approach
What if, after declaring the null hypothesis, there was a way for us to actually simulate the outcome of our experiment many, many times under the null hypothesis (to create our hypothetical reality, remember), so we wouldn’t have to rely on theorems, assumptions and approximations of the standard normal distribution to tell us what we would see if we did in fact repeat our experiment many, many times?
*Enter* Computers
Lucky for us, unlike in the old days, we have the computational power to resample the results of our experiments hundreds of thousands, even millions, of times in a matter of seconds.
That is something that was not so easy, or even feasible, to do manually, especially during the pre-computer era.
This means that using certain resampling techniques that work, we can simulate our hypothetical reality without relying on any theorems, assumptions of normality or having to consider our knowledge of the population standard deviation at all.
Going back to our example with our two groups, Group A (control group) and Group B (treatment group), we can use certain straightforward resampling techniques to simulate our data collection under the null hypothesis.
Let’s say we have our two groups as arrays, with Group A being an array of the times spent on our website by each visitor who was shown the old landing page.
Similarly, Group B is an array of the times spent on our website by each visitor who was shown the new landing page.
Let’s say both these arrays are of size 31.
Let’s also say that when we conducted the experiment, we found that the mean time spent on our website by visitors in Group B was about 30 seconds more than that of visitors in Group A.
In order to simulate data collection under the null hypothesis, we need to create the conditions under which there is no difference in the means of these two arrays.
Here’s how two resampling techniques make that possible.
Permutation Resample
1. Concatenate both arrays into one big array. For us this would have size 62, since both our arrays had size 31 each.
2. Shuffle that entire array, so that observations from each group are spread randomly throughout it instead of being separated in the middle.
3. Arbitrarily split the array in the middle: assign whatever observations ended up in the first 31 indices of the array to Group B and the rest to Group A.
4. Subtract the mean of this new Group A from the mean of the new Group B. This gives us one permutation test statistic.
Here’s a visual representation of a permutation resample:
[Figure: Assume observations 1, 2, 3, 4 originally belonged to Group A and observations 5, 6, 7 originally belonged to Group B]
By pooling both the samples and then shuffling them together, we’ve eliminated the distinction in their distribution.
We’ve created our hypothetical reality where they come from the same population without using any theorems or derived standardized statistics.
We repeat the aforementioned steps hundreds of thousands of times and calculate hundreds of thousands of permutation test statistics.
Then we simply see what proportion of all the permutation test statistics we calculated were at least as high as our original difference in means of 30 seconds.
Et voilà! That’s our p-value.
No approximations of the normal distribution or look up tables required.
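The permutation steps above can be sketched in a few lines of Python. Treat this as a minimal illustration rather than a finished tool: the function name, the default of 10,000 resamples (kept small so it runs quickly) and the randomly generated visit times are all assumptions of mine, not data from the experiment.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, num_resamples=10_000, seed=0):
    """One-sided permutation test for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)
    observed = fmean(group_b) - fmean(group_a)
    pooled = list(group_a) + list(group_b)  # concatenate both arrays
    n_b = len(group_b)
    at_least_as_high = 0
    for _ in range(num_resamples):
        rng.shuffle(pooled)                          # shuffle the pooled array in place
        new_b, new_a = pooled[:n_b], pooled[n_b:]    # arbitrary split
        if fmean(new_b) - fmean(new_a) >= observed:  # one permutation test statistic
            at_least_as_high += 1
    return at_least_as_high / num_resamples

# Hypothetical data: 31 visit times (in seconds) per group.
data_rng = random.Random(42)
group_a = [data_rng.gauss(120, 40) for _ in range(31)]  # old landing page
group_b = [data_rng.gauss(150, 40) for _ in range(31)]  # new landing page
print(permutation_p_value(group_a, group_b))
```

Raising `num_resamples` towards the hundreds of thousands mentioned above only changes the running time and the precision of the estimated p-value, not the logic.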
Bootstrap Resample
Bootstrap resampling is simply the process of resampling from a distribution with replacement.
This means that we might end up resampling one value multiple times, multiple values multiple times, or no value multiple times; we might even have samples where some values never show up.
It’s all good.
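To make "with replacement" concrete, here is a tiny sketch. The seven observation labels and the helper name `bootstrap_resample` are made up for illustration.

```python
import random

def bootstrap_resample(values, rng):
    """Draw a resample of the same size as values, with replacement."""
    return rng.choices(values, k=len(values))

rng = random.Random(0)
observations = [1, 2, 3, 4, 5, 6, 7]  # hypothetical observation labels
resample = bootstrap_resample(observations, rng)
print(resample)  # some labels may repeat, others may be missing entirely
```

Each run with a different seed gives a different mix of repeats and omissions, and that variation is exactly what the bootstrap relies on.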
Here’s a visual representation of what a bootstrap resample looks like:
Before we start taking our bootstrap resamples, we still have to create the conditions for our null hypothesis to hold, so that we can resample under them.
We can do that with the following steps:
1. Consider the array with the observations for Group B (the treatment group).
2. Subtract the mean of Group B from each element in that array.
3. Add the mean of Group A (the control group) to each element in that array. This is our new array with a shifted mean.
4. Take a bootstrap resample of the same size from this array.
5. Calculate the difference between the mean of this bootstrap resample and the mean of the original array with the observations for Group A (the control group). This gives us one bootstrap test statistic.
Here we simply took the observations for Group B, which had a higher mean time spent on our website, and shifted their mean to that of Group A.
Again, we’re creating our reality where both these arrays have the same mean, to collect the data under.
Then we take hundreds of thousands of bootstrap resamples from this shifted array to compute hundreds of thousands of bootstrap test statistics, and the rest is the same as what we did for the permutation resample.
We simply calculate the proportion of all the bootstrap statistics we collected that were at least as high as our original difference of 30 seconds to get our p-value.
Again, no approximations of the normal distribution or look up tables required.
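Here is a minimal sketch of the bootstrap procedure described above: shift Group B to have Group A’s mean, resample it with replacement, and compare each resample’s mean with Group A’s original mean. The function name, the default of 10,000 resamples and the generated visit times are my own assumptions for illustration.

```python
import random
from statistics import fmean

def bootstrap_p_value(group_a, group_b, num_resamples=10_000, seed=0):
    """One-sided bootstrap test for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)
    mean_a, mean_b = fmean(group_a), fmean(group_b)
    observed = mean_b - mean_a
    # Shift Group B so that its mean equals Group A's mean (the null hypothesis).
    shifted_b = [x - mean_b + mean_a for x in group_b]
    at_least_as_high = 0
    for _ in range(num_resamples):
        resample = rng.choices(shifted_b, k=len(shifted_b))  # with replacement
        if fmean(resample) - mean_a >= observed:             # one bootstrap test statistic
            at_least_as_high += 1
    return at_least_as_high / num_resamples

# Hypothetical data: 31 visit times (in seconds) per group.
data_rng = random.Random(42)
group_a = [data_rng.gauss(120, 40) for _ in range(31)]  # old landing page
group_b = [data_rng.gauss(150, 40) for _ in range(31)]  # new landing page
print(bootstrap_p_value(group_a, group_b))
```

Note that this version only resamples the shifted Group B, as in the steps above; other bootstrap variants resample both groups, which is a different design choice and not what is sketched here.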
Summary
Conclusion
There’s a ton of resources out there to help you master significance testing and all its related concepts.
That can also mean that sometimes it’s easy to get confused when navigating your way around different methods, tutorials and guides to decide which one you want to focus on.
Hopefully this post gave you a good idea about the essence of what a significance test entails and the different ways its goals are achieved across two arenas.
There are many related and important concepts in significance testing, such as Type I Error, Type II Error, Effect Size, Power, etc., that I did not cover but that you’ll find in whatever resources and applications you decide to focus on. Now you know the difference.
If there is anything that I missed, or something was inaccurate, or if you have absolutely any feedback, please let me know in the comments.
I would greatly appreciate it.