It can happen because we are sampling.
For example, if I picked 100 people at random from a large crowd of thousands and calculated their average height, I might get something like 5 feet 8 inches.
Then if I did it a few more times, I might get 5 feet 10 inches the next time and 5 feet 7 inches the time after that.
Because we are calculating our statistics using samples and not the entire population, every sample mean that we calculate will be different.
Knowing that sampling causes variation, we can reframe our question above into the following: if the new app design truly has zero effect on people’s savings, what is the probability of observing as large an increase in savings as we did from random chance? Stated formally, our null hypothesis would be: the increase in savings rates for the control group is equal to the increase in savings rates for the experimental group.
Our job is now to test the null hypothesis.
We can do so with a probability thought experiment.
Simulating the Experiment Over and Over Again

Imagine that we can easily and instantly run our experiment again and again.
Also, we are still in the parallel world where the new app design is a dud and has zero effect on users’ savings.
What would we observe? For the curious, here is how we simulate this:

1. Take 500 draws (there are 500 users in our control group and another 500 in our experimental group) each of two normally distributed random variables with the same statistical characteristics as our control group (mean = 12%, standard deviation = 5%). These will be our control and experimental groups (same mean because we are in the world where our new design has zero effect). It would be technically more correct to use Poisson distributed random variables here, but we use normally distributed ones for simplicity.
2. Record the difference in mean savings between the groups (i.e., subtract the mean savings rate of the control group from the mean savings rate of the experimental group).
3. Repeat steps 1 and 2 a total of 10,000 times.
4. Plot a histogram of the differences in mean savings between the groups.
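The steps above can be sketched in a few lines of NumPy. The seed and variable names are my own choices; the group size, repetition count, mean, and standard deviation come from the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, chosen here for reproducibility
n_users, n_sims = 500, 10_000
mean, sd = 0.12, 0.05  # control group's mean savings rate and standard deviation

# Draw both groups from the SAME distribution: we are in the world
# where the null hypothesis is true and the new design has zero effect
control = rng.normal(mean, sd, size=(n_sims, n_users))
experimental = rng.normal(mean, sd, size=(n_sims, n_users))

# Difference in sample means for each of the 10,000 simulated experiments
diffs = experimental.mean(axis=1) - control.mean(axis=1)

# Fraction of simulations showing a difference of 1% or more (one-tailed)
p_sim = (diffs >= 0.01).mean()
print(f"Simulated one-tailed p-value: {p_sim:.4f}")
```

Plotting a histogram of `diffs` reproduces the figure discussed below; the count of simulations at or above 1% will be in the single digits out of 10,000.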
When we do this, we get the histogram below.
The histogram shows how much the mean savings rate difference between groups varies due to random chance (driven by sampling).
The red vertical line shows the mean savings rate difference we actually observed (1%) when our client ran her experiment.
The percentage of observations to the right of the red line in the histogram below is the value we are after: the probability of observing as large an increase in savings as 1% from random chance (we do a one-tailed test here because it is easier to understand and visualize).
Histogram Showing the Difference Between Group Means for 10,000 Simulations (Assuming New Design Has Zero Effect on Savings Rates)

In this case that value is very low: in only nine out of the 10,000 experiments we ran (assuming the new design has zero effect on savings) did we observe a difference in group means of 1% or greater. This means that there is only a 0.09% chance of observing a value as high as we did due to random chance! This 0.09% chance is our p-value.
“Huh? Stop throwing random terms at me!”, you say.
There is definitely a lot of statistical terminology around hypothesis testing (and A/B testing) and we will leave most of those for Wikipedia to explain.
Our aim, as always, is to build an intuitive understanding of how and why these tools work — so in general we will avoid terminology in favor of simple explanations where we can.
However, the p-value is a critical concept that you will run into a lot in the data science world so we must confront it.
The p-value (the 0.09% value we calculated above in our simulation) represents:

The probability of observing what we observed if the null hypothesis were true.
Thus, the p-value is the number that we can use to test whether the null hypothesis is true or not.
Based on its definition, it looks like we want as low a p-value as possible — the lower the p-value, the less likely it is that we just got lucky with our experiment.
In practice, we will set a p-value cutoff (called alpha) below which we will reject the null hypothesis and conclude that the observed effect/impact is most likely real (statistically significant).
Now let’s explore a statistical property that lets us quickly calculate p-values.
The Central Limit Theorem

Now is as good a time as any to talk about one of the foundational concepts of statistics: the Central Limit Theorem states that if you add up independent random variables, their normalized sum tends towards a normal distribution as you sum more and more of them.
The Central Limit Theorem holds even if the random variables themselves do not come from a normal distribution.
Translation: if we calculate a bunch of sample averages (assuming our observations are independent of each other, like how flips of a coin are independent), the distribution of all those sample averages will be approximately normal.
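We can see this happen even when the underlying data is clearly non-normal. A quick sketch (the exponential distribution and sample sizes here are my own illustrative choices): exponential draws are strongly skewed, but averages of many such draws lose almost all of that skew.

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal source distribution: exponential draws (skewed right)
raw = rng.exponential(scale=1.0, size=100_000)

# Means of 10,000 independent samples, each of size 100
sample_means = rng.exponential(scale=1.0, size=(10_000, 100)).mean(axis=1)

def skewness(x):
    """Sample skewness: zero for a symmetric (e.g. normal) distribution."""
    x = np.asarray(x)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(f"Skewness of raw draws:    {skewness(raw):.2f}")
print(f"Skewness of sample means: {skewness(sample_means):.2f}")
```

The raw draws have skewness near 2, while the sample means sit close to 0, which is exactly the pull toward normality the theorem describes.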
Q-Q Plot — The Red Line Denotes a Perfectly Normal Distribution

Take a look at the histogram of the mean differences that we calculated earlier. It looks like a normal distribution, right? We can verify normality using a Q-Q plot, which compares the quantiles of our distribution against those of a reference distribution (in this case, the normal distribution). If our distribution is normal, it will adhere closely to the red 45-degree line.
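For readers who want to make their own Q-Q plot, `scipy.stats.probplot` computes the quantile pairs behind one. Here is a sketch using synthetic normal draws as a stand-in for our simulated mean differences (the data here is made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Stand-in for our 10,000 simulated mean differences: small, normal noise
diffs = rng.normal(loc=0.0, scale=0.00316, size=10_000)

# probplot pairs our sample's quantiles with theoretical normal quantiles;
# r close to 1 means the points hug the 45-degree reference line
(osm, osr), (slope, intercept, r) = stats.probplot(diffs, dist="norm")
print(f"Correlation with normal quantiles: r = {r:.4f}")
```

Passing the result to `matplotlib` (or calling `probplot` with a `plot=` argument) draws the familiar Q-Q figure; an `r` very near 1 is the numerical version of "the points lie on the line."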
And it does, cool! So when we ran our savings experiment over and over again, that was an example of the Central Limit Theorem in action!

So why does this matter? Remember how we tested the null hypothesis earlier by running 10,000 experiments. Doesn’t that sound super tiring? In reality, it’s both tiring and costly to repeatedly run experiments.
But thanks to the Central Limit Theorem, we don’t have to! We know what the distribution of our repeated experiments will look like (the normal distribution), and we can use this knowledge to statistically infer the distribution of our 10,000 experiments without actually running them!

Let’s Review What We Know So Far:

- We observe a difference in mean savings rate of 1% between the control and experimental groups, and we want to know whether this is a real difference or just statistical noise.
- We know that we need to take the experiment’s results with a grain of salt because we conducted it on only a small sample of the client’s total user base. If we ran it again on a new sample, the results would change.
- Since we are worried that in reality the new app design has no impact on savings, our null hypothesis is that the difference in means between the control and experimental groups is zero.
- We know from the Central Limit Theorem that if we were to repeatedly sample and conduct new experiments, the results of those experiments (the observed mean difference between the control and experimental groups) would follow a normal distribution.
And from statistics, we know that when we take the difference of two independent random variables, the variance of the result is equal to the sum of the individual variances: Var(X - Y) = Var(X) + Var(Y).

Completing the Job

Nice! We now have everything that we need to run our hypothesis test.
So let’s go ahead and complete the job we received from our client.

Same Histogram as Above (Pasted Again for Reference)

First, before we get biased by looking at the data, we need to choose a cutoff, called alpha (if our calculated p-value is less than alpha, we reject the null hypothesis and conclude that the new design increases savings rates).
The alpha value corresponds to our probability of incurring a false positive: rejecting the null hypothesis when it is actually true. An alpha of 0.05 is pretty standard among statisticians, so we will go with that.
Next, we need to calculate the test statistic.
The test statistic is the numerical equivalent of the histogram above and tells us how many standard deviations away from the null hypothesis value (in our case zero) the observed value (1%) is. We can calculate it like so:

Test Statistic = (Observed Value - Hypothesized Value) / Standard Error

The Standard Error is the standard deviation of the difference between the experimental group’s average savings rate and the control group’s average savings rate.
In the plot above, the standard error is represented by the width of the blue histogram.
Recall that the variance of the difference of two random variables is equal to the sum of the individual variances (and standard deviation is the square root of variance).
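As a quick numerical sanity check of that variance-sum rule, we can draw two large independent samples and compare Var(X - Y) against Var(X) + Var(Y) directly (the sample size and seed here are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two independent samples with the savings-rate parameters from our example
x = rng.normal(0.12, 0.05, size=1_000_000)
y = rng.normal(0.12, 0.05, size=1_000_000)

# For independent X and Y, Var(X - Y) should match Var(X) + Var(Y)
print(f"Var(X) + Var(Y): {x.var() + y.var():.6f}")
print(f"Var(X - Y):      {(x - y).var():.6f}")
```

Both numbers come out near 0.005 (twice the individual variance of 0.0025), matching the rule.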
We can easily calculate the standard error using information that we already have:

Standard Error = sqrt(s²/N + s²/N)

Remember that both the control and experimental groups’ savings rates had a standard deviation of 5%, so our sample variance s² is 0.0025, and N, the number of observations in each group, is 500. Plugging these numbers into the formula, we get a standard error of 0.00316.

In the test statistic formula, the Observed Value is 1% and the Hypothesized Value is 0% (since our null hypothesis is that there is no effect). Plugging those values along with the standard error we just calculated into the test statistic formula, we get a test statistic of 0.01 / 0.00316 = 3.16. Our observed value of 1% is 3.16 standard deviations away from the hypothesized value of 0%.
That’s a lot.
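The standard error and test statistic arithmetic above fits in a few lines (variable names are my own):

```python
import math

sd, n = 0.05, 500            # per-group standard deviation and sample size
observed, hypothesized = 0.01, 0.0  # 1% observed difference vs. null of 0%

# Standard error of the difference between two independent sample means:
# sqrt(s^2/N + s^2/N)
se = math.sqrt(sd**2 / n + sd**2 / n)
t = (observed - hypothesized) / se

print(f"Standard error: {se:.5f}")
print(f"Test statistic: {t:.2f}")
```

This prints a standard error of about 0.00316 and a test statistic of about 3.16, matching the hand calculation.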
We can use the Python code below to calculate the p-value (for a two-tailed test). Note that we use a two-tailed test because we can’t automatically assume that the new design is either the same as or better than the current one: the new design could also be worse, and a two-tailed test accounts for that possibility (more on this here).

```python
from scipy.stats import norm

# Two Tailed Test
print('The p-value is: ' + str(round((1 - norm.cdf(3.16)) * 2, 4)))
```

The p-value we get is 0.0016, which is below our alpha of 0.05, so we reject the null hypothesis and tell our client that yes, it appears that the new app design does indeed help her users save more money.
Hurray, victory!

Finally, note that the p-value of 0.0016 we calculated analytically is different from the 0.0009 that we simulated earlier. That’s because the simulation we ran was one-tailed (one-tailed tests are easier to understand and visualize). We can reconcile the values by multiplying the simulated p-value by two (to account for the second tail) to get 0.0018, pretty close to 0.0016.

Conclusion

In the real world, A/B testing won’t be as clear-cut as our fictitious example.
Most likely our client (or boss) won’t have ready-to-use data for us and we will have to do our own data gathering and cleaning.
Here are some additional practical issues to keep in mind when preparing to A/B test:

How much data do you need? Data is time consuming and expensive to gather.
A badly run experiment might even end up alienating users.
But if you don’t gather enough observations, your tests will not be very reliable.
So you will need to carefully balance the benefits of more observations with the incremental costs of gathering them.
What are the costs of falsely rejecting a true null hypothesis (Type 1 Error) versus the costs of failing to reject a false null hypothesis (Type 2 Error)? Going back to our example, a Type 1 Error is equivalent to green-lighting the new app design when it actually has no effect on savings.
And a Type 2 Error is the same as sticking with the current design when the new one actually encourages people to save more.
We tradeoff between the risk of Type 1 and 2 errors by picking a reasonable cutoff value, alpha.
A higher alpha increases the risk of Type 1 Error and a lower alpha increases the risk of Type 2 Error.
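One way to build intuition for alpha as the Type 1 error rate: simulate many experiments where the null is true and count how often we (wrongly) reject it. A sketch, reusing the parameters from our example (the seed and simulation count are my own choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
alpha, n, sims = 0.05, 500, 10_000
mean, sd = 0.12, 0.05

# Simulate experiments where the null hypothesis is TRUE (no real effect):
# both group means are drawn from the same distribution
control = rng.normal(mean, sd, size=(sims, n)).mean(axis=1)
experimental = rng.normal(mean, sd, size=(sims, n)).mean(axis=1)

# Two-tailed p-value for each simulated experiment
se = np.sqrt(sd**2 / n + sd**2 / n)
z = (experimental - control) / se
p_values = 2 * (1 - norm.cdf(np.abs(z)))

# The false-positive (Type 1 error) rate should land close to alpha
print(f"Type 1 error rate: {(p_values < alpha).mean():.3f}")
```

The rejection rate comes out near 0.05: when the null is true, alpha is exactly the fraction of experiments we would expect to falsely flag as significant.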
Hopefully this was informative, cheers!

If you got all the way here, please check out some other pieces by me:

- This is my favorite out of all my pieces so far, it’s about neural nets
- Why random forests are great
- I miss my Metis bootcamp experience and friends already!
- A project I worked on while at Metis, investing in Lending Club loans
- My first data science post, logistic regression