the fun part!

Collect and analyze data

You'll want to pull data and calculate your previously defined success metric for version A, for version B, and for the difference between them.
If there was no difference overall, you might also want to segment by platform, source type, geography, etc., if applicable.
You may find version B performed better or worse for certain segments.
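As a quick sketch of what a segment breakdown looks like in practice, the snippet below computes per-segment registration rates and the lift of B over A. All of the counts and segment names here are hypothetical, just to illustrate the shape of the analysis:

```python
# Hypothetical per-segment counts: (registered, total FTUs) for each version
segments = {
    "iOS":     {"A": (300, 2400), "B": (330, 2400)},
    "Android": {"A": (180, 1600), "B": (222, 1600)},
}

for platform, versions in segments.items():
    rate_a = versions["A"][0] / versions["A"][1]  # registration rate, version A
    rate_b = versions["B"][0] / versions["B"][1]  # registration rate, version B
    print(f"{platform}: A={rate_a:.1%}  B={rate_b:.1%}  lift={rate_b - rate_a:+.1%}")
```

A breakdown like this can reveal, for example, that B's overall win is driven entirely by one platform.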
Check for statistical significance

The statistical theory behind this approach is better explained here, but the basic idea is to figure out whether the difference in results between A and B was due to what you changed or due to randomness/natural variance.
This is determined by comparing the test statistic (and resulting p-value) to your significance level.
Protip: You can also think of statistical significance in terms of the confidence level, which is simply 1 - α. Best practice is usually setting the confidence level to 1 - .05 = .95, or 95%. This means that if you were to repeat the test 100 times, you would expect the true result to fall within your confidence interval in 95 of those runs. The significance level and confidence level can both be used to determine statistical significance, but this example uses the significance level.
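The two framings are interchangeable, which a couple of lines of code make concrete (the p-value below is just a placeholder for illustration):

```python
alpha = 0.05             # significance level
confidence = 1 - alpha   # confidence level: 0.95, i.e. 95%

p_value = 0.0115         # hypothetical test result
# Either framing yields the same decision about significance
assert (p_value < alpha) == (1 - p_value > confidence)
```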
The basic steps to determine significance are:

1. Calculate the test statistic. The test statistic is the value we use to compare the results between A and B. It accounts for both how much difference we see between our results and how much variability we have in our data. Often either the Z statistic or the t statistic is used as the test statistic, depending on what you know about the actual population. Read more here on when to use which, although in practice there will be little difference between the two if sample sizes are large.

2. Use the test statistic to calculate the p-value. The p-value is the probability that the difference in results between the A and B versions is due purely to chance. So a really low p-value like 1% means it's extremely unlikely that the difference between A and B is due to chance. The p-value is calculated from the test statistic. You may recall looking up p-values from a table in your college stats class, but luckily there are online calculators today that'll do that for you.

3. Compare the p-value to the significance level. If p-value < significance level, we can reject the null hypothesis and have evidence for the alternative. If p-value ≥ significance level, we cannot reject the null hypothesis.
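The three steps above can be sketched end-to-end in code. This is a minimal implementation of a one-tailed two-proportion z-test, one common choice of test statistic for conversion rates; the function name and all counts below are hypothetical:

```python
from math import erf, sqrt

def one_tailed_z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test (one-tailed): is B's rate higher than A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis (no real difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                        # step 1: test statistic
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # step 2: one-tailed p-value
    return z, p_value, p_value < alpha          # step 3: compare to alpha

# Hypothetical counts: 480/4000 FTUs registered on A, 552/4000 on B
z, p, significant = one_tailed_z_test(480, 4000, 552, 4000)
```

The online calculators mentioned above do the same arithmetic; writing it out just makes each of the three steps explicit.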
One-tailed hypothesis test showing when the p-value is less than the significance level.
EXAMPLE:

Collect & analyze data

Since we want to be able to compare the registration rates of A and B, we'll want to collect data on the number of FTUs and the number of those FTUs that successfully registered over the 20 days.
See example data displayed below.
Table comparing registration rate of new users that received Version A vs B.
Great! Looks like B is performing better than A.
BUT before we pat ourselves on the back, we need to check for significance.
Check for statistical significance

We will use the t-statistic, since let's say we don't know the population statistics. If you don't know anything about the population and population statistics (e.g., mean and variance), you are better off using the t-statistic. If you do know them, you can use the Z statistic.
Calculate the t-statistic

I used an online calculator here that also shows you the formula and the raw calculation.

t-statistic = 2.
Calculate the p-value

You can use a p-value calculator or the Excel t-test formula TTEST, which will return the p-value. Remember, it is a one-tailed, two-sample test with a significance level of 0.05.

P-value = .0115
Compare the p-value to the significance level

P-value = .0115 < .05. The result is significant! This tells us there's only a ~1.15% chance that the difference in registration rate was the result of chance. So we can infer that the alternative (i.e., that B did have a higher registration rate than A) is true.
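Under a normal approximation (reasonable here given the sample sizes involved), this comparison can be sanity-checked in a few lines using Python's standard library. The 0.0115 p-value is the one from the example; everything else is derived from it:

```python
from statistics import NormalDist

alpha = 0.05
p_value = 0.0115                            # p-value from the example above

# One-tailed critical value at alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha)    # ~1.645
# Test statistic implied by the observed p-value
z_obs = NormalDist().inv_cdf(1 - p_value)   # ~2.27

print(p_value < alpha)    # significant on the p-value scale
print(z_obs > z_crit)     # the equivalent check on the statistic scale
```

Comparing the p-value to α and comparing the test statistic to the critical value are two views of the same decision.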
Drawing conclusions

So you've looked at the results and checked for significance. A/B tests can end in the following ways:

1. Control A wins, or no difference. For the majority of tests, the new version will perform the same as or worse than the original. Barring reasons that would result in an invalid test (i.e., a test where results are not accurate or significant), the new version's worse performance can be because of:
- Poor messaging/branding of the value prop
- An unattractive value prop
- Poor user experience

In this scenario, you could dig into the data or conduct user research to understand why the new version didn't perform better as expected. This will in turn help inform your next test.
2. Variant B wins. The A/B test supported your hypothesis that version B resulted in better performance than A. Great! After sharing results, you can roll out the experiment to 100%.
It’s good to keep an eye on the success and topline metric afterward.
In our example, we concluded that version B did have a higher registration rate than A so we will push live to all users.
We would then monitor the registration rate and DAU growth in the coming weeks.
Wrapping it all up

Regardless of whether your test was successful or not, treat every experiment as a learning opportunity.
Use what you’ve learned to help develop your next hypothesis.
For example, you can build on the previous test, or focus on another area to optimize.
The possibilities are endless on what you can test and achieve.
Happy experimenting!

Thanks for reading! Want to connect? Reach out @lisamxu. Feedback and comments welcome! Special thanks to Thomas Martino, Joann Kuo, Adam Gadra, Sarah Bierman and Gabriel Strauss for their review.
Resources

- Templates for A/B testing here (Z statistic) and here (t-test)
- Sample size calculators here and here
- t and Z statistic calculator
- P-value calculator and Excel