Uhmm… well, then the decision is very likely to be made by people’s gut.
Second, once a test reaches statistical significance, people feel 100% confident in the result.
Think about the definition of the p-value: the probability of observing results at least as extreme as the ones you got, assuming the null hypothesis is true.
Based on that definition, the confidence level (which is also your tolerance for making mistakes) must be predefined during experiment design.
You still have to keep calm and not become overly optimistic when results reach statistical significance.
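To make the definition concrete, here is a minimal sketch that estimates a p-value by brute force: it simulates many experiments in which the null hypothesis is true and counts how often the simulated difference is at least as extreme as the observed one. The conversion counts, the function name `simulated_p_value` and the 0.05 threshold are assumptions made for illustration only.

```python
import random

ALPHA = 0.05  # significance level, fixed before looking at any data


def simulated_p_value(conv_a, n_a, conv_b, n_b, n_sims=2000, seed=0):
    """Estimate a two-sided p-value for a difference in conversion rates.

    Simulates experiments in which the null hypothesis is true (both
    variants share the pooled conversion rate) and returns the share of
    simulated differences at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(conv_b / n_b - conv_a / n_a)
    pooled = (conv_a + conv_b) / (n_a + n_b)
    extreme = 0
    for _ in range(n_sims):
        sim_a = sum(rng.random() < pooled for _ in range(n_a))
        sim_b = sum(rng.random() < pooled for _ in range(n_b))
        # Small tolerance so floating-point ties count as "as extreme".
        if abs(sim_b / n_b - sim_a / n_a) >= observed - 1e-12:
            extreme += 1
    return extreme / n_sims


# Hypothetical counts: 50/500 conversions for A, 70/500 for B.
p_value = simulated_p_value(50, 500, 70, 500)
print(f"p-value ~ {p_value:.3f}, significant at {ALPHA}: {p_value < ALPHA}")
```

Even a p-value below the threshold still leaves room for a false positive, which is why the threshold has to be chosen before the test starts.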
Another mistake is stopping a test too early instead of waiting until a controlled A/B test has achieved a decent significance level.
Another example is constantly checking the significance of metrics while testing the performance of variants (ad copy, images, etc.).
The test is then stopped prematurely as soon as the p-value crosses the significance threshold, which greatly increases the likelihood of a false positive.
Needless to say, this leaves us with no objective estimate of the validity of the data we operate with.
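The effect of peeking can be checked with a small simulation. The sketch below runs A/A tests (both variants have the same true conversion rate, so every "significant" result is a false positive) and compares two policies: declaring a winner the first time any interim check is significant, versus checking once at the end. The batch sizes, the 10% conversion rate and the pooled two-proportion z-test are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_sims=400, looks=20, batch_size=100,
                         rate=0.10, alpha=0.05, seed=42):
    """Return (false positive rate with peeking, rate with one final check)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        ever_significant = False
        p_value = 1.0
        for _ in range(looks):
            # Both variants draw from the same rate: the null is true.
            conv_a += sum(rng.random() < rate for _ in range(batch_size))
            conv_b += sum(rng.random() < rate for _ in range(batch_size))
            n += batch_size
            p_value = two_proportion_p_value(conv_a, n, conv_b, n)
            if p_value < alpha:
                ever_significant = True  # a peeker would have stopped here
        peeking_hits += ever_significant
        final_hits += p_value < alpha  # p-value at the planned end only
    return peeking_hits / n_sims, final_hits / n_sims


peeking_rate, final_rate = false_positive_rates()
print(f"false positives with peeking:   {peeking_rate:.2%}")
print(f"false positives, single check:  {final_rate:.2%}")
```

With a single check the false positive rate stays close to the chosen alpha, while stopping at the first significant peek pushes it well above that level.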
It also happens that people quote a single absolute number for how much lift a change produced when explaining A/B testing results, while the A/B test only tells you which variant is better.
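One way to see why a single lift number is shaky is to put an interval around it. The sketch below computes a normal-approximation (Wald) confidence interval for the difference in conversion rates. The function name and the counts are hypothetical, and the Wald interval is just one common choice, assumed here for simplicity.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate_b - rate_a), in absolute terms."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff - z * se, diff + z * se


# Hypothetical counts: 100/1000 conversions for A, 120/1000 for B.
low, high = lift_confidence_interval(100, 1000, 120, 1000)
print(f"observed lift: 2.0 points, 95% interval: [{low:.4f}, {high:.4f}]")
```

When the interval is wide, or includes zero as it does for these counts, quoting the point estimate alone overstates what the test has shown.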
Common Mistakes with Statistical Power
The first thing that can happen with an underpowered test is that there is a practically significant change but you fail to detect it.
Using such a test, one would conclude that the tested variants had no effect when they actually did (this is called a false negative).
This common mistake wastes a lot of budget on tests that never receive enough traffic to have a proper chance of demonstrating the desired effect.
What use is running test after test if they have an inadequate sample size to detect meaningful improvements?
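A rough sample size can be estimated before the test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions. The baseline and target rates are hypothetical, and the defaults (5% significance, 80% power) are common conventions assumed here, not values taken from this post.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, target_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls false positives
    z_beta = NormalDist().inv_cdf(power)           # controls false negatives
    variance = (baseline_rate * (1 - baseline_rate)
                + target_rate * (1 - target_rate))
    effect = target_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical goal: detect a lift from a 10% to a 12% conversion rate.
needed = sample_size_per_variant(0.10, 0.12)
print(f"visitors needed per variant: {needed}")
```

If the traffic you can realistically send to the test is far below this number, the test is underpowered before it even begins.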
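But when you run an overpowered test, power goes up (the false negative rate goes down), yet even a tiny difference with no practical value can come out as statistically significant. The false positive rate itself stays at the significance level you predefined; what suffers is your confidence that a significant result is worth acting on.
There is a trade-off between Type I and Type II errors: at a fixed sample size, lowering one raises the other. We barely discuss it during A/B testing.
But it is always good to know the math and statistical concepts behind them.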
In short, there are two ways to get power wrong: to conduct an underpowered or an overpowered test.
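The trade-off between the two error types can be sketched numerically. The function below approximates the power of a two-sided two-proportion z-test, then evaluates it at a fixed sample size for several significance levels. The rates and sample size are hypothetical, and the approximation ignores the negligible chance of a significant result in the wrong direction.

```python
import math
from statistics import NormalDist


def approximate_power(rate_a, rate_b, n_per_variant, alpha):
    """Approximate power of a two-sided two-proportion z-test."""
    se = math.sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
                   / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(rate_b - rate_a) / se - z_alpha)


# Same traffic, stricter and stricter tolerance for false positives.
for alpha in (0.10, 0.05, 0.01):
    power = approximate_power(0.10, 0.12, 2000, alpha)
    print(f"alpha={alpha:.2f}  power={power:.2f}  beta={1 - power:.2f}")
```

At the same sample size, tightening alpha (fewer Type I errors) lowers power (more Type II errors). The only way to improve both at once is more traffic.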
Wrong post-analysis
Even if you set up everything correctly, your post-analysis could still go wrong.
Suppose your data distribution doesn’t look like either p1 or p2.
[Figures: p1, p2]
But it might look similar to p3.
Your analysis goes wrong because a parametric approach is not suitable in some cases (for example, if your distribution looks like p3).
This is where parametric tests could go wrong.
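As one illustration of a non-parametric alternative (not necessarily the approach the follow-up post will take), the sketch below runs a permutation test on the difference in means. It makes no assumption about the shape of the distribution. The function name and the skewed "revenue per visitor" data are hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_b) / n_b - sum(group_a) / n_a)
    combined = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_perm):
        # Under the null, group labels are arbitrary, so reshuffle them.
        rng.shuffle(combined)
        diff = abs(sum(combined[n_a:]) / n_b - sum(combined[:n_a]) / n_a)
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical, heavily skewed data (log-normal), 200 visitors per variant.
data_rng = random.Random(1)
revenue_a = [data_rng.lognormvariate(0.0, 1.0) for _ in range(200)]
revenue_b = [data_rng.lognormvariate(0.2, 1.0) for _ in range(200)]
print(f"permutation p-value: {permutation_p_value(revenue_a, revenue_b):.3f}")
```

The test compares the observed difference with the differences produced by random relabeling, so it stays valid when the data are far from normal.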
(I will explain more in the follow-up post)
LAST…
Statistics can help the marketer achieve both of those goals, as well as evaluate the success of the marketing effort and provide data on which to base changes to the marketing program.
Experimentation is a great way to learn and gain actionable insights.
Therefore it’s necessary to get an intuitive feel for the math underlying A/B test analysis.
A/B testing is not a perfect method, but it should at least be done correctly, since it is still the most practical test in the business world.
There will be follow-up posts.
In the next one, I will explain the disadvantages of A/B testing and some alternative approaches.
Reference:
Marketers Flunk the Big Data Test, Harvard Business Review
Marketers failing to ‘deal intelligently’ with data, Marketing Week
p-Hacking and False Discovery in A/B Testing, Ron Berman (University of Pennsylvania, The Wharton School)
Underpowered or an Overpowered Test, University of Texas