Beyond A/B Testing: Multi-armed Bandit ExperimentsAn study of Google Analytics’ stochastic k-armed bandit test with Thompson sampling and Monte Carlo simulationShaw LuBlockedUnblockFollowFollowingApr 3A/B Testing RecapA/B testing relies on classic statistical test for statistical significance.
When we come up with a new product feature, we probably want to test whether it is useful before launching it to the entire user base.
The test involves two groups: a treatment group (who have access to the new feature) and a control group.
Then we measure a key metric for the two groups: average time on site (social network), average checkout time (e-commerce), or click through rate (online ads).
The difference between the groups is tested for statistical significance.
Classical statistical test (z-test, t-test) guarantees that false positive rate is no more than α, which is often set to 5%.
This means that when there is no difference between the treatment group and control group, the test will find a statistical difference by chance 5% of the time.
A balanced AB test would allocate equal amount of traffic to each group, until reaching sufficient sample size.
We cannot adjust traffic during the test based on what is observed.
So the disadvantage of A/B testing is obvious: if the treatment group is clearly superior, we still have to spend lots of traffic on the control group, in order to obtain statistical significance.
Multi-armed BanditWhereas A/B testing is a frequentist approach, we can also conduct the test from the Bayesian way.
It is understandable that once we see one treatment is clearly better, we want to add more users to that treatment right away.
Multi-armed bandit experiment makes this possible in a controlled, scientific way.
The foundation of the multi-armed bandit experiment is Bayesian updating.
Each treatment (called “arm”) has a probability of success, which is modeled as a Bernoulli process.
The probability of success is unknown, and is modeled by a Beta distribution.
As the experiment goes on, each arm receives user traffic, and the Beta distribution is updated accordingly.
To visualize the updating process, see my earlier post here.
In this post, I’m using the Google Analytics example of online advertisements matching.
Suppose there are K arms.
Each arm is an ad with click-through rate (ctr) that follows a Beta distribution.
The goal of the experiment is to find the ad with the highest click through rate.
Thompson SamplingIn a nutshell, Thompson sampling is a greedy method that always chooses the arm that maximizes expected reward.
In each iteration of the bandit experiment, Thompson sampling simply draws a sample ctr from each arm’s Beta distribution, and assign the user to the arm with the highest ctr.
Applying Bayesian updating to the Beta distribution of each armThe elegant part of bandit experiment is that Thompson sampling and Bayesian update work hand-in-hand.
If one of the arms is performing well, its Beta distribution parameters are updated to remember this, and Thompson sampling will more likely draw a high ctr from this arm.
Throughout the experiment, high-performing arms are rewarded with more traffic, whereas under-performing arms are punished with less traffic.
Monte Carlo SimulationWhereas the Beta distribution estimates the ctr, we need to know how confident we are about each estimate of ctr.
If we are confident enough about the arm that currently has the highest ctr, we can end the experiment.
Monte Carlo simulationThe way Monte Carlo simulation works is to randomly draw samples from each of the K arms multiple times, and empirically compute how often each of the arms wins (with highest ctr).
If the winning arm is beating the second arm by a large enough margin, the experiment is ended.
TerminationGoogle Analytics introduces the concept of “value remaining in experiment” (more details here).
In each Monte Carlo simulation, the value remaining is computed.
If we choose α = 5%, then the experiment terminates when 95% of the samples in a Monte Carlo simulation have zero value remained.
AdvantageThe main advantage of bandit experiment is that it terminates faster than A/B test because it requires much smaller sample.
In a two-armed experiment with click-through rate 4% and 5%, traditional A/B testing requires 11,165 in each treatment group at 95% significance level.
With 100 users a day, the experiment will take 223 days.
In the bandit experiment, however, simulation ended after 31 days, at the above termination criterion.
Amount of traffic sent to the losing arm on each day (“mistakes”).
A second advantage of bandit experiment is that the experiment is making fewer mistakes than A/B testing.
A balanced A/B test would always send 50% of traffic to each group.
The plot above shows that as the experiment progresses, fewer and fewer traffic was sent to the losing arm.
Here is how a 5-armed bandit experiment ran in simulation.
We see that the red arm (with ctr 4.
4%) was mistaken as the winning arm during the first 150 iterations, and we were diverting as much as 80% traffic to the losing arm.
But the true blue arm (ctr 4.
8%) caught up and emerged as the true winner.
TradeoffAs it turns out, there is no free lunch, and the convenience of smaller sample size comes at a cost of a larger false positive rate.
Although I had used α empirically as the false positive rate to terminate experiment, the false positive rate is higher than α, after repeating the simulation many times.
sample size (traffic) and probability of finding correct winnerSpecifically, an α of 5% gives empirically probability of finding the winning arm about 91% of the time, instead of 95%.
The smaller the α we set, the larger the sample size we need (shown in red), which is consistent with how A/B testing behaves.
ConclusionAs it turns out, there is no definite winner, and it is important for product manager, data scientist and practitioners to understand the strength and weakness of the two methods before making a choice.
Multi-armed bandit test is preferred during the following situations:When the cost of sending users to a losing arm is high.
In this example, matching users with a bad advertisement simply results in less revenue.
The loss is affordable, and so A/B testing will do just fine.
In other cases, such as when testing two different account recovery methods, a failure in each arm means permanent loss of a user, and multi-armed bandit experiment is clearly a better option.
For early startups with insufficient user traffic, multi-armed bandit experiment works better because it requires a smaller sample size, terminates faster, and is more agile than A/B testing.
When there are more than two variants to test, bandit experiment can help us find the winner quickly.
In bandit experiment, it is common to test 4~8 variants at a time, whereas A/B testing is limited to two groups per test.
On the other hand, A/B testing is a better option when the company has large enough user base, when it’s important to control for type I error (false positives), and when there are few enough variants that we can test each one of them against the control group one at a time.