How to Make Data-Driven Decisions with Contextual Bandits

The main idea with Thompson sampling is: "When in doubt: Explore!"

— Michael Klear, trying to explain Thompson sampling

The simpler case of contextual bandits, known as the multi-arm bandit problem, is easily solved with Thompson sampling. The idea is simple: using our prior understanding of the expected reward distribution behind each action, draw a sample for each action and pick the argmax as the selected action (a minimal sketch of this appears at the end of this post). The tricky part is applying it to the contextual case.

Bayesian Regression for Thompson Sampling

When we introduce context to the multi-arm bandit problem, we can no longer simply learn a reward distribution for every action. Instead, a Bayesian linear regression can learn, for each action, a posterior distribution over the weights that map context features to expected reward. This distribution can be sampled from, opening the door to Thompson sampling in the contextual bandits case (see the second sketch at the end of this post).

Figure: Bayesian regression reward optimization. "Maximum Reward" is the known-best policy; "Cumulative Reward" is the trained policy.

A linear model is only as good as the features it is given. In 2018, we have neural networks to do this for us.

Enter Deep Learning

The 2018 Deep Bayesian Bandits Showdown paper explores a clever adaptation of the Bayesian linear regression solution. In what they simply call the Neural Linear algorithm, the Google Brain researchers use a neural network to learn a set of features to feed into a Bayesian linear regression model. The result? It's simple, computationally efficient, and the proof is in the pudding.

"…making decisions according to a Bayesian linear regression on the representation provided by the last layer of a deep network offers a robust and easy-to-tune approach." — Riquelme et al., Google Brain

You Can Apply State-of-the-art Research

This algorithm is generally applicable and simple at heart. A budding data scientist need only import a class, and she can begin experimenting immediately. So when I stand up in front of a group of data-scientists-in-training and teach them about deep Bayesian contextual bandits, they're eager to try it. The only problem? The only available option, hand-coding an implementation from scratch, isn't always feasible when you have deadlines to meet.

Space Bandits is Born

I decided to take Google Brain's open source code and package it up so my trainees could use it. I'm not ambitious; I took the simplest model (Bayesian Linear) and the best model (Neural Linear) from the TensorFlow open source code, optimized them for use at scale, and uploaded the package to PyPI. It just needed a name: Space Bandits. The models learn from nothing more than logged records of contexts, the actions taken, and the rewards observed. This means that records from arbitrary campaigns can be used for optimization.

Figure: Space Bandits is able to learn a "soft" decision boundary with Thompson sampling and a Bayesian linear model, only observing rewards given actions.

A Note on Reward Design

"I've taken to imagining deep RL as a demon that's deliberately misinterpreting your reward and actively searching for the laziest possible local optima." — Alex Irpan, Google, Deep Reinforcement Learning Doesn't Work Yet

Alex perhaps puts it best, but anybody with experience in machine learning should not be surprised by this. RL algorithms, including contextual bandits, do not readily generalize; they optimize exactly the reward signal you give them, so that signal should reflect the business value you actually care about. If your model finds a "trick" to optimize reward, as long as that reward is profit, you should be happy.

Figure: A Space Bandits Neural Linear model learns nonlinear decision boundaries. Thompson sampling encourages more exploration closer to the "true" decision boundaries, and optimal choices in regions with higher certainty.

Deploy a Deep Bayesian Contextual Bandits Model

You're out of excuses.
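To make the multi-arm bandit case concrete, here is a minimal Thompson sampling sketch of my own (not code from Space Bandits), assuming Bernoulli rewards with Beta posteriors: sample one plausible reward rate per arm, then play the argmax.

```python
import numpy as np

# Thompson sampling for a (non-contextual) multi-armed bandit with
# Bernoulli rewards and Beta posteriors. Illustrative sketch only.

n_arms = 3
successes = np.ones(n_arms)  # Beta prior alpha = 1 per arm
failures = np.ones(n_arms)   # Beta prior beta = 1 per arm
true_rates = np.array([0.05, 0.10, 0.15])  # unknown to the agent

rng = np.random.default_rng(0)
for step in range(10_000):
    # Sample one plausible reward rate per arm from its posterior...
    samples = rng.beta(successes, failures)
    # ...and play the arm whose sample is largest ("when in doubt, explore").
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    # Conjugate update of the chosen arm's Beta posterior.
    successes[arm] += reward
    failures[arm] += 1 - reward

print("posterior means:", successes / (successes + failures))
```

Arms with little data produce noisy samples and keep getting explored; arms with lots of data produce tight samples and get exploited.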
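For the contextual case, the same trick works with a Bayesian linear regression per action: sample a weight vector from each action's posterior, score the current context with each sample, and play the argmax. The sketch below is a simplified illustration with a fixed noise variance; it is not the Neural Linear implementation from the paper or from Space Bandits.

```python
import numpy as np

# Thompson sampling with a per-action Bayesian linear regression.
# Gaussian posterior over weights with a fixed noise variance.

n_actions, n_features = 3, 5
noise_var = 1.0
rng = np.random.default_rng(1)

# Posterior parameters per action: precision matrix A and vector b,
# so that weights ~ N(A^-1 b, noise_var * A^-1).
A = np.stack([np.eye(n_features) for _ in range(n_actions)])
b = np.zeros((n_actions, n_features))

def choose_action(context):
    """Sample weights from each action's posterior, score the context,
    and return the argmax action."""
    scores = []
    for a in range(n_actions):
        mean = np.linalg.solve(A[a], b[a])
        cov = noise_var * np.linalg.inv(A[a])
        w = rng.multivariate_normal(mean, cov)
        scores.append(context @ w)
    return int(np.argmax(scores))

def update(context, action, reward):
    """Standard Bayesian linear regression update for the played action."""
    A[action] += np.outer(context, context)
    b[action] += reward * context

# Example interaction loop against a made-up linear environment.
true_w = rng.normal(size=(n_actions, n_features))
for step in range(2000):
    x = rng.normal(size=n_features)
    a = choose_action(x)
    r = x @ true_w[a] + rng.normal(scale=1.0)
    update(x, a, r)
```

The Neural Linear idea keeps exactly this Bayesian linear head, but replaces the raw context with features learned by a neural network.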

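If you want to try the library itself, usage might look roughly like the following. The class and method names here (LinearBandits, update, action), the constructor arguments, and the example data are assumptions for illustration rather than documented API; check the Space Bandits project on PyPI for the actual interface.

```python
# pip install space-bandits
import numpy as np

# Hypothetical usage sketch: class name, method names, and signatures
# are assumptions, not verified against the library's documentation.
from space_bandits import LinearBandits  # assumed import path

num_actions = 3    # e.g. three promotional campaigns
num_features = 2   # e.g. customer age and average order value

model = LinearBandits(num_actions, num_features)  # assumed constructor

# Fit on a historic record: context, the action that was taken, the reward observed.
context = np.array([34.0, 120.0])
model.update(context, 1, 25.0)  # assumed signature: (context, action, reward)

# Thompson sampling pick for a new customer.
new_context = np.array([27.0, 80.0])
print("recommended action:", model.action(new_context))  # assumed method
```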