When to Run Multi Armed Bandit Tests Instead of A/B/n Tests

When should you use bandit tests, and when is A/B/n testing best?

Though there are some strong proponents (and opponents) of bandit testing, there are certain use cases where bandit testing may be optimal. Question is, when?

First, let’s dive into bandit testing and talk a bit about the history of the N-Armed Bandit Problem.

The Multi-Armed Bandit Problem

The multi-armed bandit problem is a classic thought experiment. Imagine this scenario:

You’re in a casino. There are many different slot machines (known as ‘one-armed bandits’, as they’re known for robbing you), each with a lever (and arm, if you will). You think that some slot machines payout more frequently than others do, so you’d like to maximize this. You only have a limited amount of resources – if you pull one arm, then you’re not pulling another arm. Of course, the goal is to walk out of the casino with the most money. Question is, how do you learn with slot machine is the best and get the most money in the shortest amount of time?

bandit tests

If you knew which lever would pay out the most, you would just pull that lever all day. In regards to optimization, the applications of this problem are obvious. As Andrew Anderson said in an Adobe article:

Andrew AndersonAndrew Anderson:
“In an ideal world, you would already know all possible values, be able to intrinsically call the value of each action, and then apply all your resources towards that one action that causes you the greatest return (a greedy action). Unfortunately, that is not the world we live in, and the problem lies when we allow ourselves that delusion. The problem is that we do not know the value of each outcome, and as such need to maximize our ability of that discovery.”

The Practical Differences Between A/B Testing and Bandit Testing

A/B split testing is the current default for optimization, and you know what it looks like:

ab testing conversionxl

You send 50% of your traffic to the control and 50% of your traffic to variation, run the test ‘til it’s valid, and then decide whether to implement the winning variation.


In statistical terms, a/b testing consists of a short period of pure exploration, where you’re randomly assigning equal numbers of users to Version A and Version B. It then jumps into a long period of pure exploitation, where you send 100% of your users to the more successful version of your site.

In Bandit Algorithms for Website Optimization, the author outlines two problems with this:

  • It jumps discretely from exploration to exploitation, when you might be able to transition more smoothly.
  • During the exploratory phase (the test), it wastes resources exploring inferior options in order to gather as much data as possible.

In essence, the difference between bandit testing and a/b/n testing is how they deal with the Explore-Exploit dilemma.

As I mentioned, A/B testing explores first then exploits (keeps only winner).

Explore Exploit ab testing

Bandit testing tries to solve the explore-exploit problem in a different way. Instead of two distinct periods of pure exploration and pure exploitation, bandit tests are adaptive, and simultaneously include exploration and exploitation.

bandit tests vs ab tests

So, bandit algorithms try to minimize opportunity costs and minimize regret (the difference between your actual payoff and the payoff you would have collected had you played the optimal (best) options at every opportunity.). Chris Stucchio from VWO offers the following explanation of bandits:

stucchioChris Stucchio:

“Anytime you are faced with the problem of both exploring and exploiting a search space, you have a bandit problem. Any method of solving that problem is a bandit algorithm – this includes A/B testing. The goal in any bandit problem is to avoid sending traffic to the lower performing variations. Virtually every bandit algorithm you read about on the internet (primary exceptions being adversarial bandit, my jacobi diffusion bandit, and some jump process bandits) makes several mathematical assumptions:

a) Conversion rates don’t change over time.
b) Displaying a variation and observing a conversion happen instantaneously. This means the following timeline is impossible: 12:00 Visitor A sees Variation 1. 12:01 visitor B sees Variation 2. 12:02 Visitor A converts.
c) Samples in the bandit algorithm are independent of each other.

A/B testing is a fairly robust algorithm when these assumptions are violated. A/B testing doesn’t care much if conversion rates change over the test period – i.e., if Monday is different from Saturday, just make sure your test has the same number of Mondays and Saturdays and you are fine. Similarly, as long as your test period is long enough to capture conversions, again – it’s all good.”

In essence, there shouldn’t be an ‘a/b testing vs. bandit testing, which is better?’debate, because it’s comparing apples to oranges. These two methodologies serve two different needs.

The Benefits of Bandits

The first question to answer, before answering when to use bandit tests, is why to use bandit tests. What are the advantages?

ma bandit tests google

Google Content Experiments uses bandit algorithms. They reason that the benefits of bandits are plentiful:

“They’re more efficient because they move traffic towards winning variations gradually, instead of forcing you to wait for a “final answer” at the end of an experiment. They’re faster because samples that would have gone to obviously inferior variations can be assigned to potential winners. The extra data collected on the high-performing variations can help separate the “good” arms from the “best” ones more quickly.”

In essence, people like bandit algorithms because of the smooth transition between exploration and exploitation, the speed, and the automation.

A Few Flavors of Bandit Methodology

There are tons of different bandit methods. Like a lot of debates around testing, a lot of this is of secondary importance – misses the forest for the trees.

Without getting too caught up in the nuances between methods, I’ll explain the simplest (and most common) method: the epsilon-Greedy algorithm. Knowing this will allow you to understand the broad strokes of what bandit algorithms are.

Epsilon-Greedy Method

“One strategy that has been shown to perform well time after time in practical problems is the epsilon-greedy method. We always keep track of the number of pulls of the lever and the amount of rewards we have received from that lever. 10% of the time, we choose a lever at random. The other 90% of the time, we choose the lever that has the highest expectation of rewards.” (source)

Okay, so what do I mean by Greedy? In computer science, a greedy algorithm is one that always takes the action that seems best at that moment. So, an epsilon-greedy algorithm is almost a fully greedy algorithm – most of the time it picks the option that makes sense at that moment.

However, every once in a while, an epsilon-Greedy algorithm chooses to explore the other available options.


So epsilon-greedy is a constant play between:

  • Explore: randomly select action certain percent of time (say 20%)
  • Exploit (play greedy): pick the current best percent of time (say 80%)

This image (and the article from which it came) explains epsilon-Greedy really well:

epsilon greedy bandit test

There are some pros and cons to the epsilon-Greedy method. Pros include:

  • It’s simple and easy to implement.
  • It’s usually effective.
  • It’s not as affected by seasonality.

Some cons:

  • It doesn’t use a measure of variance
  • Should you decrease exploration over time?

What About Other Algorithms?

Like I said, a bunch of other bandit methods try to solve these cons in different ways. Here are a few:

Could write 15,000 words on this, but instead, just know the bottom line is that all the other methods are simply trying to best balance exploration (learning) with exploitation (taking action based on current best information).


Note: if you want to nerd out on the different bandit algorithms, this is a good paper to check out.

When To Use Bandit Tests Instead of A/B/n Tests?

There’s a high level answer, and then there are some specific circumstances in which bandit works well. For the high level answer, if you have a research question where you want to understand the effect of a treatment and have some certainty around your estimates, a standard a/b test experiment will be best.

“If on the other hand, you actually care about optimization, rather than understanding, bandits are often the way to go.”

Specifically, bandit algorithms tend to work well for really short tests – and paradoxically – really long tests (ongoing tests). I’ll split up the use cases into those two groups.

1. Short Tests

Bandit algorithms are conducive for short tests for clear reasons – if you were to run a classic a/b test instead, you’d not even be able to enjoy the period of pure exploitation (after the experiment ended). Instead, bandit algorithms allow you to adjust in real time and send more traffic, more quickly, to the better variation. “Whenever you have a small amount of time for both exploration and exploitation, use a bandit algorithm.”

Here are specific use cases within short tests:

a. Headlines

Headlines are the perfect use case for bandit algorithms. Why would you run a classic a/b tests on a headline if, by the time you learn which variation is best, the time where the answer is applicable is over? News has a short half-life, and bandit algorithms determine quickly which is the better headline.

Image Source

Imagine you’re a newspaper editor. It’s not a slow day; a murder victim has been found. Your reporter has to decide between two headlines, “Murder victim found in adult entertainment venue” and “Headless Body found in Topless Bar.” As Chris says, geeks now rule the world – this is now usually an algorithmic decision, not an editorial one. (Also, this is probably how sites like Upworthy and Buzzfeed do it).

b. Short Term Campaigns & Promotions

Similar to headlines, there’s a big opportunity cost if you choose to a/b test. If your campaign is a week long, you don’t want to spend the week exploring with 50% of your traffic, because once you learn anything, it’s too late to exploit the best option.

This is especially true with holidays and seasonal promotions. According to Stephen Pavlovich from conversion.com, recommends bandits for short term campaigns:

stephen pavlovichStephen Pavlovich:

“A/B testing isn’t that useful for short-term campaigns. If you’re running tests on an ecommerce site for Black Friday, an A/B test isn’t that practical – you might only be confident in the result at the end of the day. Instead, a MAB will drive more traffic to the better-performing variation – and that in turn can increase revenue.”

2. Long Term Testing

Oddly enough, bandit algorithms are effective in long term (or ongoing) testing. As Stephen Pavlovich put it:

stephen pavlovichStephen Pavlovich:
“A/B tests also fall short for ongoing tests – in particular, where the test is constantly evolving. Suppose you’re running a news site, and you want to determine the best order to display the top 5 sports stories in. A MAB framework can allow you to set it and forget. In fact, Yahoo! actually published a paper on how they used MAB for content recommendation, back in 2009.”

There are a few different use cases within ongoing testing as well:

a. “Set it and forget it” (Automation for Scale)

Because bandits automatically shift traffic to higher performing (at the time) variations, you have a low-risk solution for continuous optimization. Here’s how Matt Gershoff put it:

Matt GershoffMatt Gershoff:

“Bandits can be used for ‘Automation for Scale.’ Say you have many components to continuously optimize, the bandit approach gives you a framework to partially automate the optimization process for low risk, high transaction problems that are too costly to have expensive analysts pour over”

Ton Wesseling, founder of Testing Agency also mentions that bandits can be great for testing on high traffic pages after learning from a/b tests:

ton wesselingTon Wesseling:“Just give some variations to a bandit and let it run. Preferable you use a contextual bandit. We all know the perfect page for everyone does not exist, it differs per segment. The bandit will show the best possible variation to each segment.”

b. Targeting

Another long term use of bandit algorithms is targeting – which is specifically pertinent when it comes to serving specific ads and content to user sets. As Matt Gershoff put it:

Matt GershoffMatt Gershoff:

“Really, true optimization is more of an assignment problem than a testing problem. We want to learn the rules that assign the best experiences to each customer. We can solve this using what is known as a contextual bandit (or, alternatively, a reinforcement learning agent with function approximation). The bandit is useful here because some types of users may be more common than others. The bandit can take advantage of this, by applying the learned targeting rules sooner for more common users, while continuing to learn (experiment) on the rules for the less common user types.”

Ton also mentioned that you can learn from contextual bandits:

ton wesselingTon Wesseling:

“By putting your A/B-test in a contextual bandit with segments you got from data research, you will find out if certain content is important for certain segments an not for others. That’s very valuable – you can use these insights to optimize the customer journey for every segment. This can be done with looking into segments after an A/B-test too, but it’s less time consuming to let the bandit do the work.”

Further reading: A Contextual-Bandit Approach to Personalized News Article Recommendation

c. Blending Optimization with Attribution

Finally, bandits can be used to optimize problems across multiple touch points. This communication between bandits ensures that they’re working together to optimize the global problem and maximize results. Matt Gershoff gives the following example:

Matt GershoffMatt Gershoff:
“You can think of Reinforcement Learning as multiple bandit problems that communicate with each other to ensure that they are all working together to find the best combinations across all of the touch points. For example, we have had clients that placed a product offer bandit on their site’s home page and one in their call center’s automated phone system. Based on the sales conversions at the call center, both bandits communicated local results to ensure that they are working in harmony optimize the global problem.”

Caveats: Potential Drawbacks of Bandit Testing

Even though there are tons of blog posts with slightly sensationalist titles, there are a few things to consider before jumping on the bandit bandwagon.

First, multi-armed-bandits can be difficult to implement. As Shana Carp said on a Growthhackers.com thread:

“MAB is much much more computationally difficult to pull off unless you know what you are doing. The functional cost of doing it is basically the cost of three engineers – a data scientist, one normal guy to put into code and scale the code of what the data scientist says, and one dev-ops person. (Though the last two could probably play double on your team) It is really rare to find data scientists who program extremely well.”

The second thing, though I’m not sure it’s a big issue, is the time it takes to reach significance. As Paras Chopra pointed out, “There’s an inverse relationship (and hence a tradeoff) between how soon you see statistical significance and average conversion rate during the campaign.”


vwo bandit tests

Chris Stucchio also outlined what he called the Saturday/Tuesday problem. Basically, imagine you’re running a test on two headlines:

  1. Happy Monday! Click here to buy now.
  2. What a beautiful day! Click here to buy now.

Then suppose you run a bandit algorithm, starting on Monday:

  • Mon: 1000 displays for “Happy Monday”, 200 conversions. 1000 displays for “Beautiful Day”, 100 conversions.
  • Tues: 1900 displays for “Happy Monday”, 100 conversions. 100 displays for “Beautiful Day”, 10 conversions.
  • Wed: 1900 displays for “Happy Monday”, 100 conversions. 100 displays for “Beautiful Day”, 10 conversions.
  • Thu: 1900 displays for “Happy Monday”, 100 conversions. 100 displays for “Beautiful Day”, 10 conversions.

Even though “Happy Monday” is inferior (20% conversion rate on Mon and 5% rest of the week = 7.1% conversion rate), the bandit algorithm has almost converged to “Happy Monday”, so the samples shown “Beautiful Day” is very low. It takes a lot of data to correct this.

(Note: a/b/n tests have the same problem non-stationary data. That’s why youshould test for full weeks.)

Chris also mentioned that bandits shouldn’t be used for email blasts:

stucchioChris Stucchio:
“One very important note – email blasts are a fairly pooruse case for standard bandits. The problem is that with email, conversions can happen long after a display occurs – you might send out thousands of emails before you see the first conversion. This violates the assumption underlying most bandit algorithms.”



“In general Bandit Based optimization can produce far superior results to regular A/B testing, but it also highlights organizational problems more. You are handing over all decision making to a system. A system is only as strong as its weakest points and the weakest points are going to be the biases that dictate the inputs to the system and the inability to understand or hand over all decision making to the system. If your organization can handle this then it is a great move, but if it can’t then you are more likely to cause more problems then they are worse. Like any good tool, you use it for the situations where it can provide the most value, and not in ones where it doesn’t. Both techniques have their place and over reliance on any one leads to massive limits in the outcome generated to your organization.”

As mentioned above, the situations where bandit testing seems to flourish are:

  • Headlines and Short-Term Campaigns
  • Automation for Scale
  • Targeting
  • Blending Optimization with Attribution
No comments yet.

Leave a Reply