Sample Pollution: The A/B Testing Problem No One is Talking About

Here’s an uncomfortable truth about conversion rate optimization: lots of people are running bad tests without even knowing it. They’re making decisions based on false positives, they’re acting on inconsistent data, they’re avoiding the issue of sample pollution – I could go on.

“If you don’t know about sample pollution, stop testing.”

The first step to running better tests is to really understand sampling. Is your sample large enough? Is it representative? Is it bias? If you don’t know how to sample correctly and preserve the quality / validity of your data, you’re wasting your time.

Sampling 101

What Is Sample Size?

Well, it’s exactly what it sounds like. It’s how many visitors or conversions you need in your test. It is incredibly important to calculate your sample size upfront. You don’t stop when you have 95% confidence, you stop when you have enough traffic / conversions for valid results.

If you don’t calculate your sample size before you run your test, you’ll run into bad data, likely without even realizing it.

So, how do you calculate your needed sample size? Just use Evan Miller’s sample size calculator (one tail tests only), Optimizely’s sample size calculator (one tail or two tail tests) or other similar ones.

Optimizely's Sample Size Calculator

All you have to do is plug in the current conversion rate of the page you want to run the test on, minimum detectable effect (how big of an uplift you want to be able to detect), significance and power levels (usually defaulted to 95% and 80% respectively).

Who Should Be In Your Sample?

Calculators can do the math and tell you how many people to sample, but they can’t tell you who to sample. Take a second to think about the different types of traffic that flow into your site every week and how their unique traits will impact your test results.

1. Weekday vs. Weekend Traffic

Does your Wednesday traffic behave the same way your Saturday traffic does? Chances are, the day of the week and time of day have an impact on the type of people visiting your site. The mindset people have on a Friday night is going to be different from a Monday morning. You want your test to include all of those people.

2. Source of Traffic

Different traffic sources engage with your site in different ways. Open Google Analytics, select “Acquisition” and then “Overview”. Note how traffic from the sources behave and convert in different ways.

Acquisition Overview in Google Analytics

Will a visitor from Google search behave the same way a visitor from a Twitter ad will? Traffic sources are not created equal, so ensure that your sample includes traffic from all of the sources or ensure your test results from a single source are not applied to all sources.

3. Returning vs. New Traffic

Will a first time visitor behave the same way a third time visitor will behave? What about a 22nd time visitor? 100th time?

Returning visitors to an e-commerce site stay for an average of 3 minutes longer than new visitors. New visitors only view an average of 3.88 pages per visit while returning visitors view an average of 5.55 pages per visit.

Representative Samples

The key here is to have a representative sample. If you don’t account for the different types of traffic that will find their way into your sample, you’ll notice regression to the mean.

For example, let’s say you run a simple A/B test from Tuesday to Thursday. On Wednesday, your Director of Marketing switched on a Facebook ad campaign. According to your Director of Analytics, you see 2x more traffic on Saturdays. In the end, you see a lift in B, so you push that variant live.

By the following Tuesday, you’re no longer seeing that lift… it’s regressed back to the mean because the Tuesday to Thursday sample wasn’t representative.

Even weeks might be different.

Takeaway: To ensure you have a representative sample in your test, run it for at least 2 business cycles. That will be 2 to 4 weeks for most businesses.

What Is Sample Bias?

Sample bias is alive and well. Marketers can and do get attached to their hypotheses. Time and effort goes into collecting data, analyzing it and coming up with a hypothesis. It’s easy to become emotionally invested in the outcome. You want your hypothesis to be correct.

In some cases, marketers will stop tests early or incorrectly analyze results to achieve their desired outcome. Often, they are not consciously aware that they are tampering with the results. That’s why experts practice double blind experiments.

After designing the test, the test designer asks someone else on the team to change the variant names and numbers so that he too is blind to the results. He can’t favor his hypothesis because he no longer knows which is the control.

After the data comes in and has been analyzed, he can declare a winner. Only then does he find out if his hypothesis was correct, thus eliminating his natural bias.

What’s Sample Pollution?

Sample pollution is any uncontrolled external factor that influences your tests and results in invalid data. That’s a pretty broad definition, primarily because there are many different types of pollution. Unfortunately, many marketers aren’t even aware pollution exists, which means they’re unknowingly making business decisions based on invalid test results.

To identify whether or not your tests are polluted, run through the four most common types of sample pollution.

1. Length Pollution

As you know, time is a major factor when it comes to testing. “How long should I let my test run?” is one of the most common A/B testing questions I receive.

The truth is that there is no universal test length.

Working with a high traffic website, one week of consistent data is a representative sample. For most people, one week of data would be a convenient sample, not a representative sample. In other words, most people will need to run their tests for much longer than a week.

So while there are no absolutes, two full business cycles is standard. Therefore, most people shouldn’t be testing for less than two weeks. Four weeks is even better for the sake of validity, especially for more complicated and expensive products.

Length pollution comes into play when you stop a test too early or let a test run too long. So, if you stop a test when you reach significance instead of when you’ve reached the full sample (e.g. 400,000 visitors), you have length pollution.

On the other hand, the longer a test runs, the more likely it is that external factors (e.g. holidays, technical issues, campaign changes, etc.) will impact your test, producing invalid results.

How to Limit It:

  • Consistent data is extremely important. If your sample includes anomalies (e.g. holidays, seasonal exceptions, etc.), you are acting on inconsistent behavior. Your winning variant likely won’t produce the same results next month. Look for spikes and pitfalls in your data to avoid bias and ensure a representative data sampling.
  • Run tests for as long as you need to. Once you have collected data from a mathematically significant sample size, call it. The longer your test runs beyond that, the higher the risk of pollution (i.e. changes in context).
  • Avoid running A/B tests during major holidays (e.g. Christmas) that might skew the results. It’s better to run holiday-specific test campaigns during those times (and most likely bandit tests).

2. Device Pollution

80% of Internet users own a smartphone, but that shouldn’t be news to you. This might be: 91% of people use their desktop / laptop to browse the Internet, 80% use their smartphone, 47% use their tablet, 37% use their gaming console and 34% use their smart TV. Smartwatches and wristbands are following closely behind.

Device Use Stats

The point is that we all own and use multiple devices to browse the Internet, which means some of your visitors are being included 2-3 times because they’ve switched devices.

For example, if Visitor A lands on your site via her iPhone 5S and then two days later lands on your site via her laptop, there’s a 50% chance she’s seeing a different variant than she did originally.

Also consider that many devices have multiple users. Home computers, shared iPads, shared gaming consoles – not all devices are 100% personal.

For example, let’s say Brian visits your site from his family’s home computer on Monday and doesn’t convert. On Wednesday, Brad, Brian’s younger brother, visits your site from the same home computer and does convert. Your testing platform has considered both of those people to be the same person.

How to Limit It:

3. Browser Pollution

Browser pollution is very similar to device pollution. While most of us have a preferred browser, we end up using others out of necessity. Chrome has the plugin you need to watch American Netflix, but Safari has iCloud integration. Again, we run into the issue of including visitors 2-3 times, simply because they’ve switched browsers.

While there is no credible data on how often people switch between browsers on any given day, it is often enough that apps have been developed to make it easier.

Finally, if you use Google Chrome (and over 60% of the population does), you know about incognito mode browsing (others browsers have a similar mode, but call it something else). If someone opens a tab in incognito mode and visits your site, existing cookies will not be used and new cookies will not be saved after the tab is closed.

So, if the visitor returns to your site at a later time, she is appearing as a new visitor and, again, has a 50% chance of seeing a different variant than she did originally.

How to Limit It:

  • Just as you can run tests separately for each type of device, you can run tests separately for each browser as well – but that’s a game for large-traffic websites. That said, due to the fact there is limited information available, it’s likely that fewer people are switching between browsers regularly than they are devices.
  • Once again, you can consider Google Analytics Universal and known IDs.

4. Cookie Pollution

At the root of all of this pollution is cookies. If you’re not familiar with a web cookie, it’s a small piece of data sent from a website and stored in a visitor’s web browser. It allows marketers to more accurately track their sample behavior.

Unfortunately, the data surrounding the rates at which cookies are deleted is done are shaky at best. Most reports are outdated (circa 2005) and, ironically, non-representative of the entire Internet population.

Lifehacker conducted a simple poll, asking readers how often they delete their browser history. Here are the results as of September 23rd, 2015:

Browser History Deletion

Of course, this is a relatively small sample of tech-savvy people. It’s highly unlikely that 20% of the entire Internet population deletes their history whenever they close their browser.

In 2012, Econsultancy released a report (based on 1,600 online respondents) that found that 73% of respondents regularly manage their cookie settings using their browser. When asked what they would say if a website asked for their permission to set cookies when they visit, only 23% responded yes (60% responded “maybe”).

40% of respondents think cookies are bad for the web.

Regardless of the exact numbers, the longer your test runs, the more likely it is that visitors will delete their cookies, polluting your sample. Just two weeks can make a difference!

How to Limit It:

  • Don’t leave tests running for longer than necessary. The longer the test runs, the more opportunities there are for pollution. When you calculate your needed sample size ahead of time, make sure you can run the test in 4 weeks or less – or the pollution might start affecting the credibility of results.
  • There are ways to track visitors without the use of cookies. However, it is less mainstream and much more complicated – stuff like browser fingerprinting (declared illegal in European Union).
  • People are beginning to develop cookie alternatives, like Evercookie. Evercookie is extremely persistent, storing the data in more places and recreating the data after it has been deleted.

How to Avoid Sample Pollution

1. Ensure Consistency of Data

Consistent data, which means your data is unaffected by coincidental anomalies. Essentially, your data is behaving the way it normally does without the interference of unusual spikes or pitfalls.

You cannot replace understanding patterns, looking at the data, and understanding its meaning. Nothing can replace the value of just eyeballing your data to make sure you are not getting spiked on a single day and that your data is consistent. This human level check gives you the context that helps correct against so many imperfections that just looking at the end numbers leaves you open for.”

In layman’s terms? Don’t run a test when your data is not behaving the way it normally does. Without consistency of data, your test results will not be applicable the majority of the time.

  • Many A/B testing platforms do not account for data consistency.
  • Open up your testing platform of choice and dive into the data yourself.
  • Look for spikes and other anomalies. Did you see an uncharacteristic surge in conversions on Tuesday?
  • If so, run the test again.

2. Understand Variance

Variance and standard deviation goes hand-in-hand with consistency. Essentially, they will tell you how far away from the average your numbers are. A low variance means your data is consistent with the average, which puts you at lower risk of pollution.

You can do the math manually yourself or just use a simple standard deviation calculator.

Variance is a major factor in eliminating sample pollution.

In classic testing you want to preselect an audience size, but that ignores the realities of self selection bias as well as the thousands of small ways population sampling can get messy (let alone tracking is messy). By accounting for variance and by paying attention to the pattern of data, you can mitigate that risk (though never get rid of it).”

Accounting for variance and consistency, instead of just significance, will help prevent and reduce your sample pollution.

Avoiding A/B Testing Risk

Sample pollution is simply unavoidable, no matter how long the test runs for. You can mitigate the risk, but never eliminate it. There is risk in every test you run, which means you have to be aware of the risk and analyze it before it begins.

The key is not to avoid the risk, but manage and balance it.


Running better tests starts with sample quality.

Is your sample large enough? Set your significance, power, and MDE and then check with a sample size calculator. Is it representative? Check your data personally for consistency and use a standard deviation calculator to account for variance. Is it bias? Consider running a double blind test so that favoritism doesn’t play a factor.

Before you run your next test, run through the following steps:

  1. Calculate your sample size using significance, power and MDE.
  2. Recognize the different types of traffic (e.g. weekday vs. weekday, source, new vs. returning) and their uniqueness.
  3. Eliminate your own personal bias as a tester.
  4. Run the test until you have reached your sample size, not until you have reached significance.
  5. Analyze your data personally to spot inconsistencies (and high variance) that might render the test invalid.

You’ll never fully eliminate sample pollution. Cookies will be deleted, multiple devices and browsers will be used, and environmental and technical factors will come into play. You can, however, minimize and isolate the pollution so that you’re acting on statistically significant (and relevant) data.

No comments yet.

Leave a Reply