A very common scenario: a business runs tens and tens of A/B tests over the course of a year, and many of them “win”. Some tests get you 25% uplift in revenue, or even higher. Yet – when you roll out the change, the revenue does not increase 25%. And 12 months after running all those tests, the conversion rate is still pretty much the same. How come?
The answer is this: your uplifts were imaginary. There was no uplift to begin with. Yes, your testing tool said you had a 95% statistical significance level, or higher. Well, that doesn’t mean much. Statistical significance and validity are not the same.
Statistical Significance Is Not A Stopping Rule
When your testing tool says that you’ve reached a 95% or even 99% confidence level, that doesn’t mean that you have a winning variation.
Here’s an example I’ve used before. Two days after starting a test these were the results:
Is this a statistically significant result? Yes, it is. Punch the same numbers into any statistical significance calculator, and it will say the same. Here are the results using this awesome significance calculator:
So a 100% significant test, and 800+ percent uplift (or rather, Control is over 800% better than the treatment). Let’s end the test, shall we – Control wins!? Or how about we give it some more time instead.
This is what it looked like 10 days later:
That’s right, the variation that had 0% chance of beating control was now winning with 95% confidence. What’s up with that? How come “100% significance” and “0% chance of winning” became meaningless? Because they are.
If you end the test early, there’s always a great chance that you will pick the wrong winner. In this scenario many (most?) businesses still go ahead and implement the change (roll out the winning variation to 100% of traffic), while in fact the 800% lift becomes zero, or even negative (losing).
Even worse than the imaginary lift that you got, is the false confidence that you now have. You think you learned something, and go on applying that learning elsewhere on the site. But the learning is actually invalid, thus rendering all your efforts and time a complete waste.
It’s the same with the second test screenshot (10 days in) – even though it says 95% significance, it’s still not “cooked”. The sample is too small: the absolute difference in conversions is just 19 transactions. That can change in a day.
You should know that stopping a test once it’s significant is deadly sin number 1 in A/B-testing land. 77% of the A/A-tests (same page against same page) will reach significance at a certain point.
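To see why peeking is so dangerous, here’s a rough simulation sketch in Python. It runs A/A tests (both arms share the same true conversion rate), checks a two-proportion z-test once a day, and counts how many tests look “significant” at least once. The 5% conversion rate, 200 visitors per day per arm and 28 days are made-up assumptions, not numbers from any real test, so don’t expect it to reproduce the 77% figure; that depends on how often and for how long you peek.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no conversions at all (or all converted): nothing to compare
    return (conv_b / n_b - conv_a / n_a) / se


def aa_test_ever_significant(rate, visitors_per_day, days, rng, z_crit=1.96):
    """Simulate one A/A test and peek once per day.

    Returns True if any daily peek shows |z| >= z_crit (95%, two-sided).
    """
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(days):
        for _ in range(visitors_per_day):
            # Both arms convert at the same true rate: there is no real winner.
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
        n_a += visitors_per_day
        n_b += visitors_per_day
        if abs(z_score(conv_a, n_a, conv_b, n_b)) >= z_crit:
            return True
    return False


def false_positive_share(runs=200, rate=0.05, visitors_per_day=200, days=28, seed=1):
    """Share of simulated A/A tests that looked significant at some daily peek.

    All defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    hits = sum(
        aa_test_ever_significant(rate, visitors_per_day, days, rng)
        for _ in range(runs)
    )
    return hits / runs
```

Call `false_positive_share()` and compare the result to 0.05. With daily peeking, the share of A/A tests that cross the 95% line at some point should typically come out well above the 5% you’d expect from a single look at the end.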
Learn What Significance Really Is
Statistical significance is not a stopping rule. That alone should not determine whether you end a test or not.
Statistical significance does not tell us the probability that B is better than A. Nor is it telling us the probability that we will make a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false. To learn what the p-values are really about, read this post.
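For reference, here is roughly what a simple significance calculator does under the hood. This is a minimal sketch, assuming a pooled two-proportion z-test; your testing tool may well use a different method. The p-value it returns answers one narrow question: if A and B truly converted at the same rate, how often would a difference at least this big show up by chance? It says nothing about how likely B is to be better. The function name and the numbers in the usage lines are made up for illustration.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test.

    conv_*: conversions per variation, n_*: visitors per variation.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 visitors per arm, 100 vs 150 conversions.
    p = two_proportion_p_value(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"p-value: {p:.4f}")
```

Many calculators display something like 1 minus this p-value as the “significance” or “confidence” percentage, which is exactly why it is so easy to misread as “chance that B wins”.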
Run Your Tests Longer
If you stop your tests after a few days in, you’re doing it wrong. It doesn’t matter if you get 10,000 transactions per day – absolute number of transactions matters, but you also need pure time.
One of the difficulties with running tests online is that we are not in control of our user cohorts. This can be an issue if the users distribute differently by time and day of week, and even by season. Because of this, we probably want to make sure that we collect our data over any relevant data cycles. That way our treatments are exposed to a more representative sample of the average user population.
Notice that segmentation doesn’t really get us out of this, since we will still need to sample over weekdays, weekends, etc., and we probably want to hit each day or day part a couple of times to average out any external events that could be affecting traffic flow or conversion. That is how we get good estimates of the impact of time-based features and segments on conversion.
I see the following scenario all the time:
- First couple of days: B is winning big. Typically due to the novelty factor.
- After week #1: B winning strong.
- After week #2: B still winning, but the relative difference is smaller.
- After week #4: regression to the mean – the uplift has disappeared.
So if you stopped the test before 4 weeks (maybe even after a few days), you think you have a winning variation on your hands, but you don’t. If you rolled it out live, you have what I call an “imaginary lift”. You think you have a lift because your testing tool showed +25% growth, but you don’t see growth in your bank account.
Run your tests longer. Make sure they include two business cycles, have enough absolute conversions / transactions and have had enough duration time wise.
Example: imaginary lift
Here’s a test that we ran for an eCommerce client. Test duration was 35 days, it targeted desktop visitors only, and had close to 3000 transactions per variation.
Spoiler: the test ended with “no difference”. Here’s the Optimizely overview for revenue (click to enlarge):
Let’s see now:
- First couple of days, blue (variation #3) is winning big – like $16 per visitor vs $12.5 for Control. #Winning! Many people end the test here. (Fail).
- After 7 days: blue still winning – and the relative difference is big
- After 14 days: orange (#4) is winning!
- After 21 days: orange still winning!
- End: no difference
So – had you run the test for less than 4 weeks, you would have called the winner wrong.
The Stopping Rule
So when is a test cooked?
Alas, there is no universal heavenly answer out there, and there are a lot of “depends” factors. That being said, you can have some pretty good stopping rules that will get you to the right path in most cases.
Here’s my stopping rule:
- Test duration: at least 3 weeks (better if 4)
- Minimum pre-calculated sample size reached (using different tools). I will not believe any test that has less than 250-400 conversions per variation.
- Statistical significance at least 95%
This might be different for some tests because of whatever peculiarities, but in most cases I adhere to this.
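If you want to make the rule hard to cheat on, you can write it down as a checklist in code. Here’s a minimal sketch in Python. The class and function names are hypothetical, the thresholds are the ones from the list above (3 weeks, 250 conversions as the lower bound, 95%), and the significance value is assumed to be whatever your testing tool reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentStatus:
    """Snapshot of a running test (all field names are illustrative)."""
    days_running: int
    visitors_per_variation: List[int]
    required_visitors_per_variation: int  # from your pre-test sample size calculation
    conversions_per_variation: List[int]
    significance: float  # as reported by the testing tool, e.g. 0.96


def unmet_stopping_conditions(status, min_days=21, min_conversions=250,
                              min_significance=0.95):
    """Return the list of stopping-rule conditions that are NOT met yet.

    An empty list means the test satisfies all three conditions.
    """
    problems = []
    if status.days_running < min_days:
        problems.append(f"only {status.days_running} days, need {min_days}")
    if min(status.visitors_per_variation) < status.required_visitors_per_variation:
        problems.append("pre-calculated sample size not reached")
    if min(status.conversions_per_variation) < min_conversions:
        problems.append(f"fewer than {min_conversions} conversions in some variation")
    if status.significance < min_significance:
        problems.append(f"significance below {min_significance:.0%}")
    return problems


if __name__ == "__main__":
    # Hypothetical "day 2" snapshot: significant, but nowhere near cooked.
    early = ExperimentStatus(2, [900, 1000], 20000, [12, 110], 1.0)
    for problem in unmet_stopping_conditions(early):
        print("Not cooked:", problem)
```

Note that significance is just one of the checks here, and it is deliberately the last one. A test that passes it while failing the others keeps running.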
Here’s Ton Wesseling chiming in again:
You want to test as long as possible – at least 1 purchase cycle – the more data, the higher the Statistical Power of your test! More traffic means you have a higher chance of recognizing your winner on the significance level you’re testing on! Because … small changes can make a big impact, but big impacts don’t happen too often – most of the time, your variation is slightly better – so you need much data to be able to notice a significant winner.
BUT – if your test lasts and lasts, people tend to delete their cookies, 10% in 2 weeks… when they return in your test, they can end up in the wrong variation – so, when the weeks pass, your samples pollute more and more… and will end up having the same conversion rates. Test for a maximum of 4 weeks.
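To put a rough number on the statistical power point, here’s a sketch of the textbook sample size approximation for comparing two conversion rates. It assumes a two-sided test at 95% significance with 80% power; those defaults, the function name and the example rates are my assumptions for illustration, and dedicated sample size calculators may give somewhat different numbers.

```python
import math
from statistics import NormalDist


def visitors_needed_per_variation(baseline_rate, relative_lift,
                                  alpha=0.05, power=0.80):
    """Approximate visitors per variation needed to detect a given lift.

    baseline_rate: control conversion rate, e.g. 0.03 for 3%
    relative_lift: smallest lift worth detecting, e.g. 0.10 for +10%
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical 3% baseline: compare a small lift with a big one.
    for lift in (0.05, 0.10, 0.25):
        n = visitors_needed_per_variation(0.03, lift)
        print(f"+{lift:.0%} lift: about {n} visitors per variation")
```

The required sample grows with the inverse square of the difference you want to detect, so halving the lift roughly quadruples the traffic you need. That is the math behind “you need much data to notice a slightly better variation”.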
What if after 3 or 4 weeks the sample size is less than 400 conversions per variation?
I will let the test run longer. If by 4 weeks time the sample size is not achieved, I will add another week.
Always test full weeks at a time. So if you start the test on a Monday, it should end on a Sunday. If you don’t test a full week at a time, you might be skewing your results. Run a conversions per day of the week report on your site, see how much fluctuation there is. Here’s an example:
What do you see here? Thursdays make 2x more money than Saturdays and Sundays, and the conversion rate on Thursdays is almost 2x better than on a Saturday.
If we didn’t test for full weeks, the results would be inaccurate. So this is what you must always do: test full weeks at a time.
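If your analytics tool doesn’t have a day-of-week report handy, here’s a small sketch of both ideas in Python: a conversion rate per weekday from daily totals, and an end date that always lands on a full week. The function names and the input format (one tuple of date, visitors and conversions per day) are assumptions for illustration; adapt them to whatever your export looks like.

```python
import math
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def conversion_rate_by_weekday(daily_rows):
    """daily_rows: iterable of (date, visitors, conversions) tuples."""
    totals = defaultdict(lambda: [0, 0])  # weekday index -> [visitors, conversions]
    for day, visitors, conversions in daily_rows:
        totals[day.weekday()][0] += visitors
        totals[day.weekday()][1] += conversions
    return {
        WEEKDAYS[index]: conversions / visitors
        for index, (visitors, conversions) in sorted(totals.items())
        if visitors > 0
    }


def full_week_end_date(start, min_days):
    """Last day of a test that starts on `start`, runs at least `min_days`
    and covers a whole number of weeks (a Monday start ends on a Sunday)."""
    weeks = math.ceil(min_days / 7)
    return start + timedelta(days=7 * weeks - 1)


if __name__ == "__main__":
    # Hypothetical daily totals, not data from the report above.
    rows = [
        (date(2024, 1, 4), 1000, 40),   # a Thursday
        (date(2024, 1, 6), 1000, 20),   # a Saturday
        (date(2024, 1, 11), 1000, 40),  # the next Thursday
    ]
    print(conversion_rate_by_weekday(rows))
    print(full_week_end_date(date(2024, 1, 1), min_days=21))
```

If the per-weekday rates differ as much as in the report above, a test that covers one Thursday more than Saturdays (or the other way round) is measuring the calendar, not your variation.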
Keep Segments In Mind: The Same Stopping Rule Applies For Each Segment
Segmenting is key to learning from A/B tests. It’s very common that even though B loses to A in the overall results, B beats A in certain segments (e.g. Facebook traffic, mobile device users etc).
Before you can analyze any segmented data, you need to make sure that you have enough sample size within the segment itself too. So 250-400 conversions PER variation within that one segment you’re looking at.
I even recommend that you create targeted tests (set target audience / segment in the test configuration) instead of analyzing the results across segments after the test. This helps you make sure that tests are not called early, and each segment has adequate sample size.
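If you do analyze segments after the fact, it helps to filter out the ones that are too small before you even look at their conversion rates. A minimal sketch, assuming you can export conversions per variation for each segment; the function name, the data layout and the example numbers are hypothetical, and 250 is the lower bound mentioned above.

```python
def split_segments_by_sample_size(segment_conversions, min_conversions=250):
    """Split segments into those worth analyzing and those that are too small.

    segment_conversions: {segment name: {variation name: conversions}}
    A segment qualifies only if EVERY variation reaches min_conversions.
    """
    analyzable, too_small = [], []
    for segment, by_variation in segment_conversions.items():
        if by_variation and min(by_variation.values()) >= min_conversions:
            analyzable.append(segment)
        else:
            too_small.append(segment)
    return analyzable, too_small


if __name__ == "__main__":
    # Hypothetical export from a finished test.
    segments = {
        "desktop": {"control": 1200, "variation": 1260},
        "mobile": {"control": 85, "variation": 97},
    }
    ok, skip = split_segments_by_sample_size(segments)
    print("Analyze:", ok)
    print("Not enough data:", skip)
```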
I always tell people that you need a representative sample if your data is to be valid. What does ‘representative’ mean? First of all you need to include all the weekdays and weekends. You need different weather, because it impacts buyer behavior. But most important: Your traffic needs to have all traffic sources, especially newsletter, special campaigns, TV,… everything! The longer the test runs, the more insights you get.
We just ran a test for a big fashion retailer in the middle of the summer sale phase. It was very interesting to see how the results dramatically dropped during the “hard sale phase” with 70% and more – but it recovered 1 week after the phase ended. We would never have learned this, if the test hadn’t run for nearly 4 weeks.
Our “rule of thumb” is this: 3000-4000 conversions per variation and 3-4 week test duration. That is enough traffic so we can even talk about valid data if we drill down into segments.
→ Testing “sin” no 1: searching for uplifts within segments although you have no statistical validity – e.g. 85 vs 97 conversions – that’s bullshit.
Learning from tests is super important – even more so than getting wins. And segmenting your test data is one of the better ways to learn. Just make sure your segments have enough data before you jump to conclusions.
Just because your test hits 95% significance level or higher, don’t stop the test. Pay attention to the absolute number of conversions per variation and test duration as well.