6 A/B & MVT Testing Pitfalls That Will Lead You Astray

When performed correctly, A/B testing holds a treasure chest of prospective benefits and profit-earning opportunities. The problem, however, is that A/B tests are frequently run or interpreted incorrectly. Many marketers fail to control the test conditions or exaggerate early successes, which leads to disappointing long-term performance and even detrimental site changes. Here are the top six pitfalls that plague marketers, and easy ways to avoid them.

1. Poor Randomization Techniques

The first step of any statistical test is to select a truly random control group against which you will measure success or failure. In A/B testing this usually means randomly dividing visitors into two groups, each of which sees a different page. Although this step may seem simple at first glance, it is both more complicated than it looks and critical to the test's outcome. Visitors can be segmented into hundreds of different groups based on a variety of categories, and depending on the test you are running it may be important to include, exclude, or stratify certain segments. If your test is geared towards new visitors, you may want to exclude returning visitors; or if your visitors arrive from a variety of marketing campaigns or from organic search, you may want to randomize within each source so that the acquisition channel doesn't influence test results.

The best way to check for group bias is to first run an A/A test rather than an A/B test. An A/A test shows the same page to two different groups but analyzes them as though they were seeing different pages. If there is a significant difference in conversion or click-through rates, there may be a bias within your randomization function.
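As a minimal sketch of the idea, the snippet below uses a hypothetical hash-based bucketing function (a common approach, though your testing tool may do this differently) and simulates an A/A test where both groups see the same page with the same true conversion rate. A sound randomizer should produce nearly identical group sizes and conversion rates:

```python
import hashlib
import random

def assign_group(visitor_id: str, salt: str = "exp-1") -> str:
    # Deterministic bucketing: hashing the visitor ID means the same
    # visitor always lands in the same group on repeat visits.
    digest = hashlib.sha256(f"{salt}:{visitor_id}".encode()).hexdigest()
    return "A1" if int(digest, 16) % 2 == 0 else "A2"

# Simulated A/A test: both groups see the identical page with the same
# true 5% conversion rate, so a large gap between groups would point
# at a biased randomization function, not a real effect.
random.seed(42)
counts = {"A1": [0, 0], "A2": [0, 0]}  # group -> [visitors, conversions]
for i in range(20_000):
    group = assign_group(f"visitor-{i}")
    counts[group][0] += 1
    counts[group][1] += random.random() < 0.05

for group, (n, conversions) in counts.items():
    print(group, n, round(conversions / n, 4))
```

The salt lets you re-shuffle assignments between experiments, so a visitor who was in group A1 for one test isn't automatically in A1 for every later test.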

2. Measuring The Wrong Indicators

Once a test has been designed and is running, marketers frequently analyze changes in a single performance indicator. Although this effectively demonstrates how a change influences that indicator, it fails to reveal conflicting trends or more widespread changes. Rather than setting a single KPI such as the click-through rate of an action button or the conversion rate, analyzing a variety of indicators may reveal a more important trend. While conversion rate is almost always the top benchmark, looking at the bounce rate, the average time spent on the site, or revenue per visitor (RPV) will provide a more in-depth analysis. For example, a different font on an action button may increase the click-through rate, but if it simultaneously increases the bounce rate, the change may not be beneficial. Alternatively, if a change maintains the same number of conversions but increases RPV, it is still a successful change. In the end, knowing more about how visitors' actions differed may help answer the difficult marketing question of why.
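To make this concrete, here is a small illustrative sketch (the per-visitor records and field names are invented for the example, not from any particular analytics tool) that computes the several KPIs mentioned above side by side instead of a single one:

```python
# Hypothetical per-visitor records; field names are illustrative only.
visits = [
    {"clicked_cta": True,  "bounced": False, "seconds_on_site": 180, "revenue": 40.0},
    {"clicked_cta": True,  "bounced": True,  "seconds_on_site": 12,  "revenue": 0.0},
    {"clicked_cta": False, "bounced": False, "seconds_on_site": 95,  "revenue": 25.0},
    {"clicked_cta": False, "bounced": True,  "seconds_on_site": 8,   "revenue": 0.0},
]

n = len(visits)
kpis = {
    # Share of visitors who clicked the call-to-action button.
    "click_through_rate":  sum(v["clicked_cta"] for v in visits) / n,
    # Share of visitors who left without engaging.
    "bounce_rate":         sum(v["bounced"] for v in visits) / n,
    # Average seconds a visitor spent on the site.
    "avg_time_on_site":    sum(v["seconds_on_site"] for v in visits) / n,
    # Revenue per visitor (RPV), as discussed above.
    "revenue_per_visitor": sum(v["revenue"] for v in visits) / n,
}
print(kpis)
```

Computing all four per test variant makes the conflicting-trend case above visible: a variant can win on click-through rate while losing on bounce rate or RPV.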

3. Short Testing Periods and Small Sample Sizes

Limiting the time and size of your test can not only introduce bias but also undermine its statistical significance. Marketers frequently end their tests as soon as there appears to be a notable success or failure, without making sure the data is statistically significant. It is important to decide on a sample size and test length BEFORE running the A/B test. In most cases you should ensure that a certain conversion threshold is met and that you are only counting unique visitors; both of these concerns can play a major role in skewing results. Additionally, tests should run for at least two weeks in order to prevent bias based on the day of the week or the period of the year. Recognize seasonal or holiday bias; it may be worth re-running a test at another time to check the results. Finally, do NOT end a test early. It may seem like your new design has produced the desired effect, but you cannot be sure until the agreed-upon period of time has passed and the planned number of samples has been collected.

4. Forgetting Statistical Significance

In order to determine the size of your test sample, a power analysis should be used to verify that your data will be statistically significant. You can use this tool to find an optimal sample size.
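If you'd rather see what such a calculator does under the hood, here is a sketch of the standard textbook power-analysis formula for comparing two proportions (normal approximation, 5% significance, 80% power by default); it is a generic formula, not the method of any specific calculator:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    # Required visitors PER GROUP to detect a change from conversion
    # rate p1 to p2, using the two-proportion normal approximation.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # e.g. ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# e.g. detecting a lift from a 5% to a 6% conversion rate:
print(sample_size_per_group(0.05, 0.06))
```

Note how quickly the required sample grows as the effect you want to detect shrinks; this is exactly why small, short tests so often fail to reach significance.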

The results of an A/B test can be deceptively positive at times. Occasionally a test page will outperform the control conversion rate by 20 or 30 percent. While these results appear very promising, it is important to view them through a statistical lens with a skeptical eye. In order to determine whether the new page's success is actually the result of the implemented changes, a significance test should be performed; for conversion rates this is typically a two-proportion z-test. Without getting into the specifics, the test produces a z-score, which translates into a confidence level and indicates whether or not the observed change could simply be the result of randomness. You may use a significance calculator in order to find the confidence level, or a general explanation can be found here. If your data provides a confidence level above 95%, the result is significant, and it can then be assumed that there is a correlation between the test-page change and the increased conversion rate.
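For the curious, here is a minimal sketch of the pooled two-proportion z-test described above (the usual test behind A/B significance calculators), using only the standard library; the visitor counts are illustrative:

```python
import math
from statistics import NormalDist

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    # Pooled two-proportion z-test: returns the z-score and the
    # two-sided p-value for the difference in conversion rates.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative numbers: 500/10,000 conversions on control vs
# 600/10,000 on the variant (5% vs 6%).
z, p = two_proportion_ztest(500, 10_000, 600, 10_000)
print(round(z, 2), round(p, 4))
```

A p-value below 0.05 corresponds to the confidence level above 95% mentioned above; with these illustrative numbers the lift clears that bar, while a much smaller gap on the same traffic would not.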

5. Settling and Local Maximums

You keep trying new fonts, page layouts, image sizes, colors, themes, etc., but no longer see any change (if anything, your tests show negative returns). What do you do? Many people hit this phase and conclude that their site has reached its optimal state. In reality, this is most likely an example of a local maximum: within the design space in which you are currently operating you may have reached a peak, but there is still a lot of room for improvement elsewhere. In order to reach the next level, a more radical change is probably necessary. This doesn't mean scrapping your site and starting over; instead, try to look at it from a different perspective, analyze its core values, and think about how to increase customer engagement. A/B testing is a great way to see which new direction may be best. Remember to give tests long enough, because a slight dip in conversion can precede exponential growth.
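The local-maximum idea can be sketched numerically. Below is a toy illustration (the landscape and numbers are entirely made up): greedy, incremental tweaking climbs to the nearest peak and stops, while restarting from radically different points in the design space can find a higher one:

```python
import random

def f(x: float) -> float:
    # Toy "conversion rate" landscape with a local peak at x = 1
    # (height 3) and a higher global peak at x = 4 (height 5).
    return max(3 - (x - 1) ** 2, 5 - (x - 4) ** 2)

def hill_climb(x: float, step: float = 0.1, iters: int = 200) -> float:
    # Greedy incremental tweaking: move to whichever neighbor is
    # better, and stop when neither small change helps.
    for _ in range(iters):
        best = max((x - step, x, x + step), key=f)
        if best == x:
            break
        x = best
    return x

# Starting near the local peak, small tweaks get stuck at x ~ 1.
x_local = hill_climb(0.0)

# Radical restarts across the whole space can reach the global peak.
random.seed(0)
best_x = max((hill_climb(random.uniform(0, 6)) for _ in range(10)), key=f)

print(round(x_local, 2), round(best_x, 2))
```

The incremental climber settles at the lower peak; only trying starting points far outside its neighborhood, the analogue of a radical redesign, reaches the higher one.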

6. Correlation Does Not Equal Causation

A/B testing is a great statistical asset and source of information. Statistics, however, answers the question of what, NOT why. While A/B testing has the capability to compile, analyze, and summarize data, it does not by itself demonstrate causality. Using micro-KPIs can help identify false causality. For example, if you have just added a site-security verification sticker and see an immediate rise in the conversion rate, you would assume that the sticker caused the rise. It is possible, however, that mouse-hover data would show that almost none of the customers even looked at the sticker. This makes it difficult to determine whether there was actually causality or a flaw in the testing apparatus: the rise could be the result of some other minor change or the time period of the test, or it could be an example of poor randomization, in which case an A/A test should be performed. There are thousands of other examples of non-causal correlations. One of the most extreme is the correlation between the declining number of pirates and the rising average global temperature.

Clearly there is no causality here, but there is statistical correlation. Be conscious of these constraints and remember that A/B testing is an asset that cannot be applied to every problem.

Avoiding these pitfalls will not only save you time and money but also vastly improve the way that your company makes decisions. And remember – always be testing!
