Failing your way to success with A/B Testing

So I've read and carefully considered Douglas Bowman's explanation of his decision to leave Google, much of which was about Google's extensive use of A/B testing. Google engineers clearly valued function over form. He complains about a situation where 41 shades of blue were tested for optimum response (see page 3) and makes it clear that he felt constrained. No matter what he did, however simply he did it, he had to "prove" it worked. He simply wanted to build something beautiful and not worry about extreme micromanagement of every aspect.

I can understand that. Many times I want to just "make stuff happen", secure in the knowledge that I know what I'm doing. I look at how foolishly code is implemented and I think "what our customers really want is X!". And A/B testing shows me, over and over again, that I'm wrong. It's humbling. It's humiliating. And if you really care about being the most effective you can be, you'll love it, but only if your pride can handle being proved wrong on a regular basis.

The core of A/B testing is rather simple. You have your current behavior (often referred to as the control; the null hypothesis is that the variants perform no differently from it) and one or more "variants" of that behavior. In the context of a Web site, you randomly show every visitor one and only one variant of the site and ensure that if they return, they see the same variant. For example, if you sell shoes, does providing more than one picture of that shoe lead people to buy more, or does a slower download time kill the sale? If you can provide a matching belt to go with those shoes, does suggesting that lead to better sales, or does it distract customers from buying? Over time you can use fairly basic statistics to verify that one of those variants is (statistically) better than the others (like many things in this post, it's a gross oversimplification).
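To make the mechanics above a bit more concrete, here is a minimal Python sketch of the two pieces: giving each visitor one variant and keeping it stable on return visits, and then checking whether a difference in conversion is statistically meaningful. Everything named here (`assign_variant`, `two_proportion_z_test`, the "shoe-photos" experiment, the visitor ID and the visitor counts) is a made-up assumption for illustration, and the pooled two-proportion z-test is just one common choice of "fairly basic statistics", not the only one.

```python
import hashlib
import math
from typing import Sequence, Tuple


def assign_variant(visitor_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Map a visitor to exactly one variant, the same one on every visit.

    Hashing the (hypothetical) visitor ID together with the experiment name
    spreads visitors roughly evenly and needs no stored state.
    """
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates.

    Uses the pooled standard error, i.e. it assumes the null hypothesis
    that both groups share the same underlying conversion rate.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / se
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    variants = ["control", "extra_photos"]
    print(assign_variant("visitor-42", "shoe-photos", variants))

    # Illustrative numbers only: 200 of 10,000 control visitors bought,
    # versus 250 of 10,000 visitors who saw the variant.
    z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A real setup also has to decide the sample size in advance and avoid peeking at results early, which this sketch ignores.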

In my personal experience and in reading about others' successes and failures, I've discovered something amazing: most A/B tests perform the same as or worse than the original behavior. These aren't people making things up: these are people saying "I think this can improve our conversion". These are often experts in their field insisting that the two-column layout will make more money than the three-column layout. For most companies, whoever has the most important title or argues the most persuasively will win, a particular layout will be chosen, and they'll move on, completely ignorant that they could have improved their conversion rate 25% by using the other layout. In contrast, companies which use A/B testing will fail and fail and fail, until they have one test which is successful and makes them plenty of money, and then they throw away the rest and start over (another gross oversimplification). They keep accumulating these small successes (and occasionally huge ones) and these add up over time. Instead of guessing what works, they know what works. Instead of launching a bunch of features and hoping their customers like them, they know their customers like them (and don't forget that just because customers want something doesn't mean that the enormous expense involved is going to induce them to spend more money).

A/B testing slays egos. Like the developer who learns to stop "optimizing" their code and start benchmarking it, A/B (and multivariate — MVT) testing teaches you to stop saying "this is better" in favor of "let's find out what's better". Instead of benchmarking your code, you're benchmarking your customers.

It's no wonder that the market for A/B and MVT testing is exploding. Not only is it proving to be dramatically successful for many companies (read some of the case studies for one A/B testing solution provider), but for people who truly care about producing results rather than repeating dogma, A/B testing is where it's at. No longer do you worry about killing your business with a redesign; you gradually evolve that new design based on what people actually do rather than what they say they do or on your personal hunches.

Sadly, because we have an exploding market in this area, we also have snake oil salesmen (a little voice in the back of my head says "you can have snake oil saleswomen too!"). I'm finding online A/B testing calculators which get their basic math wrong ("maths", Dave. Are you happy now? :)). I'm finding "A/B testing consultant" web sites where they confuse standard deviation and standard error, an error which does not inspire confidence. I'm seeing people say "A/B testing is the only way to go" and others say "MVT is the one true path", but they don't mention the strengths and weaknesses of the different approaches. It's a big, scary and confusing world out there and for those of you who don't believe me, here's a picture of my wife:

[Image: "My queen". Caption: My wife with an orange thing on her head.]
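Back to the standard deviation versus standard error mix-up mentioned above: the distinction is easy to show in a few lines of Python. This is only a sketch under simple assumptions (each visitor either converts or doesn't, visitors are independent, and the 2% rate is invented for illustration); the function names `conversion_sd` and `conversion_se` are mine, not from any library.

```python
import math


def conversion_sd(rate: float) -> float:
    """Standard deviation of a single visitor's 0/1 outcome (Bernoulli).

    This describes how much individual visitors vary and does not
    shrink as you collect more data.
    """
    return math.sqrt(rate * (1 - rate))


def conversion_se(rate: float, n: int) -> float:
    """Standard error of the observed conversion rate over n visitors.

    This describes how uncertain the measured rate is and shrinks
    with the square root of the sample size.
    """
    return conversion_sd(rate) / math.sqrt(n)


if __name__ == "__main__":
    rate = 0.02  # assumed conversion rate, for illustration only
    for n in (100, 10_000, 1_000_000):
        print(f"n={n:>9,}  sd={conversion_sd(rate):.4f}  se={conversion_se(rate, n):.6f}")
```

Plug the standard deviation in where the standard error belongs and your confidence intervals come out far too wide; do the reverse and you'll "prove" all sorts of things that aren't true.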

A/B (and MVT) testing is not exactly Perl, but I suspect some developers are close enough to this area that you might be interested. If you are (or at least if people don't scream stop), I might write more about this later.

10 Comments

Yes but what we really really want to know is, did your wife A/B-test this orange thing on her head ?

Yes please - more like this.

How do you A/B test when there's no sale or consummation to measure?

How would Wikipedia or the BBC do A/B testing?

And do you have multiple blogs because you're A/B testing them? :P

What I am most afraid of in the kind of A/B testing I've seen is that it seems very prone to getting stuck in local minima. In that image, imagine you're in the place marked with the red arrow. Taking small steps and adjusting for the test result each time, you're never going to reach the optimal place (blue arrow) since once you're in the local minimum, all small steps you can take actually produce a worse result.

In practice, you can often sidestep this problem to some extent because a) you're not in 2D space but a much higher dimensionality and b) the space actually changes under your feet, so the A/B testing allows you to at least follow the local optimum. For non-discrete cases (i.e. not infinite dimensions with binary states like "red or blue menu bar?" but a limited set of things that have a wide range of settings), there are techniques that can help overcome this. I can't see them applying easily to website design and similar things, though.

ObGoogle: Arguably Google need to sell something (via ads) so that they can do A/B and make their search better. How else would they know it was working?

Andy: Basically, you "just" need to come up with a metric (or set of metrics) that you would like to optimize. Conceivably, the BBC, for example, would like to have more people using their public website. The number of unique visitors or average number of pages viewed in a session would be possible metrics that correlate in one way or another with the ulterior goal.

But then again, you'd rather not attract a ton of people who simply turn away in disgust after the first impression. At the same time, having people click through 25000 pages to get the information they need is hardly desirable (unless you make your money by showing ads). So in practice, picking the metrics and then analyzing the result of your tests will always be hard and require both thinking out of the box and solid statistics to get right.

How do you A/B-test your way out of a local optimum?

(I see Steffen brought up this concern already. I’m glad, because it is the only systemic problem I see in the approach – and bafflingly, one that most of its detractors seem to miss entirely, getting lost instead in various nit-picky details and obvious non-problems.)

A/B testing is best left for small optimisations, and other forms of user testing are best for large changes/creations. There's a gamut of different types of user testing, and web sites (and businesses) who want to be successful would be smart to use them all (or as many as reasonable). Also: test early, test often.

What's nice about A/B testing is, unlike some kinds of "tests" where people (users, customers) are asked what they LIKE, this shows more valuable information: what they DO. Behaviour.

But A/B testing won't tell you why something failed. You try a blue button instead of a red button, and behaviour changes. But why? Was one button harder to read? Harder to find? Did people think it meant something else? Other kinds of user testing complement A/B testing.

Then, there is the time variable. The initial response from your users may not be the same as the response you could get after months or years of usage.

For how long are you able to run A/B testing (or for that matter, any other statistical approach)?

What do you think A/B testing would say about "vi"?

Google might be spoiled by the sheer numbers they've got, so it might be easier for them to analyse by number. But a designer would get mad, of course, because you cannot design by numbers; there are general rules about what is good and what is not.

My analogy is film ratings:

You cannot tell, just by analysing the average and stddev of film critics' ratings of the best films in Cannes, which picture will win or which is the best film of the festival. All the critics knew beforehand that "Tree of Life" would be the best picture, but by looking at the numbers it was impossible to predict. "Tree of Life" was 3rd, by a large margin. See http://rurban.xarch.at/film/Cannes2011.txt, the ranking done in Perl.

1. You have to trust your experts.
2. Juries always pick the wrong.

@aristotle - if you use A/B testing with only small steps then you can be stuck with local minima - but you can always guess the global minimum and then A/B test it with other solutions. Since the opposite to A/B testing would be guessing anyway - I don't see how this is a limitation.

About Ovid

Have Perl; Will Travel. Freelance Perl/Testing/Agile consultant. Photo by http://www.circle23.com/. Warning: that site is not safe for work. The photographer is a good friend of mine, though, and it's appropriate to credit his work.