A/B Split Testing or Avoiding The Hippo

Google the term “Highest Paid Person’s Opinion.”  You’ll find there are lots of blog posts on it already.  They’re all very high level, and I’m a detail-oriented person, so this will be different.  If you want to get some background on it, I’ve added an additional reading section.  I’ll go over the high level comparatively briefly and dive right into technical details.

Here’s why you do it

Highest Paid Person’s Opinion, or Hippo, is how decisions usually get made.  Everyone’s vulnerable to this.  There’s uncertainty, and the expensive people have the positions of power, so we defer to them whether we want to or not.  But when you’re building software, you’re making things nobody’s ever seen before.  How do you know what people will like?  Unless your team employs God or Miyamoto, you can’t see into the future.

You can’t see into the future, but that doesn’t mean you should throw away your ability to see into the present.

It’s basic scientific method – you need to verify against a control group.  You are not your customers, and you need an objective way of knowing what they want. Build your Hippo’s new feature, as rough as you like, and roll it out to some customers at random. Then measure how their behavior differs. That’s the only way you’ll discover counterintuitive customer preferences.

In the reddit comments on Timothy’s second continuous integration article, many of the posters bemoan the fact that he works at IMVU (yes, I work at IMVU too):

“This kind of thing is what makes me really depressed about writing software sometimes. I read this guy’s blog, discover a thought-provoking approach I’d never seriously considered before, and then go to see the (surely awesome) website/software he mentioned that they develop using this process. A few clicks later and… it’s some 3D avatar chat bullshit? I want to cry.” -plain-simple-garak

“why do they need to role out 50 code changes a day for such a shitty website, what exactly are they doing?” -joe90210

Well, IMVU isn’t for programmers on reddit. Nor is it for ycombinator intelligentsia. Internet nerds are not our target audience. Here’s the thing, though – we’re internet nerds too, and we’re on reddit and ycombinator. What we’d want isn’t always what our customers want. It’s split testing that teaches us what to build for our customers. It gives us proof that people like our stuff, and it has significantly impacted the direction of the company. We are able to delight an audience whose tastes are not our own.

That’s nice, but how does it work?

Say your codebase is one function, do_the_old_thing(), and you’ve just implemented the fancy new feature do_the_new_thing().  You do this:


$branches = array('control', 'hypothesis');
$probabilities = array(0.5, 0.5);
$experiment_id = set_up_experiment($branches, $probabilities);

if (customer_gets_behavior($experiment_id, $customer_id) == 'hypothesis') {
    do_the_new_thing();
} else {
    do_the_old_thing();
}

(Disclaimer: this is all on-the-spot php, not IMVU’s actual code. There might be giant bugs.)
Obvious, right?  The magic all happens behind the scenes in customer_gets_behavior().  It has to be random but also idempotent.  Once a customer has been randomly selected to get either the control or hypothesis behavior – in a manner transparent to them – you have to keep giving them that same behavior for the duration of the experiment.  That means you’ll have to tag each customer with the branch of the experiment they’re in. Here’s how it might look:


function customer_gets_behavior($experiment_id, $customer_id) {
    // Sticky: if this customer was already assigned, keep that branch.
    if (in_a_branch($customer_id, $experiment_id)) {
        return get_branch($customer_id, $experiment_id);
    }
    $branches = get_branches($experiment_id);
    $probabilities = get_probabilities($experiment_id);
    $draw = mt_rand() / mt_getrandmax();  // uniform float in [0, 1]
    $cum_prob = 0;
    foreach ($branches as $i => $branch) {
        $cum_prob += $probabilities[$i];
        if ($draw < $cum_prob) {
            set_branch($customer_id, $experiment_id, $branch);
            return $branch;
        }
    }
}

set_up_experiment should also be idempotent, and it needs to record the timestamp of when it was first called.  Then you push this code live and wait for statistical significance!  You’ll need a way to track the customer behaviors you expect to be affected by the different code behavior, so the customer-tagging will come in handy.
You may wonder, “Why can’t I just push my code live to everyone and see how everything changes before and after the push?” Well, setting aside all other answers to that question, there’s a flaw in the mindset behind it: you’re assuming you’re only going to be changing one thing at a time. If you’re agile, that’s never the case.

Everything has gotchas

You may think it would be cool to use your experiment system for rollout.  The tempting use case is pushing an experiment live and then slowly increasing the probability in the hypothesis branch.  After all, you don’t want to push an experimental feature live to a huge set of customers if it might hurt your revenue, right?  You have to fight this urge.  It will bias your results.  Consider an extreme case:  you push your code live with 90% control and 10% hypothesis and leave it that way for a month.  It doesn’t destroy your business, so you switch it to 10% control and 90% hypothesis and leave it for another month.  Then you review your results.  They say the hypothesis is a big winner!  Do you see the problem?

Here’s a hint: remember, customer_gets_behavior() is idempotent.

The problem is that you’ve introduced another variable into the experiment, corrupting the results. During that first month, up to 90% of your existing customers get permanently assigned to the control branch. Then, when you switch to 90% hypothesis, only people who have never used that area of your product before will be selected into the hypothesis branch with 90% probability. That means you’ve heavily biased the hypothesis branch towards new users and the control branch to existing ones. Here’s the moral of that story:
Every time you change probabilities for an active experiment, you bias the outcome.
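If the bias isn’t obvious, a toy simulation makes it visible (Python for brevity; the customer counts are made up):

```python
import random

random.seed(0)

assignments = {}  # customer_id -> (branch, is_new_customer)

def assign(customer_id, p_hypothesis, is_new):
    # Sticky assignment: once in a branch, always in that branch.
    if customer_id not in assignments:
        branch = 'hypothesis' if random.random() < p_hypothesis else 'control'
        assignments[customer_id] = (branch, is_new)
    return assignments[customer_id]

# Month 1: 1000 existing customers hit the feature at 10% hypothesis.
for cid in range(1000):
    assign(cid, 0.10, is_new=False)

# Month 2: you flip to 90% hypothesis. Existing customers keep their
# old branch; only the 1000 brand-new customers see the new odds.
for cid in range(1000, 2000):
    assign(cid, 0.90, is_new=True)

def new_customer_share(branch):
    members = [is_new for (b, is_new) in assignments.values() if b == branch]
    return sum(members) / len(members)

print(new_customer_share('hypothesis'))  # roughly 0.9: mostly new customers
print(new_customer_share('control'))     # roughly 0.1: mostly existing ones
```

Whatever difference you measure between the branches is now hopelessly confounded with the difference between new and existing customers.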
The solution is to couple the experiment with rollout functionality:


$branches = array('control', 'hypothesis');
$probabilities = array(0.5, 0.5);
$initial_rollout_percentage = 10;
$experiment_id = set_up_experiment(
    $branches,
    $probabilities,
    $initial_rollout_percentage);

if (customer_gets_behavior($experiment_id, $customer_id) == 'hypothesis') {
    do_the_new_thing();
} else {
    do_the_old_thing();
}

Now, customer_gets_behavior() looks like this:

function customer_gets_behavior($experiment_id, $customer_id) {
    if (in_a_branch($customer_id, $experiment_id)) {
        return get_branch($customer_id, $experiment_id);
    }
    $branches = get_branches($experiment_id);

    // Customers outside the rollout get the control behavior but are
    // NOT tagged, so they stay eligible for selection as the rollout
    // percentage grows. Probabilities inside the rollout never change.
    $rollout_percent = get_rollout($experiment_id);
    if ($customer_id % 100 >= $rollout_percent) {
        return $branches[0];
    }

    $probabilities = get_probabilities($experiment_id);
    $draw = mt_rand() / mt_getrandmax();  // uniform float in [0, 1]
    $cum_prob = 0;
    foreach ($branches as $i => $branch) {
        $cum_prob += $probabilities[$i];
        if ($draw < $cum_prob) {
            set_branch($customer_id, $experiment_id, $branch);
            return $branch;
        }
    }
}

 

This kind of rollout code has a stealth side benefit, which is that it lets you overcome randomness in unit and functional tests. Since you’re a good programmer, you write tests, of course! Well, what happens when there’s an active experiment in the codebase that the test you’re writing doesn’t care about? If your test selects customers into an experiment several levels deep in a way that’s hidden to you, there’s randomness in your test. I don’t think I have to explain why that’s bad. Fortunately, with rollout rolled into experiments, there’s a simple solution. You just set all rollout percentages to 0 at the beginning of each test.
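Sketched in Python (the store and method names are hypothetical), the test hook is just a loop over the experiment records:

```python
# Hypothetical in-memory experiment store, mirroring the
# get_rollout()/set_up_experiment() calls in the PHP above.
class ExperimentStore:
    def __init__(self):
        self.rollout_percent = {}  # experiment_id -> 0..100

    def create(self, experiment_id, rollout_percent):
        self.rollout_percent[experiment_id] = rollout_percent

    def zero_all_rollouts(self):
        # Run this in your test framework's setUp: with every rollout
        # at 0, customer_gets_behavior() always falls through to the
        # control branch, so tests that don't care about an experiment
        # behave deterministically.
        for experiment_id in self.rollout_percent:
            self.rollout_percent[experiment_id] = 0
```

A test that does care about a hypothesis branch can then opt in explicitly by setting that one experiment’s rollout back up.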

Now, as promised, here is some additional reading, by other people from IMVU who have seen up close what’s possible.


Timothy Fitz on continuous integration:

 

http://timothyfitz.wordpress.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/

http://timothyfitz.wordpress.com/2009/02/08/continuous-deployment/

Eric Ries on A/B testing:

http://startuplessonslearned.blogspot.com/search/label/split-test

