Advanced TDD: Two Birds With One Stone

To read this article, you need to know what TDD is and what A/B experiments are. Some people call A/B experiments “split tests” or “A/B tests,” but I’ve avoided that terminology here because it would overload the word “test.” When I say “test,” I’m talking about an automated test, as in TDD. When I say “experiment,” I’m talking about an A/B test.

Also, I apologize if this post is pretty dense. It’s deep subject matter. I tried to balance verbosity with clarity, but I might not have gotten it right.

When you combine the agile practices of TDD, A/B experiments, and incremental rollout, two things will go wrong: one in your tests, and another in production.

Problem one is non-deterministic tests. When you have a live experiment in your sandbox, it puts randomness in your tests. Experiments work by randomly assigning people to branches, and this randomness affects your software’s behavior. Software that randomly does different things is intermittent software, which means it has intermittent tests. The tests that directly assert on the behavior of the experiment can be written so that fake customers are pre-assigned to branches of the experiment, but this won’t scale. An experiment at a low enough level of a large enough codebase will be touched by other tests, which then inherit its randomness.
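
To make problem one concrete, here’s a minimal sketch in Python. The Experiment class, the “button-copy” experiment, and get_button_label are all hypothetical names invented for illustration; a real experiment system would persist assignments somewhere, but sticky random assignment is the essential behavior:

    import random
    import unittest

    # Hypothetical experiment layer, purely for illustration.
    class Experiment:
        def __init__(self, name, branches):
            self.name = name
            self.branches = branches
            self.assignments = {}  # customer_id -> branch, sticky forever

        def branch_for(self, customer_id):
            if customer_id not in self.assignments:
                self.assignments[customer_id] = random.choice(self.branches)
            return self.assignments[customer_id]

    button_copy = Experiment("button-copy", ["control", "hypothesis"])

    def get_button_label(customer_id):
        # Any code path that touches the experiment inherits its randomness.
        if button_copy.branch_for(customer_id) == "hypothesis":
            return "Buy now!"
        return "Purchase"

    class CheckoutPageTest(unittest.TestCase):
        def test_renders_purchase_button(self):
            # Passes or fails depending on which branch the fake customer
            # was randomly assigned to: an intermittent test.
            self.assertEqual(get_button_label("fake-customer-1"), "Purchase")

Note that CheckoutPageTest knows nothing about the experiment; it just renders a page. That’s exactly the kind of test that inherits the randomness.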

Problem two occurs not in tests but in production. Strictly speaking, it’s caused by temptation, and can be avoided by other means. I mention it because the solution I’ll propose to problem one will also solve this problem.

If the probabilities with which your experiment assigns behavior to customers can be changed by administrators, this appears to provide a convenient roll out mechanism. You’d like to push the experiment live with 0% of users getting the experimental behavior, then over the course of a few weeks crank it up to 50% if it’s not causing any major problems. The problem with this is subtle. Recall that once a customer has been assigned to a branch of an experiment, he or she is stuck there forever.

That means that during the early stages of the experiment, when the hypothesis branch is set with very low probability, most of your existing customers get assigned to the control branch forever. Then, as you turn up the flow into the hypothesis branch, it gets biased towards new users. You can’t use experiments to incrementally roll out new features without biasing their results. But you do want to roll out incrementally. So how do you have both?
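
A quick simulation shows how strong the bias is. Again, everything here is hypothetical scaffolding; the point is the arithmetic of sticky assignment during a slow ramp-up:

    import random

    def simulate(existing_users=10_000, new_users=10_000):
        assignments = {}  # customer_id -> branch, sticky forever

        def assign(customer_id, hypothesis_probability):
            if customer_id not in assignments:
                assignments[customer_id] = (
                    "hypothesis" if random.random() < hypothesis_probability
                    else "control")
            return assignments[customer_id]

        # Early in the ramp-up: hypothesis probability is 1%, and every
        # existing customer hits the experiment and is locked in forever.
        for i in range(existing_users):
            assign(f"existing-{i}", 0.01)

        # Weeks later: probability is cranked up to 50%, but only customers
        # with no assignment yet (mostly new signups) feel the change.
        for i in range(new_users):
            assign(f"new-{i}", 0.50)

        hypothesis = [c for c, b in assignments.items() if b == "hypothesis"]
        new_share = sum(c.startswith("new-") for c in hypothesis) / len(hypothesis)
        print(f"share of hypothesis branch that is new users: {new_share:.0%}")

    simulate()  # typically prints about 98%

At a 1% starting probability, only about 100 of the 10,000 existing customers land in the hypothesis branch, so by the time the probability reaches 50%, roughly 98% of that branch is new users. Your one-variable experiment is now quietly comparing new users against everyone else.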

The solution to both problems is to glue the experiment system to a dedicated roll out system. For each experiment, have a separate, automatically generated roll out percentage. This percentage can be set either through your experiment API or by an admin with a configuration script. Have these roll out percentages start at 0% when an experiment is set up. When a customer is about to be selected into one of the branches of the experiment, and they’re not already in one of the branches, first check whether a hash of their customer ID mod 100 is less than the roll out percentage. If it’s not, skip experiment selection and give them the control behavior. Otherwise, perform the random selection as normal, and start giving that customer their randomly-selected behavior. Then, never touch roll out percentages programmatically in your code. Only allow tests and the admin script to do it.
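
Here’s a sketch of that glue, continuing the hypothetical Experiment class from earlier. The SHA-256 hash is just one reasonable way to get a stable per-customer number; the load-bearing detail is that customers outside the roll out get control behavior without being assigned:

    import hashlib
    import random

    # Roll out percentages live beside the experiments and start at 0.
    # Only the test harness and the admin script should ever change them.
    rollout_percentage = {}  # experiment name -> int in 0..100

    def set_rollout(experiment_name, percentage):
        rollout_percentage[experiment_name] = percentage

    def in_rollout(experiment_name, customer_id):
        # Hash the customer ID so each customer sits at a fixed point in
        # 0..99; since the percentage only ever rises, customers enter
        # the roll out and never drop back out.
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        return int(digest, 16) % 100 < rollout_percentage.get(experiment_name, 0)

    def branch_for(experiment, customer_id):
        # Already selected? The assignment is sticky, as before.
        if customer_id in experiment.assignments:
            return experiment.assignments[customer_id]
        # Outside the roll out: give control behavior WITHOUT recording
        # an assignment, so this customer can still be fairly selected
        # once the percentage rises.
        if not in_rollout(experiment.name, customer_id):
            return "control"
        branch = random.choice(experiment.branches)
        experiment.assignments[customer_id] = branch
        return branch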

At the beginning of all tests, have a setup script that tears down and deletes all experiments, allowing them to be set up anew whenever their code branches are hit. Now, in tests that touch experiments, those experiments will be spawned at 0% roll out – and be deterministic. In production, existing users don’t have to be assigned to the control branch permanently during roll out; you’ll be back to an unbiased one-variable experiment.
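
Continuing the same hypothetical sketch, the setup might look like this. Here “deleting all experiments” just means wiping the sticky assignments and roll out percentages, so each experiment effectively respawns at 0% the first time a test touches it:

    import unittest

    class ExperimentTestCase(unittest.TestCase):
        def setUp(self):
            # Tear down all experiment state before every test: each
            # experiment is re-created on demand at 0% roll out.
            button_copy.assignments.clear()
            rollout_percentage.clear()

        def test_control_behavior_is_deterministic(self):
            # At 0% roll out no random selection happens, so every fake
            # customer deterministically sees the control behavior.
            self.assertEqual(get_button_label("fake-customer-1"), "Purchase")

        def test_hypothesis_behavior(self):
            # Tests that assert on the experiment itself opt in
            # explicitly: raise the roll out and pre-assign the customer.
            set_rollout("button-copy", 100)
            button_copy.assignments["fake-customer-1"] = "hypothesis"
            self.assertEqual(get_button_label("fake-customer-1"), "Buy now!")

This assumes get_button_label now routes through the roll-out-aware branch_for from the previous sketch rather than calling the experiment directly.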
