Best Practices for A/B Tests


Use the following best practices to help you design and maintain A/B tests. Before you get started, make sure you understand the different Configuration Attributes and Metrics you can use in a test.

A/B testing stages

A/B tests have the following three stages:

  • Design – Your test isn't running yet. In this phase, you define your hypothesis, key metrics, guardrail metrics, control experiences, and treatment experiences.
  • In-flight – Your test is running. In this phase, a previously defined percentage of your customers receive the treatment version of your test.
  • Analysis – Your test is over. In this phase, you analyze the results you collected and make an informed decision on which skill version to use going forward: either your control version or your treatment version.

Frequently Asked Questions

Design best practices

What concepts should I address when designing an A/B Test?

You should address the following concepts.

  • Key metrics – A set of metrics which are best suited to evaluate the hypothesis you're testing in your treatment version.
  • Experiment-specific guardrail metrics – A set of metrics formally included in your test design to detect unintended degradation of your skill or your customer experience.
  • Launch criteria – A set of criteria that, if satisfied, validate that launching your treatment version to all customers is the correct choice. Note: There isn't a dedicated field for these values in the A/B Testing design schema. You must define and track these values yourself.
  • Design duration – How long your test must run to include the number of customers required to reach the desired statistical power for your key metrics, that is, your sample size (see the sketch after this list). Note: There isn't a dedicated field for these values in the A/B Testing design schema. You must define and track these values yourself.
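
For example, if one of your key metrics is a conversion rate, a standard power calculation can estimate the sample size you need per group. The following is a minimal sketch in Python using statsmodels; the baseline rate, minimum detectable lift, significance level, and power are illustrative assumptions that you should replace with your own values.

```python
# Sketch of a sample-size (power) calculation for a conversion-rate key metric.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed current conversion rate (control)
expected_rate = 0.11   # assumed smallest lift worth detecting (treatment)
alpha = 0.05           # acceptable false positive (Type I error) rate
power = 0.80           # probability of detecting the lift if it's real

effect_size = proportion_effectsize(expected_rate, baseline_rate)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    alternative="two-sided",
)
print(f"Customers needed per group: {round(n_per_group)}")
```

Combined with your skill's typical daily traffic, this number translates into how long your test needs to run (see the in-flight best practices below).
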
What's a hypothesis?

A hypothesis is one of the most important parts of your A/B test. Your hypothesis is a claim that you expect to evaluate with your experimental content. An example hypothesis could be "Changing the content in my upsell messaging will drive more sales." By stating and seeking evidence for or against the hypothesis, you can create outcomes that you apply to products beyond those under experimentation.

What's a null hypothesis?

A null hypothesis is the default position that there is no difference between the measured response of your two test groups (control and treatment).

What are key metrics?

Key metrics are the most suited (sensitive) metrics expected to provide unambiguous evidence of the intended changes to the customer experience caused by the treatment; key metrics should measure the intended consequences of a test. You should determine your key metrics in the design stage of an A/B test. Before starting a test, decide which metrics play an important role in determining whether you launch your treatment version to all users (and, therefore, whether it becomes the new control).

What are guardrail metrics?

A/B test-specific guardrail metrics are the most suited (sensitive) guardrail metrics expected to detect unintended customer experience degradation caused by the treatment; guardrail metrics are your best guess at contrary behavior that the treatment version might inadvertently cause. You should define your experiment-specific guardrail metrics in the design stage of an A/B test. Before starting a test, decide which metrics play an important role in determining whether you launch your treatment version to all users.

How do I select key metrics for my A/B tests?
You should select one to three key metrics that track changes in your customer behavior as they relate to your hypothesis. For example, if your hypothesis states that you can increase customer subscriptions by changing the location of your ISP upsell messaging, then you should select the ISP accept and ISP conversion metrics as your key metrics.
What's a false positive result?
A false positive is the rejection of a true null hypothesis. It's otherwise known as a type I error.
Why do I only choose one to three key metrics?
Choosing one to three key metrics reduces your false positive rate, which grows when you test multiple hypotheses. Conducting multiple hypothesis tests on the same data leads to higher false positive rates because you have increased the number of tests without increasing the number of independent data points.
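
The arithmetic below (a hedged illustration that assumes the metrics are tested independently, each at a 5% significance level) shows how quickly the chance of at least one false positive grows with the number of metrics tested.

```python
# Family-wise false positive rate when testing k independent hypotheses,
# each at significance level alpha (independence is an illustrative assumption).
alpha = 0.05
for k in (1, 3, 10):
    family_wise_rate = 1 - (1 - alpha) ** k
    print(f"{k} key metric(s) -> {family_wise_rate:.1%} chance of at least one false positive")
# 1 key metric(s)  ->  5.0%
# 3 key metric(s)  -> 14.3%
# 10 key metric(s) -> 40.1%
```
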
How do I select guardrail metrics?
You should select guardrail metrics that detect any unintended negative consequences to your customer base. Your guardrail metrics should cover all your customer behavior categories, including engagement, retention, and monetization. For example, if you set a hypothesis to test customer retention, you can use guardrail metrics to make sure that your dialogs don't drop, your friction doesn't increase, and your skill monetization doesn't decrease.

In-Flight best practices

What should I monitor when my A/B test is in-flight?
After you launch your test, monitor your skill to make sure you haven't introduced any unintended bugs, errors, outages, or throttling. Also monitor your guardrail metrics to make sure your treatment experience doesn't significantly degrade your skill experience, and closely monitor the performance of your skill for at least the first three days of your test.
How long should I run my A/B test for?
You should run an A/B test for a set number of weeks, up to a maximum of four weeks: for example, 7 days, 14 days, 21 days, or the maximum value of 28 days. Specifying a multiple of 7 days means that the test stops at the end of a full week, which helps prevent you from terminating or altering your test the moment one of your key metrics reaches statistical significance due to a day-of-the-week effect. If an experiment stops in the middle of the week, the collected data might be biased and might fail to reflect actual behavior. Likewise, you shouldn't extend a test for the sake of meeting some predetermined launch criteria.
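
As a rough illustration of how a required sample size and your average daily traffic combine into a whole-week duration, here is a minimal sketch; both input numbers are hypothetical placeholders, not values from your skill.

```python
import math

# Hypothetical inputs: replace with your own power-analysis result and traffic data.
required_sample_size = 12000   # total customers needed across control and treatment
average_daily_customers = 700  # customers expected to enter the test per day

raw_days = math.ceil(required_sample_size / average_daily_customers)
# Round up to a whole number of weeks, capped at the 28-day maximum.
duration_days = min(math.ceil(raw_days / 7) * 7, 28)
print(f"Run the test for {duration_days} days")  # 21 days for these inputs
```
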
What should I do if I find a bug in my code while my A/B test is running?
Complete the following actions if you find a bug in your code.
  1. Disable the A/B test immediately to avoid any negative customer impact.
  2. Fix the code and push the changes to production.
  3. Re-configure your triggers and launch a new A/B Test (when ready). Any data collected prior to the code change shouldn't be used for making a launch decision.
I received a Treatment Allocation Alarm (TAA) during my A/B test, what should I do?
For more details, see Issue: I received a Treatment Allocation Alarm (TAA) during my A/B test.
How do I determine if a TAA alarm is a false positive?
All statistical tests have a probability of producing a false positive result, that is, the Type I error rate. There is no absolute certainty with statistical tests.
What are the impacts of submitting a skill while the A/B test is running?
Changes in skill resources in a new published version of the skill, for example changes to utterances and intents, can change the metrics for the A/B test. Changes to the treatment experience in the endpoint code can also change the metrics for the test.

Analysis best practices

What's a P-Value?
The p-value is the probability of seeing a result at least as far from zero as the one you observed, assuming that the null hypothesis is true.
What's a confidence interval?
A confidence interval is one way of presenting the uncertainty associated with a given measurement of a parameter of interest.
What's the Average Treatment Effect (ATE)?
The average treatment effect (ATE) is measured as the difference between the mean of the treatment group and the mean of the control group, that is, ATE = mean(treatment) - mean(control).
What should I consider when analyzing an A/B test?
You should review the following items for each key metric that has a statistically significant Average Treatment Effect (ATE).
  • The 95% confidence interval of the relative change.
  • The p-value of the obtained result.
  • (Optional) The percent difference of the ATE.

Additionally, your guardrail metrics shouldn't show degradation of your skill experience. For example, you might write your test summary similar to the following statement.

With respect to the active days metric (the key metric), the 95% confidence interval of the relative percent difference increase is (0.37%, 1.44%) with p-value = 0.001. Hence, there is very strong evidence that the number of active days in the treatment version has increased. Furthermore, there is no evidence of degradation of the skill experience with respect to the guardrail metrics.
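
To make the summary above concrete, here is a minimal sketch of how such numbers can be computed from per-customer metric values. The synthetic data, group sizes, and the use of Welch's t-test with a normal-approximation confidence interval are illustrative assumptions, not a description of how the A/B Testing service computes its results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-customer values of a key metric (for example, active days);
# in practice, use the data from your own experiment.
control = rng.poisson(lam=5.0, size=4000)
treatment = rng.poisson(lam=5.05, size=4000)

# Average treatment effect: difference in group means.
ate = treatment.mean() - control.mean()

# Welch's t-test for the difference in means.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Normal-approximation 95% confidence interval for the absolute difference,
# expressed relative to the control mean (the relative percent difference).
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci_low, ci_high = ate - 1.96 * se, ate + 1.96 * se
rel_ci = (100 * ci_low / control.mean(), 100 * ci_high / control.mean())

print(f"ATE = {ate:.3f}, p-value = {p_value:.3f}")
print(f"95% CI of the relative change: ({rel_ci[0]:.2f}%, {rel_ci[1]:.2f}%)")
```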

What should I avoid when analyzing an A/B test?
For best results, avoid these common analysis behaviors.
  • Don't analyze your metrics at different time periods. Instead, analyze all metrics within the same time and duration of a test.
  • Don't use guardrail metrics as the sole reason for publishing a test's treatment version. You should only use guardrail metrics to flag cases where there is strong evidence against launching the treatment version due to unintended consequences.
  • Don't report metrics that aren't part of your initial launch criteria. Your decision should focus on the intended changes in your key metrics.
What are some decisions I can make at the end of an A/B test?
After you finish an A/B test, you might want to complete the following actions.
  • Launch criteria evaluation – Decide if your test was successful and if you validated your hypothesis.
  • Publishing review – Decide if you should publish your test's treatment version. You should also consider factors outside of your test. For example, evaluating your test might suggest that your treatment version outperformed your control version, but the cost of launching this new version temporarily prevents publishing.
  • Investigation into the results – Deep dive into your test results to determine why there was a deterioration in key or guardrail metrics. After you identify the problems, you can make the appropriate adjustments to your test's treatment version and re-run the test.
  • Re-run of the test – Decide if you should re-run the test. You might do this if you noticed a flaw or bias in your test.
  • Abandon – Decide if you should abandon the test. You might do this if you determined the results of your treatment version aren't beneficial to your skill.
  • Customer traffic increase – Decide if you should increase your sample size. You might do this if your sample size isn't large enough to yield a statistically significant result. At this point, you can re-run the test with more customer traffic allocated to the treatment version.

Last updated: Oct 13, 2023