Plan an experiment

From a planning standpoint, an experiment has these parts:

  • A baseline control – One of the experiment variants will be the control. This is "how we are doing things today." If you are experimenting from the start, choose the simplest variant as the control.

  • Experiment variants – Experiment variants other than the control are attempts to improve the user’s extended search experience. Which relevancy strategy works best for your search app and your users?

  • Metrics – This is how you know whether the search variants produce differences in user interactions, and whether the differences are statistically significant.

In the remainder of this topic, you’ll make decisions about these broad areas, as well as about experiment details.

1. Plan what you want to vary

Identify different relevancy strategies, where each represents a hypothesis about which user experience will drive more click-throughs, purchases, and so on. Use the Query Workbench to explore how to produce different search results and recommendations using different query pipelines, and evaluate which ones might engage your users most effectively.

2. Plan what you want to measure

Metrics compare the control against other variants pairwise. For example, if the variants are A, B, C, and D, and you choose A as the control, then the comparisons for which metrics are generated will be A/B, A/C, and A/D.
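The pairwise scheme can be sketched in a few lines of Python. This is only an illustration of which comparisons result from a given choice of control; the function name control_comparisons and the variant names are hypothetical and are not part of Fusion.

```python
def control_comparisons(variants, control):
    """Return a (control, other) pair for every non-control variant."""
    if control not in variants:
        raise ValueError(f"control {control!r} is not one of the variants")
    return [(control, variant) for variant in variants if variant != control]


pairs = control_comparisons(["A", "B", "C", "D"], control="A")
print(pairs)  # [('A', 'B'), ('A', 'C'), ('A', 'D')]
```

Note that non-control variants are never compared with each other, so the number of comparisons is always one less than the number of variants.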

3. Design the experiment

When designing an experiment, you must make these decisions:

  • How users are identified

  • Percentage of total traffic to send through the experiment

  • Number of variants and how they differ

  • Metrics to generate

In many cases, identifying users is straightforward: use an existing user ID or session ID if the application has one. In other cases, you may need to generate an identifier to send with queries. It is important to send an identifier with each query so that the experiment can route the query to a variant, and to send that same identifier with any subsequent signals that result from that query. Queries without a user ID will not be routed through the experiment.

The percentage of total traffic to send through the experiment is the one variable that can change over the course of the experiment. It is often a good practice to start out sending only a small percentage of search traffic through a new experiment, in order to verify that each of the variants is functioning properly. Then, once you have established that the behavior is as intended, you can increase the percentage of traffic through the experiment to the desired level.

If usage is modest and the expected effect is small, or if you are testing several variants at the same time, you might want to send 100% of users through the experiment and let it run longer. If usage is high, the expected effect is larger, and there are only two variants, you might not need to send all users through the experiment, and the experiment won’t need to run as long.

4. Choose traffic weights

Fusion AI uses traffic weights to apportion search traffic among the variants. This allows you to send a different percentage of traffic through each variant if desired.

4.1. Automatic traffic weights (multi-armed bandit)

The Automatically Adjust Weights Between Variants configuration option enables multi-armed bandits and eliminates the need to specify a traffic weight for each variant.

In multi-armed bandit mode, metrics jobs are created and scheduled automatically once the experiment starts. The weights between variants only change after the metrics jobs run.

Fusion’s multi-armed bandit implementation uses a variation of Thompson Sampling (sometimes called Bayesian Bandits). This algorithm uses the current count of successes versus failures to build a beta distribution that represents the level of confidence in the primary metric value for each variant. It then samples a random number from each variant’s distribution and picks the variant with the highest number.

This type of implementation has three effects:

  • It weights better-performing variants higher.

    Since the beta distribution of each variant is centered around the primary metric value for that variant, a random number selected from a higher-performing variant is likely to be higher than a random number picked from a lower-performing variant.

  • It keeps lower-performing variants in play.

    Picking a random number from each distribution preserves the chance that Fusion will try a lower-performing variant, as long as there is still a chance that it is better.

  • The more confident the measurements, the narrower the beta distributions become.

    The more uncertain the measurements, the wider the distributions will be, and thus the more likely that Fusion will choose variants that appear to be performing more poorly.
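The sampling step described above can be sketched as follows. This is a minimal illustration of textbook Thompson Sampling, not Fusion’s actual variation of it. The uniform Beta(1, 1) prior, the function names thompson_pick and estimate_weights, and the success and failure counts are all assumptions made for the example.

```python
import random


def thompson_pick(stats, rng=random):
    """Pick a variant by drawing once from each Beta posterior.

    stats maps a variant name to (successes, failures). Adding 1 to each
    count corresponds to a uniform Beta(1, 1) prior (an assumption here).
    """
    draws = {
        name: rng.betavariate(successes + 1, failures + 1)
        for name, (successes, failures) in stats.items()
    }
    return max(draws, key=draws.get)


def estimate_weights(stats, rounds=10_000, seed=0):
    """Estimate each variant's share of traffic by repeating the pick."""
    rng = random.Random(seed)
    counts = {name: 0 for name in stats}
    for _ in range(rounds):
        counts[thompson_pick(stats, rng)] += 1
    return {name: count / rounds for name, count in counts.items()}


# Hypothetical counts: B converts at 15%, A at 12%.
well_measured = {"A": (120, 880), "B": (150, 850)}
# Same ordering of variants, but far fewer observations.
barely_measured = {"A": (1, 9), "B": (2, 8)}

print(estimate_weights(well_measured))
print(estimate_weights(barely_measured))
```

With many observations the distributions are narrow, so B wins nearly every draw while A still receives a small share. With few observations the distributions are wide and overlap heavily, so A is picked far more often.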

Since Fusion adjusts the weights between variants each time the metrics jobs run, users can still get different results on subsequent visits. For example, if variant A is getting 80% of traffic, but after recalculating metrics it is only getting 50% of traffic, then some users who were previously assigned to variant A will be assigned to variant B. However, only the bare minimum of users will be reassigned to a new variant. Most users will see no changes. Once the experiment has been running for some time, the changes between the variants should be fairly small, so relatively few users should be affected.
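One way to see why so few users need to move is to picture each user ID hashed to a fixed position between 0 and 1, with each variant owning a contiguous slice of that range. The sketch below assumes that scheme for illustration only; the source does not describe how Fusion assigns users internally, and the names user_position and assign are hypothetical.

```python
import hashlib


def user_position(user_id):
    """Map a user ID to a stable position in the unit interval."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def assign(user_id, weights):
    """Assign a user to a variant; weights is an ordered list of (name, weight)."""
    total = sum(weight for _, weight in weights)
    position = user_position(user_id)
    cumulative = 0.0
    for name, weight in weights:
        cumulative += weight / total
        if position < cumulative:
            return name
    return weights[-1][0]  # guard against floating-point rounding at the top end


users = [f"user-{i}" for i in range(10_000)]
before = {u: assign(u, [("A", 0.8), ("B", 0.2)]) for u in users}
after = {u: assign(u, [("A", 0.5), ("B", 0.5)]) for u in users}

moved = [u for u in users if before[u] != after[u]]
print(len(moved) / len(users))  # close to 0.3
```

In this two-variant example, only users whose position falls between 0.5 and 0.8 change variant, and every one of them moves from A to B. With more than two variants, contiguous slices are not guaranteed to give the smallest possible number of reassignments.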

4.2. Manually specifying traffic weights

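When you specify traffic weights manually, each variant receives a proportion of traffic equal to its traffic weight divided by the sum of the traffic weights of all variants. The formula for variant A is:

Proportion_A = (Traffic weight_A) / (Sum of traffic weights for all variants)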

For example:

Variant traffic weights    Sum of traffic weights    Variant proportions
1.0, 1.0                   2                         0.5, 0.5
1.0, 1.0, 2.0              4                         0.25, 0.25, 0.5
0.5, 1.0, 1.0, 2.5         5                         0.1, 0.2, 0.2, 0.5
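The same arithmetic can be checked with a short Python sketch. The function name variant_proportions is hypothetical; it simply applies the formula above to a list of weights.

```python
def variant_proportions(weights):
    """Divide each traffic weight by the sum of all traffic weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("traffic weights must sum to a positive number")
    return [weight / total for weight in weights]


print(variant_proportions([1.0, 1.0]))            # [0.5, 0.5]
print(variant_proportions([1.0, 1.0, 2.0]))       # [0.25, 0.25, 0.5]
print(variant_proportions([0.5, 1.0, 1.0, 2.5]))  # [0.1, 0.2, 0.2, 0.5]
```

Because only the ratios matter, weights of 1.0 and 1.0 produce the same split as weights of 5 and 5.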

5. Calculate sample sizes

Fusion will calculate the required sample size to detect a statistically significant result based on the results at runtime. The "confidence level" metric that is displayed in App Insights has this minimum sample size factored in, so that confidence is always low for experiments that have not yet reached their required sample size.

However, if you would like to use a different power or significance level in evaluating your experiment (Fusion uses a power of 0.8 and a significance level of 0.05), or if you would like to establish your own sample size based on a desired minimum detectable effect, you may do so.
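If you want to estimate a sample size yourself, one common approach is the normal-approximation formula for comparing two proportions with a two-sided test. The sketch below uses that textbook formula, which is not necessarily the calculation Fusion performs. The function name, the example click-through rates, and the defaults (the conventional power of 0.8 and significance level of 0.05) are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate,
                            power=0.8, significance=0.05):
    """Approximate users needed per variant to detect the given difference.

    Uses the two-sided, two-proportion normal approximation:
    n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    if baseline_rate == expected_rate:
        raise ValueError("the rates must differ to define a detectable effect")
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical scenario: 10% click-through today, hoping to detect 12%.
print(sample_size_per_variant(0.10, 0.12))
# A smaller effect or a higher power requires more users per variant.
print(sample_size_per_variant(0.10, 0.11))
print(sample_size_per_variant(0.10, 0.12, power=0.9))
```

The required sample size grows quickly as the minimum detectable effect shrinks, which is why small expected effects call for more traffic or a longer-running experiment.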

6. Choose an implementation approach

You can construct an experiment in either of two ways:

  • Experiment and query profile (recommended) – For most cases, you’ll want to create additional query pipelines that return different search results. A query profile directs traffic through the query pipelines in accordance with the traffic weights of experiment variants.

  • Experiment stage in a query pipeline – If you want to use parts of a single query pipeline in all experiment variants, you can add an Experiment stage to that pipeline (the pipeline that receives search queries). The app can direct queries to the endpoint of a query profile that references the pipeline (recommended) or to the endpoint of the query pipeline. If used, the query profile doesn’t reference an experiment.

Next step

You’ve planned the experiment. Next, you will set it up using either a query profile or an Experiment stage.