Plan an experiment
- 1. Plan what you want to vary
- 2. Plan what you want to measure
- 3. Design the experiment
- 4. Choose traffic weights
- 5. Calculate sample sizes
- 6. Choose an implementation approach
- Next step
From a planning standpoint, an experiment has these parts:
A baseline control – One of the experiment variants will be the control. This is "how we are doing things today." If you are experimenting from the start, choose the simplest variant as the control.
Experiment variants – Experiment variants other than the control are attempts to improve the user’s extended search experience. Which relevancy strategy works best for your search app and your users?
Metrics – This is how you know whether the search variants produce differences in user interactions, and whether the differences are statistically significant.
In the remainder of this topic, you’ll make decisions about these broad areas, as well as about experiment details.
1. Plan what you want to vary
Identify different relevancy strategies, where each represents a hypothesis about which user experience will drive more click-throughs, purchases, and so on. Use the Query Workbench to explore how to produce different search results and recommendations using different query pipelines, and evaluate which ones might engage your users most effectively.
2. Plan what you want to measure
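Metrics compare the control against each of the other variants pairwise. For example, if one of the variants is D and you choose the variant named experiment as the control, then metrics are generated for the comparison of experiment against D, and likewise for experiment against every other non-control variant.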
You can learn more about metrics.
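The pairwise rule can be sketched in a few lines of Python. This is only an illustration of which comparisons exist, not of how Fusion computes metrics; the function name and the variant names are hypothetical.

```python
def control_comparisons(variants, control):
    """Return one (control, other) pair for each non-control variant."""
    if control not in variants:
        raise ValueError(f"control {control!r} is not one of the variants")
    return [(control, v) for v in variants if v != control]


# Hypothetical variant names, for illustration only.
pairs = control_comparisons(["control", "boosted", "reranked"], "control")
for left, right in pairs:
    print(f"{left} vs. {right}")
```

With N variants there are always N - 1 comparisons, because non-control variants are never compared with each other.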
3. Design the experiment
When designing an experiment, you must make these decisions:
How users are identified
Percentage of total traffic to send through the experiment
Number of variants and how they differ
Metrics to generate
In many cases identifying users is straightforward, using an existing user ID or session ID if the application has one. In other cases, you may need to generate an identifier of some sort to send with queries. It is important to send an identifier with each query so that the experiment can route the query to a variant, and to send that same identifier with any subsequent signals that result from that query. Queries without a user ID are not routed through the experiment.
The percentage of total traffic to send through the experiment is the one variable that can change over the course of the experiment. It is often good practice to start by sending only a small percentage of search traffic through a new experiment, in order to verify that each of the variants is functioning properly. Then, once you have established that the behavior is as intended, you can increase the percentage of traffic through the experiment to the desired level.
If usage is modest and the expected effect is small, or if you are testing several variants at the same time, you might want to send 100% of users through the experiment and let it run longer. If usage is high, the expected effect is larger, and there are only two variants, you might not need to send all users through the experiment, and the experiment won’t take as long.
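To make the role of the identifier and the traffic percentage concrete, here is a minimal Python sketch of deterministic bucketing. It is a generic technique shown under stated assumptions, not Fusion's actual routing logic; `in_experiment` and its parameters are hypothetical names.

```python
import hashlib


def in_experiment(user_id, traffic_percent):
    """Decide deterministically whether a user's queries enter the experiment.

    Hypothetical helper: hashes the identifier into one of 10,000 buckets
    and admits the lowest `traffic_percent` percent of them.
    """
    if not user_id:
        return False  # no identifier, so the query bypasses the experiment
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < traffic_percent * 100


# The same identifier always gets the same answer for a given percentage.
print(in_experiment("session-1234", 10) == in_experiment("session-1234", 10))
```

Because the bucket depends only on the identifier, raising the percentage adds new users to the experiment without removing anyone who was already in it.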
4. Choose traffic weights
Fusion AI uses traffic weights to apportion search traffic among the variants. This allows you to send a different percentage of traffic through each variant if desired.
4.1. Automatic traffic weights (multi-armed bandit)
The Automatically Adjust Weights Between Variants configuration option enables multi-armed bandits and eliminates the need to specify a traffic weight for each variant.
In multi-armed bandit mode, metrics jobs are created automatically once the experiment starts. However, the jobs must be scheduled manually. Schedule the metrics jobs to run hourly; the weights between variants change only after the metrics jobs run.
Fusion’s multi-armed bandit implementation uses a variation of Thompson Sampling (sometimes called Bayesian Bandits). This algorithm uses the current count of successes versus failures to build a beta distribution that represents the level of confidence in the primary metric value for each variant. It then samples a random number from each variant’s distribution, and picks the variant with the highest number.
This type of implementation has three effects:
It weights better-performing variants higher. Since the beta distribution of each variant is centered around the primary metric value for that variant, a random number selected from a higher-performing variant is likely to be higher than a random number picked from a lower-performing variant.
Lower-performing variants remain in play. Picking a random number from each distribution preserves the chance that Fusion will try a lower-performing variant, as long as there is still a chance that it is better.
The more confident the measurements, the narrower the beta distributions become. The more uncertain the measurements, the wider the distributions will be, and thus the more likely that Fusion will choose variants that appear to be performing more poorly.
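The three effects can be seen in a small simulation. The Python sketch below implements textbook Thompson Sampling with a uniform Beta(1, 1) prior; Fusion uses a variation whose details are not given here, so treat the function names, the prior, and the success and failure counts as assumptions made for illustration.

```python
import random


def thompson_pick(stats, rng=random):
    """Pick a variant by drawing once from each variant's beta distribution.

    `stats` maps a variant name to a (successes, failures) tuple.
    """
    best_name, best_draw = None, -1.0
    for name, (successes, failures) in stats.items():
        # Beta(successes + 1, failures + 1): uniform prior plus observed counts.
        draw = rng.betavariate(successes + 1, failures + 1)
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name


def estimate_weights(stats, draws=10000, seed=0):
    """Estimate each variant's share of traffic by repeating the pick."""
    rng = random.Random(seed)
    counts = {name: 0 for name in stats}
    for _ in range(draws):
        counts[thompson_pick(stats, rng)] += 1
    return {name: count / draws for name, count in counts.items()}


# Hypothetical counts: same success rates, but ten times more data in `late`.
early = {"A": (12, 88), "B": (10, 90)}
late = {"A": (120, 880), "B": (100, 900)}
print(estimate_weights(early))
print(estimate_weights(late))
```

In both cases A receives most of the traffic and B keeps a nonzero share. With more data the distributions narrow, so B's share in `late` is smaller than in `early`.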
Since Fusion adjusts the weights between variants each time the metrics jobs run, users can still get different results on subsequent visits. For example, if variant A is getting 80% of traffic, but after recalculating metrics it is only getting 50% of traffic, then some users who were previously assigned to variant A will be assigned to variant B. However, only the bare minimum of users will be reassigned to a new variant. Most users will see no changes. Once the experiment has been running for some time, the changes between the variants should be fairly small, so relatively few users should be affected.
4.2. Manually specifying traffic weights
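When you specify traffic weights manually, the proportion of traffic that a variant receives is its traffic weight divided by the sum of the traffic weights of all variants. The formula for variant A is:
Proportion_A = (Traffic weight_A) / (Sum of traffic weights for all variants)
|Variant traffic weights|Sum of traffic weights|Variant proportions|
|1.0, 1.0|2.0|0.5, 0.5|
|0.25, 0.25|0.5|0.5, 0.5|
|0.5, 1.0, 1.0|2.5|0.2, 0.4, 0.4|
|0.1, 0.2, 0.2|0.5|0.2, 0.4, 0.4|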
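The proportion formula is simple enough to check by hand, but a short Python sketch makes it easy to try other weights. The function name is hypothetical and is not part of any Fusion API.

```python
def variant_proportions(weights):
    """Convert traffic weights into the fraction of traffic per variant."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("traffic weights must sum to a positive number")
    return [weight / total for weight in weights]


# Weights of 0.5, 1.0, and 1.0 sum to 2.5.
print(variant_proportions([0.5, 1.0, 1.0]))
```

Only the ratios between weights matter: weights of 1.0 and 1.0 give the same even split as 0.25 and 0.25.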
5. Calculate sample sizes
Fusion will calculate the required sample size to detect a statistically significant result based on the results at runtime. The "confidence level" metric that is displayed in App Insights has this minimum sample size factored in, so that confidence is always low for experiments that have not yet reached their required sample size.
However, you can evaluate your experiment with a different power or significance level (Fusion uses 0.8 and 0.05, respectively), or establish your own sample size based on a desired minimum detectable effect.
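If you do want to estimate a sample size yourself, the following Python sketch uses the standard normal-approximation formula for a two-sided comparison of two proportions, such as click-through rates. It is a textbook calculation rather than a description of Fusion's internal method; the function name, the default power of 0.8, the default significance level of 0.05, and the example rates are all assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            power=0.8, alpha=0.05):
    """Approximate users needed per variant to detect an absolute lift.

    Uses the two-sided, two-proportion normal approximation.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie in (0, 1) and differ from each other")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # power threshold
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical example: 10% baseline click-through, detect a 2-point lift.
print(sample_size_per_variant(0.10, 0.02))
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why small expected effects call for longer experiments or a larger share of traffic.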
6. Choose an implementation approach
You can construct an experiment in either of two ways:
Experiment and query profile (recommended) – For most cases, you’ll want to create additional query pipelines that return different search results. A query profile directs traffic through the query pipelines in accordance with the traffic weights of experiment variants.
Experiment stage in a query pipeline – If you want to use parts of a single query pipeline in all experiment variants, you can add an Experiment stage to that pipeline (the pipeline that receives search queries). The app can direct queries to the endpoint of a query profile that references the pipeline (recommended) or to the endpoint of the query pipeline. If used, the query profile doesn’t reference an experiment.