Plan an experiment
- 1. Plan what you want to vary
- 2. Plan what you want to measure
- 3. Design the experiment
- 4. Choose traffic weights
- 5. Calculate sample sizes
- 6. Choose an implementation approach
- Next step
From a planning standpoint, an experiment has these parts:
A baseline control – One of the experiment variants will be the control. This is "how we are doing things today." If you are experimenting from the start, choose the simplest variant as the control.
Experiment variants – Experiment variants other than the control are attempts to improve the user’s extended search experience. Which relevancy strategy works best for your search app and your users?
Metrics – This is how you know whether the search variants produce differences in user interactions, and whether the differences are statistically significant.
In the remainder of this topic, you’ll make decisions about these broad areas, as well as about experiment details.
1. Plan what you want to vary
Identify different relevancy strategies, where each represents a hypothesis about which user experience will drive more click-throughs, purchases, and so on. Use the Query Workbench to explore how to produce different search results and recommendations using different query pipelines, and evaluate which ones might engage your users most effectively.
2. Plan what you want to measure
Metrics compare the control against other variants pairwise. For example, if the variants are experiment, B, C, and D, and you choose experiment as the control, then metrics are generated for these comparisons: experiment versus B, experiment versus C, and experiment versus D.
You can learn more about metrics.
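The pairwise rule can be sketched in a few lines of Python. This is only an illustration: the function name and the variant names are hypothetical and are not part of any Fusion API.

```python
def metric_comparisons(variants, control):
    """Pair the control with every other variant, keeping the given order."""
    if control not in variants:
        raise ValueError(f"control {control!r} is not one of the variants")
    return [(control, other) for other in variants if other != control]


# Illustrative variant names; "experiment" is chosen as the control.
for control, other in metric_comparisons(["experiment", "B", "C", "D"], "experiment"):
    print(f"{control} versus {other}")
```

With N variants there are N - 1 comparisons, because non-control variants are never compared with each other.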
3. Design the experiment
When designing an experiment, you must make these decisions:
How users are identified
Percentage of total traffic to send through the experiment
Number of variants and how they differ
Metrics to generate
In many cases, identifying users is straightforward: use an existing user ID or session ID if the application has one. In other cases, you may need to generate an identifier to send with queries. It is important to send an identifier with each query so that the experiment can route the query to a variant, and to send that same identifier with any subsequent signals that result from that query. Queries without a user ID are not routed through the experiment.
The percentage of total traffic to send through the experiment is the one variable that can change over the course of the experiment. It is often good practice to start by sending only a small percentage of search traffic through a new experiment, to verify that each of the variants is functioning properly. Once you have established that the behavior is as intended, you can increase the percentage of traffic through the experiment to the desired level.
If usage is modest and the expected effect is small, or if you are testing several variants at the same time, you might want to send 100% of users through the experiment and let it run longer. If usage is high, the expected effect is larger, and there are only two variants, you might not need to send all users through the experiment, and the experiment won't take as long.
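To make the roles of the identifier and the traffic percentage concrete, here is a minimal Python sketch of one common way to gate traffic: hash the user ID to a stable number and compare it with the traffic fraction. This is an assumption for illustration, not a description of how Fusion routes queries; the names `bucket`, `in_experiment`, and the `salt` value are hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str = "experiment-1") -> float:
    """Map a user ID to a stable number in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Keep 53 bits so the division is exact and the result is always below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def in_experiment(user_id, traffic_fraction: float) -> bool:
    """Decide whether a query with this identifier goes through the experiment."""
    if not user_id:
        return False  # no identifier: the query bypasses the experiment
    return bucket(user_id) < traffic_fraction


# Start small (10% of traffic), then raise the fraction later.
for uid in ["user-17", "session-abc", None]:
    print(uid, in_experiment(uid, traffic_fraction=0.10))
```

Because the bucket depends only on the identifier, a user who is in the experiment at 10% stays in it when the fraction is raised, and the same identifier sent with later signals maps to the same decision.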
4. Choose traffic weights
Fusion AI uses traffic weights to apportion search traffic among the variants. This allows you to send a different percentage of traffic through each variant if desired.
The formula for variant A is:
Proportion_A = (Traffic weight_A) / (Sum of traffic weights for all variants)
| Variant traffic weights | Sum of traffic weights | Variant proportions |
| 1.0, 1.0 | 2.0 | 0.5, 0.5 |
| 0.25, 0.25 | 0.5 | 0.5, 0.5 |
| 0.5, 1.0, 1.0 | 2.5 | 0.2, 0.4, 0.4 |
| 0.1, 0.2, 0.2 | 0.5 | 0.2, 0.4, 0.4 |
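The formula above can be checked with a short Python sketch. The function name `variant_proportions` and the variant names are hypothetical, used only to illustrate the arithmetic.

```python
def variant_proportions(weights):
    """Convert traffic weights into per-variant traffic proportions.

    weights: mapping of variant name -> traffic weight (non-negative numbers).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("the traffic weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


print(variant_proportions({"A": 0.5, "B": 1.0, "C": 1.0}))
# {'A': 0.2, 'B': 0.4, 'C': 0.4}
```

Only the ratios between weights matter: weights of 1.0 and 1.0 give the same split as 0.25 and 0.25.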
5. Calculate sample sizes
Fusion will calculate the required sample size to detect a statistically significant result based on the results at runtime. The "confidence level" metric that is displayed in App Insights has this minimum sample size factored in, so that confidence is always low for experiments that have not yet reached their required sample size.
However, if you want to evaluate your experiment with a different power or significance level (Fusion uses 0.8 and 0.05, respectively), or to establish your own sample size based on a desired minimum detectable effect, you can do so.
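If you do want to estimate a sample size yourself, the following Python sketch uses the standard textbook approximation for comparing two proportions (for example, click-through rates) with a two-sided test. It is not necessarily the calculation Fusion performs. The function name, the parameter names, and the default power and significance values (the conventional 0.8 and 0.05) are assumptions for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            power=0.8, significance=0.05):
    """Approximate users needed per variant to detect an absolute change
    of min_detectable_effect from baseline_rate (two-sided z-test)."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    if min_detectable_effect == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie in (0, 1) and the effect must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)  # significance quantile
    z_beta = NormalDist().inv_cdf(power)                  # power quantile
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Example: 10% baseline click-through rate, detect an absolute lift of 2 points.
print(sample_size_per_variant(0.10, 0.02))
```

A smaller minimum detectable effect or a higher power increases the required sample size, which is why experiments that look for small effects need more traffic or more time.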
6. Choose an implementation approach
You can construct an experiment in either of two ways:
Experiment and query profile (recommended) – For most cases, you’ll want to create additional query pipelines that return different search results. A query profile directs traffic through the query pipelines in accordance with the traffic weights of experiment variants.
Experiment stage in a query pipeline – If you want to use parts of a single query pipeline in all experiment variants, you can add an Experiment stage to that pipeline (the pipeline that receives search queries). The app can direct queries to the endpoint of a query profile that references the pipeline (recommended) or to the endpoint of the query pipeline. If used, the query profile doesn’t reference an experiment.