Plan an experiment
From a planning standpoint, an experiment has these parts:
- A baseline control. One of the experiment variants will be the control. This is "how we are doing things today." If you are experimenting from the start, choose the simplest variant as the control.
- Experiment variants. Experiment variants other than the control are attempts to improve the user’s extended search experience. Which relevancy strategy works best for your search app and your users?
- Metrics. This is how you know whether the search variants produce differences in user interactions, and whether the differences are statistically significant.
In the remainder of this topic, you will make decisions about these broad areas, as well as about experiment details.
1. Plan what you want to vary
Identify different relevancy strategies, where each represents a hypothesis about which user experience will drive more click-throughs, purchases, and so on. Use the Query Workbench to explore how to produce different search results and recommendations using different query pipelines, and evaluate which ones might engage your users most effectively.
2. Plan what you want to measure
Metrics compare the control against other variants pairwise. For example, if the variants are A, B, C, and D, and you choose A as the control, then the comparisons for which metrics are generated will be A/B, A/C, and A/D.
For more information, see experiment metrics.
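As a minimal sketch of the pairwise rule, the snippet below lists the comparisons for a set of variants, using the illustrative names A, B, C, and D with A as the control. The function name `control_comparisons` is hypothetical and is not part of Managed Fusion.

```python
def control_comparisons(variants, control):
    """Return one (control, other) pair for each non-control variant."""
    if control not in variants:
        raise ValueError(f"control {control!r} is not one of the variants")
    return [(control, v) for v in variants if v != control]


pairs = control_comparisons(["A", "B", "C", "D"], control="A")
print(["/".join(p) for p in pairs])  # ['A/B', 'A/C', 'A/D']
```

With N variants there are N - 1 comparisons, because non-control variants are not compared with each other.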
3. Design the experiment
When designing an experiment, you must make these decisions:
- How users are identified
- Percentage of total traffic to send through the experiment
- Number of variants and how they differ
- Metrics to generate
In many cases, identifying users is straightforward: use an existing user ID or session ID if the application has one. In other cases, you may need to generate an identifier to send with queries. It is important to send an identifier with each query so that the experiment can route the query to a variant, and to send that same identifier with any subsequent signals that result from that query. Queries without a user ID will not be routed through the experiment.
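To illustrate why a stable identifier matters, here is a sketch of deterministic, hash-based variant assignment. This is an assumption made for illustration, not Managed Fusion's actual routing logic. The function `assign_variant` and the `weights` dictionary are hypothetical names.

```python
import hashlib


def assign_variant(user_id, weights):
    """Map an identifier to a variant name, or None when there is no identifier.

    weights: dict of variant name -> traffic weight (hypothetical structure).
    """
    if not user_id or not weights:
        return None  # mirrors the rule that queries without an ID are not routed
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return list(weights)[-1]  # guard against rounding at the upper edge


weights = {"control": 1.0, "variant-b": 1.0}
# The same identifier always lands on the same variant.
print(assign_variant("user-42", weights) == assign_variant("user-42", weights))  # True
```

Under a scheme like this, a signal that carries the same identifier as its query can be attributed to the variant that served the query, whereas a new or missing identifier cannot.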
The percentage of total traffic to send through the experiment is the one variable that can change over the course of the experiment. It is often good practice to start by sending only a small percentage of search traffic through a new experiment, in order to verify that each of the variants is functioning properly. Then, once you have established that the behavior is as intended, you can increase the percentage of traffic through the experiment to the desired level.
If usage is modest and the expected effect is small, or if you are testing several variants at the same time, you might want to send 100% of users through the experiment and let it run longer. If usage is high, the expected effect is larger, and there are only two variants, you might not need to send all users through the experiment, and the experiment will not need to run as long.
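The trade-off between traffic percentage, number of variants, and run time can be estimated with simple arithmetic. The sketch below is a rough planning aid, not a Managed Fusion feature. The function name and all input numbers are hypothetical, and it assumes each user is counted once and daily usage is steady.

```python
import math


def estimated_days(required_per_variant, daily_users, traffic_fraction, proportions):
    """Days until the variant with the smallest share reaches its sample size."""
    if not 0 < traffic_fraction <= 1:
        raise ValueError("traffic_fraction must be in (0, 1]")
    smallest_share = min(proportions)
    users_per_day = daily_users * traffic_fraction * smallest_share
    return math.ceil(required_per_variant / users_per_day)


# Two equal variants, all traffic in the experiment.
print(estimated_days(10_000, 5_000, 1.0, [0.5, 0.5]))          # 4
# Three variants, half of the traffic in the experiment.
print(estimated_days(10_000, 5_000, 0.5, [0.25, 0.25, 0.5]))   # 16
```

More variants or a lower traffic percentage shrink each variant's daily share, so the experiment must run longer to collect the same sample.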
4. Choose traffic weights
Managed Fusion uses traffic weights to apportion search traffic among the variants. This allows you to send a different percentage of traffic through each variant if desired.
4.1. Manually specifying traffic weights
The formula for variant A is:
$$\text{Proportion}_A = \frac{\text{Traffic weight}_A}{\text{Sum of traffic weights for all variants}}$$
For example:
| Variant traffic weights | Sum of traffic weights | Variant proportions |
|---|---|---|
| 1.0, 1.0 | 2 | 0.5, 0.5 |
| 1.0, 1.0, 2.0 | 4 | 0.25, 0.25, 0.5 |
| 0.5, 1.0, 1.0, 2.5 | 5 | 0.1, 0.2, 0.2, 0.5 |
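The proportion formula can be checked with a few lines of code. This is a minimal sketch that reproduces the example rows. The function name `variant_proportions` is hypothetical and not part of Managed Fusion.

```python
def variant_proportions(weights):
    """Divide each traffic weight by the sum of all traffic weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("traffic weights must sum to a positive number")
    return [w / total for w in weights]


print(variant_proportions([1.0, 1.0]))            # [0.5, 0.5]
print(variant_proportions([1.0, 1.0, 2.0]))       # [0.25, 0.25, 0.5]
print(variant_proportions([0.5, 1.0, 1.0, 2.5]))  # [0.1, 0.2, 0.2, 0.5]
```

Only the ratios between the weights matter: weights of 1.0 and 1.0 give the same split as 5.0 and 5.0.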
5. Calculate sample sizes
Managed Fusion calculates the required sample size to detect a statistically significant result based on the results at runtime. The "confidence level" metric has this minimum sample size factored in, so that confidence is always low for experiments that have not yet reached their required sample size.
However, if you would like to use a different power or significance level in evaluating your experiment (Managed Fusion uses a power of 0.8 and a significance level of 0.05), or if you would like to establish your own sample size based on a desired minimum detectable effect, you may do so.
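If you want to establish your own sample size, one common approach is the normal-approximation formula for comparing two proportions with a two-sided test. The sketch below uses that textbook formula. It is an assumption that this fits your metric, and it is not necessarily the calculation Managed Fusion performs. The function name, the example rates, and the defaults (the conventional 0.8 power and 0.05 significance level) are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, min_detectable_effect,
                            power=0.8, significance=0.05):
    """Approximate users needed per variant for a two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect  # absolute change to detect
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical numbers: a 10% click-through rate and a 2-point lift to detect.
print(sample_size_per_variant(0.10, 0.02))
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why small expected effects call for more traffic or a longer run.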
6. Choose an implementation approach
You can construct an experiment in either of two ways:
- Experiment and query profile (recommended). For most cases, you will want to create additional query pipelines that return different search results. A query profile directs traffic through the query pipelines in accordance with the traffic weights of experiment variants.
- Experiment stage in a query pipeline. If you want to use parts of a single query pipeline in all experiment variants, you can add an Experiment stage to that pipeline (the pipeline that receives search queries). The app can direct queries to the endpoint of a query profile that references the pipeline (recommended) or to the endpoint of the query pipeline. If used, the query profile does not reference an experiment.
Next step
You have planned the experiment. Next, you will set it up using either a query profile or an Experiment stage. This guide includes both options.