Analyze Experiment Results
After you have run an experiment in Fusion, you can analyze the results. When you stop an experiment, Fusion runs jobs that calculate metrics for the data that were collected. All jobs associated with an experiment are prefixed with the name of the experiment, that is, <experiment-name>-<metric-name>. For the Query Relevance metric, there are two jobs:
You can also have Fusion generate metrics while an experiment is still running, by running metrics jobs by hand.
When you activate an experiment, Fusion AI schedules metrics jobs for the experiment. These are the default schedules for metrics jobs:
Ground Truth (used for the Query Relevance metric):
First run – Not scheduled. The first time, you must run the Ground Truth job by hand.
Subsequent runs – Every month until the experiment is stopped (by default; you can specify a different schedule)
All other metrics jobs:
First run – 20 minutes after the experiment starts
Subsequent runs – Every 24 hours until the experiment is stopped (by default; you can specify a different schedule)
Last run – Immediately after the experiment is stopped
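The default schedule for the non-Ground-Truth metrics jobs can be sketched as a small calculation. This is only an illustration of the timing rules listed above, not Fusion's scheduler; the function name `default_metrics_runs` and the constants are hypothetical.

```python
from datetime import datetime, timedelta

# Defaults taken from the schedule described above.
FIRST_RUN_DELAY = timedelta(minutes=20)
RUN_INTERVAL = timedelta(hours=24)


def default_metrics_runs(start, stop):
    """Return the expected run times of a metrics job (hypothetical helper).

    Covers the "all other metrics jobs" schedule only; the Ground Truth
    job is not modeled because its first run is started by hand.
    """
    runs = []
    run = start + FIRST_RUN_DELAY
    while run < stop:
        runs.append(run)
        run += RUN_INTERVAL
    # Last run: immediately after the experiment is stopped.
    runs.append(stop)
    return runs


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 0, 0)
    stopped = datetime(2024, 1, 3, 12, 0)
    for when in default_metrics_runs(started, stopped):
        print(when.isoformat())
```

If the experiment is stopped before the first scheduled run, the sketch returns only the end-of-experiment run.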
You can modify the default schedule as follows:
Navigate to the experiment: Analytics > Experiments.
Next to each metric, find the Processing Schedule link. This link is active even if the experiment is running.
Edit the schedule as desired.
Periodic runs of metrics jobs are intended to give you up-to-date metrics. The metrics are always calculated from the beginning of the experiment.
Even with periodically updated metrics, we recommend that you let an experiment run its course before drawing conclusions and taking action.
When you view experiment metrics and statistics in App Insights, that information reflects the experiment’s state the last time the metrics jobs ran. When you stop an experiment, it is especially important that you verify that the end-of-experiment metrics jobs have run.
Navigate to Collections > Jobs.
In the Filter field, enter the experiment name.
This displays only the experiment jobs.
Examine the Last run value below each job name.
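The same check can be expressed in code. The sketch below assumes you have already collected job names and their last-run times into a dictionary; that data structure and both function names are hypothetical, and only the `<experiment-name>-<metric-name>` naming convention comes from this topic.

```python
from datetime import datetime


def experiment_jobs(jobs, experiment_name):
    """Keep only the jobs whose names start with '<experiment-name>-'."""
    prefix = experiment_name + "-"
    return {name: last_run for name, last_run in jobs.items()
            if name.startswith(prefix)}


def jobs_missing_final_run(jobs, experiment_name, stopped_at):
    """Return the experiment's jobs that have not run since it was stopped.

    `jobs` maps a job name to its last-run datetime, or None if the job
    has never run (an assumed representation, not a Fusion API response).
    """
    selected = experiment_jobs(jobs, experiment_name)
    return sorted(name for name, last_run in selected.items()
                  if last_run is None or last_run < stopped_at)


if __name__ == "__main__":
    stopped = datetime(2024, 1, 3, 12, 0)
    observed = {
        "my-exp-ctr": datetime(2024, 1, 3, 12, 1),
        "my-exp-mrr": datetime(2024, 1, 3, 0, 20),
        "other-exp-ctr": None,
    }
    print(jobs_missing_final_run(observed, "my-exp", stopped))
```

Any job that this sketch reports has a Last run value from before the experiment was stopped, so its metrics do not yet reflect the end-of-experiment state.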
After metrics jobs run, you can view the metrics that they have produced in App Insights. For more information about the metrics, read this topic.
Statistical significance calculations inform you whether differences among experiment variants are likely to result from random chance, as opposed to real causal effects.
Fusion AI provides two measures of statistical significance:
Confidence index – The confidence index expresses the confidence that the experiment results are statistically significant. It takes into account the current sample size of the experiment, the sample size required to establish statistical significance accurately, and the calculated p-value.
Percent chance of beating – The percent chance that a variant other than the control does better than the control.
The confidence index gives you a gauge for whether the differences between variants are due to a causal effect (as opposed to random chance). It combines two concepts: the minimum sample size and the p-value. If the number of samples is lower than the minimum sample size, then the confidence index is based entirely on the percentage of the minimum sample size collected so far. If the number of samples is above the minimum sample size, then the confidence index is directly related to the p-value generated using Welch’s t-test, which is a variation of the Student’s t-test. Welch’s t-test is more reliable than the Student’s t-test when samples have unequal variances and/or sample sizes.
The test is a pairwise test, with each comparison being two-tailed (there is no a priori assumption that the difference will be in a specific direction). Fusion AI compares each variant against the first variant (the control), and generates a p-value for the comparison. The confidence index score is based on the lowest p-value amongst the variants.
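To make the comparison concrete, here is a minimal sketch of the Welch's t statistic and its Welch-Satterthwaite degrees of freedom for one control/variant pair, using only the Python standard library. It illustrates the textbook test, not Fusion AI's internal code; the function name and the sample data are made up. Converting the statistic into a two-tailed p-value additionally requires the Student's t distribution (for example, SciPy's `ttest_ind` with `equal_var=False`), which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, variant):
    """Return (t statistic, degrees of freedom) for Welch's t-test.

    Unlike the Student's t-test, the two samples may have different
    variances and different sizes.
    """
    n1, n2 = len(control), len(variant)
    # Squared standard errors, from the unbiased sample variances.
    se1 = variance(control) / n1
    se2 = variance(variant) / n2
    t = (mean(variant) - mean(control)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    control_sample = [1, 2, 3, 4, 5]          # illustrative values only
    variant_sample = [2, 4, 6, 8, 10, 12]     # illustrative values only
    print(welch_t(control_sample, variant_sample))
```

With more than two variants, the comparison would be repeated for each variant against the control, as described above.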
The confidence index is the following value, rounded to the nearest whole number:
CI = 100 * (1-p)
You can recover two digits of the p-value from the confidence index as follows:
p = 1 - CI/100
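The two formulas can be checked with a short sketch. It assumes the experiment is already past the minimum sample size, so the confidence index depends only on the p-values; the function names are hypothetical, and the p-values in the example are invented.

```python
def confidence_index(p_values):
    """Confidence index from the per-variant p-values (versus the control).

    Uses the lowest p-value among the variants, then applies
    CI = 100 * (1 - p), rounded to the nearest whole number.
    """
    lowest_p = min(p_values)
    return round(100 * (1 - lowest_p))


def p_value_from_confidence_index(ci):
    """Recover the p-value to two decimal places: p = 1 - CI/100."""
    return 1 - ci / 100


if __name__ == "__main__":
    ci = confidence_index([0.20, 0.03, 0.50])
    print(ci, round(p_value_from_confidence_index(ci), 2))
```

Because the index is rounded to a whole number, the recovered p-value is accurate only to two decimal places.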
The percent chance of beating uses a Bayesian algorithm to calculate the percent chance that a variant other than the control does better than the control.
When calculating the percent chance of beating, Fusion AI uses up to 30 days of historical signal data to establish a baseline to compare against. The baseline is useful but not required. If the historical data is available, an experiment can reach higher confidence numbers more quickly.
Fusion AI calculates historical metrics one time and stores them, so subsequent runs of the metrics calculation jobs will not need to recalculate them.
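This topic does not describe the Bayesian algorithm itself. As a generic illustration only, the sketch below estimates a chance of beating the control with a Beta-Binomial model and Monte Carlo sampling, which is one common approach for rate metrics such as click-through rate. Everything in it is an assumption: the uniform Beta(1, 1) prior, the (successes, trials) inputs, and the function name. It does not model the historical baseline and should not be read as Fusion AI's method.

```python
import random


def chance_of_beating(control, variant, draws=20000, seed=0):
    """Estimate the percent chance that `variant` beats `control`.

    `control` and `variant` are (successes, trials) pairs, for example
    (clicks, queries). Assumes a uniform Beta(1, 1) prior on each rate.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    c_success, c_trials = control
    v_success, v_trials = variant
    wins = 0
    for _ in range(draws):
        # Draw one plausible rate for each arm from its Beta posterior.
        c_rate = rng.betavariate(1 + c_success, 1 + c_trials - c_success)
        v_rate = rng.betavariate(1 + v_success, 1 + v_trials - v_success)
        if v_rate > c_rate:
            wins += 1
    return 100.0 * wins / draws


if __name__ == "__main__":
    # Illustrative counts only: 100/1000 for control, 150/1000 for variant.
    print(chance_of_beating((100, 1000), (150, 1000)))
```

In a model like this, prior information such as historical data would typically enter through the prior parameters, which is one way a baseline could let the estimate firm up sooner.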
At the moment, percent chance of beating is only accessible through the Fusion AI API, not through App Insights. Use the metrics endpoint
Note the following best practices regarding statistical significance:
If you peek, do not act – P-values only reach significant levels when there is enough data. This leads to the problem of peeking (when people look at experiment results too early and make incorrect decisions). Wait until an experiment is over before making decisions based on the experiment. The confidence index is intended to encourage this practice.
Do not modify running experiments – To modify an experiment, you have to stop it, and data collection for the experiment stops. This is the intended behavior. You could, however, modify an object that the experiment uses (for example, a query pipeline) while the experiment is running. Doing so makes it unclear what you have been testing, so we recommend against this practice. Instead, stop the first experiment, make the modifications, and then activate (start) an experiment that uses the modified object.