Analyze experiment results
After you have run an experiment in Managed Fusion, you can analyze the results. When you stop an experiment, Managed Fusion runs jobs that calculate metrics for the data collected. Jobs associated with an experiment are prefixed with the name of the experiment in the following format:
<EXPERIMENT-NAME>-<METRIC-NAME>
There are two jobs for the Query Relevance metric:
- <EXPERIMENT-NAME>-groundTruth-<METRIC-NAME>
- <EXPERIMENT-NAME>-rankingMetrics-<METRIC-NAME>
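As an illustration of the naming convention above, a hypothetical helper (not part of any Managed Fusion API) that builds metrics job names:

```python
def metrics_job_name(experiment, metric, query_relevance_stage=None):
    """Build a metrics job name of the form <EXPERIMENT-NAME>-<METRIC-NAME>.

    For the Query Relevance metric, pass query_relevance_stage as
    "groundTruth" or "rankingMetrics" to get the two-job form.
    """
    if query_relevance_stage:
        return f"{experiment}-{query_relevance_stage}-{metric}"
    return f"{experiment}-{metric}"

print(metrics_job_name("homepage-test", "ctr"))
# homepage-test-ctr
print(metrics_job_name("homepage-test", "queryRelevance", "groundTruth"))
# homepage-test-groundTruth-queryRelevance
```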
You can also run metrics jobs manually, which makes Managed Fusion generate experiment metrics while the experiment is still running.
Default schedules for metrics jobs
When you activate an experiment, Managed Fusion schedules metrics jobs for the experiment.
Ground truth metric job
Ground truth is used for the Query Relevance metric.
- First run. Must be run manually and cannot be scheduled.
- Subsequent runs. By default, the job runs every month until the experiment is stopped. You can specify a different schedule.
Other metrics jobs
The run schedules for all other metrics jobs are:
- First run. Occurs 20 minutes after the experiment starts.
- Subsequent runs. By default, the job runs every 24 hours until the experiment is stopped. You can specify a different schedule.
- Last run. Occurs immediately after the experiment is stopped.
Modify metrics jobs schedules
To modify the default schedule for metrics jobs, complete the following:
- Sign in to Managed Fusion and click your application.
- Click Analytics Hub > Experiments.
- In the metric to edit, click Processing Schedule. This link is active even if the experiment is running.
- Edit the schedule as desired.
- Click Save.
Periodic runs of metrics jobs are intended to give you up-to-date metrics. The metrics are always calculated from the beginning of the experiment.
Even with periodically updated metrics, Lucidworks recommends you let an experiment run its course before drawing conclusions and taking action.
Check the last run time for metrics jobs
When you view experiment metrics and statistics, that information reflects the experiment’s state the last time the metrics jobs ran. When you stop an experiment, it is especially important that you verify that the end-of-experiment metrics jobs have run.
To check the last run time:
- Sign in to Managed Fusion and click your application.
- Click Collections > Jobs.
- In the Filter field, enter the experiment name. The Last run value displays for the experiment.
App Insights metrics
Analytics produced by metrics jobs are described in App Insights.
Statistical significance
Statistical significance calculations inform you whether differences among experiment variants are likely to result from random chance, as opposed to real causal effects.
Managed Fusion provides two measures of statistical significance:
- Confidence index. The confidence index expresses the confidence that the experiment results are statistically significant. It takes into account the current sample size of the experiment, the required sample size to accurately establish statistical significance, and the calculated p-value.
- Percent chance of beating. The percent chance of beating uses a Bayesian algorithm to calculate the percent chance that a variant other than the control performs better than the control.
Confidence index
The confidence index expresses the confidence that the experiment results are statistically significant. It gives you a gauge for whether the differences between variants are due to a causal effect as opposed to random chance.
The confidence index combines two concepts: the minimum sample size, and the p-value.
- If the number of samples is lower than the minimum sample size, then the confidence index is based entirely on the percentage of the minimum sample size collected so far.
- If the number of samples is above the minimum sample size, then the confidence index is directly related to the p-value generated using Welch's t-test, which is a variation of the Student's t-test. Welch's t-test performs better than the Student's t-test when samples have unequal variances and/or sample sizes.
The test is a pairwise test, with each comparison being two-tailed (there is no a priori assumption that the difference will be in a specific direction). Managed Fusion compares each variant against the first variant (the control), and generates a p-value for the comparison. The confidence index score is based on the lowest p-value amongst the variants.
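Managed Fusion's implementation is not exposed, but the statistic behind this test can be sketched in pure Python: Welch's t value and the Welch-Satterthwaite degrees of freedom for two samples with possibly unequal variances and sizes. (A p-value would then come from the t distribution's CDF, omitted here.)

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike the Student's t-test, the pooled variance is replaced by the
    sum of the per-sample variance terms, so unequal variances and
    unequal sample sizes are handled correctly.
    """
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)  # sample variance
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    se2 = va / na + vb / nb          # squared standard error of the difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The same calculation is available in SciPy as `scipy.stats.ttest_ind(a, b, equal_var=False)`, which also returns the two-tailed p-value.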
The confidence index is calculated as follows, rounded to the nearest whole number:
CI = 100 * (1-p)
You can recover two digits of the p-value from the confidence index as follows:
p = 1 - CI/100
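Putting the two branches together, here is a minimal sketch of the confidence index. The exact behavior below the minimum sample size is an assumption (a simple linear proportion); above it, the sketch applies the CI = 100 * (1 - p) formula given above.

```python
def confidence_index(n_samples, min_sample_size, p_value):
    """Sketch of the two-branch confidence index described above.

    Below the minimum sample size, the index reflects only how much of
    the required sample has been collected (assumed linear here); above
    it, CI = 100 * (1 - p), rounded to the nearest whole number.
    """
    if n_samples < min_sample_size:
        return round(100 * n_samples / min_sample_size)
    return round(100 * (1 - p_value))

print(confidence_index(500, 1000, 0.2))    # 50: still collecting samples
print(confidence_index(1500, 1000, 0.03))  # 97: based on the p-value
```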
Percent chance of beating
The percent chance of beating uses a Bayesian algorithm to calculate the percent chance that a variant other than the control performs better than the control.
When calculating the percent chance of beating, Managed Fusion uses up to 30 days of historical signal data to establish a baseline to compare against. The baseline is useful but not required. If the historical data is available, an experiment can reach higher confidence numbers more quickly.
Managed Fusion calculates historical metrics one time and stores them, so subsequent runs of the metrics calculation jobs will not need to recalculate them.
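Managed Fusion's exact algorithm is not documented here, but the Bayesian idea can be sketched with a Monte Carlo estimate: model each variant's conversion rate as a Beta posterior over its observed successes and trials, sample both, and count how often the variant's draw exceeds the control's.

```python
import random

def percent_chance_of_beating(control, variant, draws=20000, seed=42):
    """Monte Carlo estimate of the chance a variant beats the control.

    control and variant are (successes, trials) pairs, for example
    (clicks, queries). Each rate gets a Beta(successes + 1,
    failures + 1) posterior; we sample both posteriors and count how
    often the variant's draw is higher. This is an illustrative sketch,
    not Managed Fusion's implementation.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    cs, cn = control
    vs, vn = variant
    wins = 0
    for _ in range(draws):
        p_control = rng.betavariate(cs + 1, cn - cs + 1)
        p_variant = rng.betavariate(vs + 1, vn - vs + 1)
        if p_variant > p_control:
            wins += 1
    return 100.0 * wins / draws

# A variant with a visibly higher conversion rate scores well above 50%.
print(percent_chance_of_beating((40, 1000), (60, 1000)))
```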
Percent chance of beating is only accessible through the Managed Fusion API, not through App Insights. Use the metrics endpoint https://EXAMPLE_COMPANY.b.lucidworks.cloud:<api-port>/api/experiments/experiment-name/metrics, where the API port is 6764.
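For example, the endpoint URL can be assembled as follows. The host and experiment name below are placeholders; substitute your own deployment host and experiment. Only the URL shape given above is assumed.

```python
from urllib.parse import quote

API_PORT = 6764  # API port stated above

def metrics_url(host, experiment_name):
    """Build the experiment-metrics URL for a Managed Fusion deployment."""
    return (f"https://{host}:{API_PORT}/api/experiments/"
            f"{quote(experiment_name)}/metrics")

print(metrics_url("EXAMPLE_COMPANY.b.lucidworks.cloud", "homepage-test"))
# https://EXAMPLE_COMPANY.b.lucidworks.cloud:6764/api/experiments/homepage-test/metrics
```

You can then issue an authenticated GET request against that URL with any HTTP client to retrieve the metrics, including percent chance of beating.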
Best practices
Note the following best practices regarding statistical significance:
- If you peek, do not act. P-values only reach significant levels when there is enough data. This leads to the problem of peeking: looking at experiment results too early and making incorrect decisions based on them. Wait until an experiment is over before making decisions based on the experiment. The confidence index is intended to encourage this practice.
- Do not modify running experiments. To modify an experiment, you must stop it, which also stops data collection for the experiment. This is by design. You could, however, modify some object that the experiment uses (for example, a query pipeline) while the experiment is running, but doing so makes it unclear what you have been testing. We recommend against this practice. Instead, stop the first experiment, make the modifications, and then activate (start) an experiment that uses the modified object.