Experiments API (experimental)

Use the Experiments API to compare different configuration variants and determine which ones are most successful. For example, configure different variants that use different query pipelines or recommendations, then analyze and compare search activity to see which variant best meets your goals.

Experiments allow you to evaluate multiple variants (named sets of configuration parameters) that yield measurable effects. An Overall Evaluation Criterion (OEC) is used to measure the profit value for each variant so they can be compared quantitatively.

Experiment types include A/B testing, multi-armed bandits optimization, evaluating the efficiency of ML models, and measuring search relevance. Experiments can be based on any metrics that are well-defined and computable for a given variant.

Note
This was introduced in Fusion 3.0 and is subject to change.

Experiment life cycle

Created

The experiment and its variants are defined.

There is no in-memory or saved data for any of the variants yet, so the OEC is the same for all variants. The experiment is in an idle state.

Edited

The experiment’s configuration has been modified.

Editing an experiment’s configuration makes sense only before the experiment is started, or after all of its previous data is cleared. Otherwise there’s a risk of mixing data from the old configuration with the new data, which would make the results meaningless.

Started

The experiment is ready for updates and new iterations (draws), consisting of “select variant” and “update” operations.

In the current framework, experiments may be started and stopped multiple times (so the start/stop operations are equivalent to suspend/resume). If the service is restarted, the current state of the experiment is initialized from any previously-saved state, if available.

Read

Current in-memory state is reported as well as the last saved results (if available).

Updated

One or more variants have been updated.

When the experiment is running, users can perform “select variant” and “update variant” operations repeatedly, not necessarily in order (though lack of updates may skew the automatic variant selection for bandit experiments).

Recomputed

The experiment results are recomputed.

The experiment must be stopped before recomputing. When this operation is completed, the in-memory state will be updated to reflect the computed results for each variant (including the OEC), and this state will be persisted (the current implementation uses Fusion’s blob store).

Stopped

The experiment is in an idle state. The experiment will reject any new updates until it’s started again.

Reset

The experiment’s in-memory state is reset to the “zero” state, that is, a state that it would assume on “start” when no previously-saved data is available.

Unlike “cleared” (below), this operation does not affect the previously-saved results.

Cleared

Both the experiment’s in-memory state and previously-saved results are discarded and the experiment is put in its “zero” state.

Implementations may also discard any auxiliary data related to this experiment, such as raw events, logs of interactions, models, and so on.
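The difference between “stopped”, “reset”, and “cleared” can be easier to see in a toy model. The following Python sketch is not Fusion code: the class ExperimentState and its method names are hypothetical, the saved attribute merely stands in for persisted results, and updates use a simple running average. It only mirrors the rules described above: updates are rejected while stopped, recomputing requires a stopped experiment and persists the state, reset leaves saved results alone, and clear discards both.

```python
import copy


class ExperimentState:
    """Toy model of the life cycle: in-memory state plus a saved snapshot."""

    def __init__(self, variant_ids):
        self.variant_ids = list(variant_ids)
        self.running = False          # a newly created experiment is idle
        self.saved = None             # stands in for persisted results
        self.memory = self._zero()    # in-memory state

    def _zero(self):
        # "Zero" state: no draws and no accumulated value for any variant.
        return {v: {"count": 0, "value": 0.0} for v in self.variant_ids}

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def update(self, variant_id, value):
        if not self.running:
            raise RuntimeError("experiment is stopped; updates are rejected")
        arm = self.memory[variant_id]
        arm["count"] += 1
        # Running arithmetic average of all values seen so far.
        arm["value"] += (value - arm["value"]) / arm["count"]

    def recompute(self):
        if self.running:
            raise RuntimeError("stop the experiment before recomputing")
        # Persist a snapshot of the in-memory results.
        self.saved = copy.deepcopy(self.memory)

    def reset(self):
        # Only the in-memory state goes back to zero; saved results survive.
        self.memory = self._zero()

    def clear(self):
        # Both in-memory state and saved results are discarded.
        self.memory = self._zero()
        self.saved = None
```

In this model, calling reset() after recompute() leaves saved intact while memory returns to zero counts, whereas clear() sets saved back to None.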

Experiment types

A/B testing and multi-armed bandits

Multi-armed bandit algorithms are designed to balance the exploration of each variant with the exploitation of the best variants in order to maximize the profit as measured by the OEC. An A/B test is the simplest type of multi-armed bandit experiment, comparing just two variants: A and B.

The algorithm determines which variant to select next for the trial and update. The update after each draw drives the algorithm, so if there are no updates it’s likely that the same variant will be proposed repeatedly. The OEC is calculated in real time and can be reported immediately (the “recompute” step is a no-op), which makes this class of experiment very lightweight.

The simplest setup is to create just two variants (A and B). Variant A is customarily considered the “ground truth”, “gold standard”, or “control” configuration, and variant B is the test configuration or “treatment” to compare with the control configuration.

It is also common to perform A/A testing, that is, create a separate variant with exactly the same configuration as the control. A/A testing is useful for detecting any systemic errors.

The multi-armed bandits implementation allows you to perform both A/B and A/A testing as a part of the same experiment. To do this, create three variants: two variants with exactly the same “control” configuration and a third variant with the “treatment” configuration.

Implemented bandit algorithms

Epsilon Greedy

Explore with probability epsilon and exploit with probability 1 - epsilon.

When epsilon == 0.0, the algorithm uses annealing (automatically decreasing epsilon based on the number of draws so far).

This algorithm explores by selecting from all of the arms at random. It makes one of these random exploratory decisions with probability epsilon, otherwise always selecting the best arm.
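As an illustration, the selection rule can be sketched in a few lines of Python. This is a minimal sketch, not Fusion’s implementation: the function name select_epsilon_greedy, its arguments, and in particular the annealing schedule 1 / (1 + total draws) are assumptions made for the example.

```python
import random


def select_epsilon_greedy(values, counts, epsilon, rng=random):
    """Return the index of the arm (variant) to try next.

    values[i] is the accumulated value of arm i, counts[i] its number of draws.
    """
    if epsilon == 0.0:
        # Annealing: epsilon shrinks as the number of draws grows.
        # This particular schedule is an assumption for illustration only.
        epsilon = 1.0 / (1.0 + sum(counts))
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random.
        return rng.randrange(len(values))
    # Exploit: pick the arm with the highest accumulated value.
    return max(range(len(values)), key=values.__getitem__)
```

With epsilon set to 0.1, roughly one draw in ten is exploratory; with the assumed annealing schedule, the first draws are almost always exploratory and exploration fades as counts accumulate.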

Softmax

This algorithm explores by randomly selecting from all of the available arms with probabilities that are approximately proportional to the estimated value of each of the arms.

If other arms are noticeably worse than the best arm, they’re chosen with very low probability and the algorithm converges quickly to exploit the best variant. If the arms all have similar values, they’re each chosen nearly equally often and the algorithm may never converge.

When its parameter (called temperature) is close to 0.0, there is little randomness; the algorithm almost always selects the best arm (no exploration, only exploitation). As the temperature increases to Inf, it picks arms more randomly, thus increasing exploration at the cost of exploitation.

Typically a value of 0.1 yields good results when one of the arms is clearly better than others. A value of 0.0 causes the algorithm to use annealing, that is, gradually decreasing temperature over time.
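The effect of the temperature can be seen in a short Python sketch. It uses the usual softmax (Boltzmann) weighting, exp(value / temperature), which is an assumption about the exact formula; the function names are hypothetical, and annealing (temperature 0.0) is left out, so the temperature must be greater than zero here.

```python
import math
import random


def softmax_probabilities(values, temperature):
    """Selection probability of each arm; requires temperature > 0."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting probabilities.
    top = max(values)
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def select_softmax(values, temperature, rng=random):
    """Draw one arm index according to the softmax probabilities."""
    probs = softmax_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]
```

For two arms with values 0.2 and 0.8, a temperature of 0.1 gives the better arm a probability above 0.99, while a temperature of 100 makes the two arms almost equally likely.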

UCB1

Upper Confidence Bounds type 1 (UCB1) algorithm.

This algorithm is deterministic and uses no parameters, which makes it much easier to use when the potential outcomes of experiment variants are difficult to predict. However, its accuracy and performance are somewhat lower than those of a well-tuned Epsilon Greedy or Softmax.

UCB1 first ensures that it has played each arm at least once, avoiding the problem of cold start (though it means that you must update it at least as many times as there are arms). Then it selects arms based on their accumulated value and a bonus for the relative lack of knowledge about the arm (the inverse of visit counts for that arm). This results in occasional selection of lesser-known arms with lower values. However, over time a strong preference for the best arm(s) develops.
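A minimal Python sketch of this selection rule follows. It uses the textbook UCB1 bonus, sqrt(2 ln(total draws) / draws of the arm); whether Fusion uses exactly this bonus is an assumption, and the function name select_ucb1 is hypothetical.

```python
import math


def select_ucb1(values, counts):
    """Return the index of the arm to try next (deterministic)."""
    # Cold start: play every arm once before comparing them.
    for arm, count in enumerate(counts):
        if count == 0:
            return arm
    total = sum(counts)

    def score(arm):
        # The bonus grows for arms with few visits relative to the total.
        bonus = math.sqrt(2.0 * math.log(total) / counts[arm])
        return values[arm] + bonus

    return max(range(len(values)), key=score)
```

With equal visit counts the bonuses cancel out and the arm with the highest value wins; an arm visited once among a hundred draws gets a large bonus and is selected even if its value is no better.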

OEC update strategy

Bandit algorithms use the numeric value property in the update payload and combine it with the accumulated value of the variant.

By default, the result is calculated as an arithmetic average of all values seen so far. However, as the number of draws increases to infinity, the impact of the recent updates becomes negligible.

An alternative update strategy, AlphaUpdateStrategy, is also provided. It proportionally decays the older accumulated value using a parameter alpha (ranging from 0.0 to 1.0). When alpha is close to 0.0, new values have a small impact on the accumulated value. When alpha is close to 1.0, the accumulated value is practically equal to the most recent value. Common practice is to use an alpha between 0.1 and 0.25.
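The two strategies can be compared with a small Python sketch. The formulas are assumptions consistent with the description above (an incremental arithmetic mean, and an exponentially weighted blend of old and new values); the function names are hypothetical and are not part of the API.

```python
def average_update(old_value, count, new_value):
    """Arithmetic mean of all values so far.

    count is the number of values including the new one.
    """
    return old_value + (new_value - old_value) / count


def alpha_update(old_value, new_value, alpha):
    """Decay the older accumulated value by (1 - alpha) and blend in the new one."""
    return (1.0 - alpha) * old_value + alpha * new_value
```

After 1000 draws, the averaging strategy moves the accumulated value by only one thousandth of the gap to the new value, whereas with alpha set to 0.2 every update moves it by one fifth of that gap regardless of how many draws came before.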

IR quality metrics based on signals

This experiment type uses the QualityAggregator and signal data from arbitrary Spark DataFrame sources. Each set of signals represents data collected for a variant. These could be collected by external applications and sent to Fusion’s /signals or /index-pipelines endpoints.

Aggregation jobs are executed on collections of signals produced by different variants. The “recompute” operation uses aggregation jobs and related aggregation functions to compute summary statistics, from which it determines the OEC of each variant.

Signals are expected to represent multiple named lists of ranked results, each individual signal containing the following properties:

name

The list’s name. For example, for a click log this is the query_s field.

item

The list’s item. For example, for a click log this is the doc_id_s field that represents the clicked document ID.

rank

The rank of the item in the list. For example, for a click log this is the params.position_s field that represents the (1-based) rank of the clicked document in the list of results for a query.

tag

Optional property to separate signals belonging to different variants.

The source of signals for each variant is provided in the experiment variant’s configuration (as the input property), and it represents a Spark DataFrame source (with format and options properties).
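To make the signal properties concrete, the following Python sketch computes one common IR quality metric, mean reciprocal rank (MRR), per tag from a list of signals with the name, item, rank, and tag properties. It is only an illustration in plain Python rather than Spark: the choice of MRR, the function name, and the sample values are assumptions, and this document does not specify which statistics the QualityAggregator actually computes.

```python
from collections import defaultdict


def mean_reciprocal_rank_by_tag(signals):
    """Return {tag: MRR}, using the best (lowest) rank seen per named list."""
    best = {}
    for s in signals:
        key = (s.get("tag"), s["name"])
        rank = int(s["rank"])          # 1-based rank of the item in the list
        best[key] = min(rank, best.get(key, rank))
    reciprocal = defaultdict(list)
    for (tag, _name), rank in best.items():
        reciprocal[tag].append(1.0 / rank)
    return {tag: sum(rr) / len(rr) for tag, rr in reciprocal.items()}


# Hypothetical click signals for two variants tagged "A" and "B".
signals = [
    {"name": "ipad", "item": "doc1", "rank": 1, "tag": "A"},
    {"name": "iphone", "item": "doc7", "rank": 2, "tag": "A"},
    {"name": "ipad", "item": "doc3", "rank": 4, "tag": "B"},
    {"name": "ipad", "item": "doc9", "rank": 2, "tag": "B"},
]
mrr = mean_reciprocal_rank_by_tag(signals)
```

A per-variant figure like this is the kind of summary statistic from which an OEC could be derived, with a higher value meaning that clicked documents tend to appear nearer the top of the results.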

For this experiment type, the decision about variant selection is usually made elsewhere; in fact, it could be driven by a multi-armed bandits experiment running in parallel. By default, each variant in turn is returned from “select variant” in a round-robin fashion. Updates are usually provided elsewhere, too, for example through the /signals or /index-pipelines endpoints, although the Experiments API provides a pass-through to the /signals endpoint.

Arbitrary aggregations

This experiment type uses regular aggregation configurations.

Each experiment variant is configured with an existing aggregation ID. When the “recompute” action is invoked, the respective aggregation job is executed, and its summary statistics are used for the OEC calculation.

Machine learning pipelines

In this experiment type, each variant corresponds to a different configuration of a machine learning pipeline. Variant selection could be external or driven by multi-armed bandits.

Models built and updated using each variant’s configuration are stored in Fusion’s blob store, and only a reference to their location is stored in the variant’s configuration (although for the purpose of updating they may all need to be loaded in memory).

Integration with query pipelines and index pipelines

Experiment Query Stage

This query stage lets you select a variant from a running experiment and inject its properties into the current PipelineContext. These properties can then be used by other query pipeline stages.

The Solr Index Stage uses these properties as overrides for request parameters. The properties that this stage sets (including the experiment ID and variant ID) are also returned in the response, and can be further propagated to external applications, which can then include these IDs in feedback to the experiment.

Experiment Update Stage

This indexing stage provides a way to apply updates sent to /index-pipelines or /signals directly to the experiment. For experiment types such as multi-armed bandits, this method offers a much quicker turnaround than using aggregations.

Input documents are checked for the presence of experiment ID / variant ID and the value to be used for updating the experiment. By default these documents are silently discarded after processing, but the stage can be configured to forward them down the pipeline to the next stages.