Aggregator Scripting

The aggregation jobs that process signals can be customized using JavaScript. There are several script options, and each one is executed at a different point in the aggregation process. The objects available to a script vary by stage, as explained for each option below.

Scripts are run after the main logic of the class they are customizing. This allows overriding default behavior of the class if needed.

The scripts are defined in the aggregation job using the Signals Aggregator API, with the 'params' property. Here is an example of an aggregator definition that uses the specialFields option:

{
  "id" : "r1",
  "signalTypes" : [ "click" ],
  "selectQuery" : "*:*",
  "timeRange" : "[* TO NOW]",
  "params" : {
    "specialFields" : "unless_pos_gt_1_ss,when_pos_lt_3_ss"
  }
}

Note:

In many cases, the scripts defined will be executed many times during the aggregation job (i.e., for every event). For this reason, it’s good practice to keep the scripts as simple as possible to avoid a negative impact on system performance. The initScript option includes a "_context" object that can be used for storing values that may require lengthy initialization or heavy computation.

initScript

A script defined with this option is executed when the signal aggregator instance (i.e., the specific aggregator job) is initialized. The following objects are available to the script:

  • logger: an SLF4J Logger object.

  • aggregator: the aggregator instance.

  • initArgs: the initialization arguments.

  • _context: the current scripting context. This can be used for storing small objects between executions of other scripted methods.
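As an illustration of using _context for one-time setup, here is a minimal sketch of an initScript body. The stand-in objects at the top exist only so the sketch runs outside the aggregator. The ignoredQueries argument is hypothetical, and treating _context as an object that accepts property assignment is an assumption, not something the description above confirms.

```javascript
// Stand-ins so the sketch runs outside the aggregator (assumptions, not the real objects).
var logger = { info: function (msg) { console.log(msg); } }; // mimics an SLF4J Logger's info(String)
var initArgs = { ignoredQueries: "test, ping ,  *:*" };       // hypothetical initialization argument
var _context = {};                                            // assumed to accept property assignment

// Hypothetical initScript body: do the setup once and keep the result in _context,
// so that scripts which run for every event only need a cheap lookup.
function runInitScript() {
  var ignored = {};
  var count = 0;
  initArgs.ignoredQueries.split(",").forEach(function (raw) {
    var query = raw.trim();
    if (query.length > 0 && !ignored[query]) {
      ignored[query] = true;
      count += 1;
    }
  });
  _context.ignoredQueries = ignored;
  _context.ignoredCount = count;
  logger.info("initScript: cached " + count + " ignored queries");
}

runInitScript();
```

Later scripts could then test _context.ignoredQueries[someQuery] instead of re-parsing the argument on every event.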

startScript

A script defined with this option is executed when a new tuple is about to be aggregated. All of the objects available to initScript are available to startScript, plus:

  • type: the aggregation type, which is a string. Currently only the 'click' type is supported.

  • aggregationTime: the reference point from which the aggregation is calculated, expressed in epoch time as an integer.

  • currentTuple: a map of field names and values for the current tuple being aggregated.
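A startScript is a natural place to reset per-tuple state. The sketch below assumes currentTuple can be read like a plain object and that aggregationTime is in milliseconds; the field names query_s and doc_id_s are hypothetical, and the stand-in objects exist only so the sketch runs on its own.

```javascript
// Stand-ins (assumptions): in a real job the aggregator supplies these objects.
var logger = { debug: function (msg) { console.log(msg); } }; // mimics SLF4J Logger.debug(String)
var _context = {};
var type = "click";
var aggregationTime = 1500000000000;                        // epoch time; milliseconds assumed
var currentTuple = { query_s: "ipad", doc_id_s: "doc42" };  // hypothetical grouping fields

// Hypothetical startScript body: reset per-tuple state before any events arrive.
function runStartScript() {
  _context.tupleState = {
    key: currentTuple.query_s + "|" + currentTuple.doc_id_s,
    startedAt: aggregationTime,
    eventsSeen: 0
  };
  logger.debug("startScript: " + type + " tuple " + _context.tupleState.key);
}

runStartScript();
```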

aggregateScript

A script defined with this option is executed when a new event is being processed for the current tuple. All of the objects available to initScript and startScript are available, plus:

  • event: the current event for aggregation. This is a SolrDocument.

  • result: the aggregated result so far. This will also contain the original tuple fields. This is a SolrDocument.

If this script is present, it overrides the default logic for processing events. This means that the script must completely process the events as desired; it’s not possible to build on existing rules. Note also that defining an aggregateScript will override any options defined as specialFields, described below.

A script can emit more than one aggregation result for a given group of source events, as in the following snippet:

var doc = $.prepareResult();
$.emit(doc);

Here, "$" is a reference to the current instance of the aggregation function. The prepareResult method finishes the calculations of some of the more complex functions (e.g., topK, percentiles, correlation) and updates the result PipelineDocument. After this method is called, the current "result" document is discarded and a new PipelineDocument is created to hold the results of aggregating the events that follow; the returned document can’t be used for incremental calculations. For example, given a set of events sorted by timestamp, a script could extract the month part of each event’s date and emit one aggregated result per month within the current tuple defined by groupingFields.

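The per-month example can be sketched as follows. Only the $.prepareResult() and $.emit(doc) calls come from the description above; everything else is an assumption made for illustration: the timestamp_dt field name, reading the timestamp as an ISO-8601 string, the stub "$" object, and keeping the current month in _context. The sketch also assumes that the result for the last month is returned by the aggregator's normal end-of-tuple handling.

```javascript
// Stand-ins (assumptions) so the sketch runs outside the aggregator.
var emitted = [];
var _context = { currentMonth: null };
var $ = {
  // The real prepareResult() finalizes the current result; this stub only tags the month.
  prepareResult: function () { return { month: _context.currentMonth }; },
  emit: function (doc) { emitted.push(doc); }
};

// Builds a stub with the same getFieldValue(name) accessor that SolrDocument offers.
function makeEvent(timestamp) {
  return {
    getFieldValue: function (name) { return name === "timestamp_dt" ? timestamp : null; }
  };
}

// Hypothetical aggregateScript body, called once per event (events sorted by timestamp).
function onEvent(event) {
  var month = String(event.getFieldValue("timestamp_dt")).substring(0, 7); // "YYYY-MM"
  if (_context.currentMonth !== null && month !== _context.currentMonth) {
    // The month changed: close out the previous month's result and emit it.
    var doc = $.prepareResult();
    $.emit(doc);
  }
  _context.currentMonth = month;
  // Custom per-event aggregation into "result" would go here.
}

["2017-01-05T10:00:00Z", "2017-01-20T08:30:00Z", "2017-02-02T12:00:00Z", "2017-03-01T09:15:00Z"]
  .forEach(function (ts) { onEvent(makeEvent(ts)); });
```
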
finishScript

A script defined with this option is executed when all of the events for the current tuple have been processed and it’s time to return the aggregated result. All of the objects available to initScript, startScript and aggregateScript are available, plus:

  • result: the final aggregated result. This is a SolrDocument.
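As a hedged sketch, a finishScript might add derived fields to the final result. The stub below mimics only the getFieldValue and setField accessors of SolrDocument; the field names count_d, is_popular_b and aggregated_at_l and the threshold of 10 are hypothetical.

```javascript
// Stub with the two SolrDocument accessors the sketch relies on (stand-in, not the real class).
function makeDoc(fields) {
  return {
    getFieldValue: function (name) {
      return Object.prototype.hasOwnProperty.call(fields, name) ? fields[name] : null;
    },
    setField: function (name, value) { fields[name] = value; }
  };
}

var result = makeDoc({ count_d: 12.0, query_s: "ipad" }); // hypothetical aggregated fields
var aggregationTime = 1500000000000;                      // epoch time; milliseconds assumed

// Hypothetical finishScript body: annotate the final result before it is returned.
function runFinishScript() {
  var count = result.getFieldValue("count_d");
  result.setField("is_popular_b", count !== null && count >= 10);
  result.setField("aggregated_at_l", aggregationTime);
}

runFinishScript();
```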

specialFields

The value of this option is a comma-separated, whitespace-separated, or JSON list of field names that are exempt from the default processing logic. These fields are not processed in any way, which means they are not included in the aggregated result.

If an aggregateScript has been defined, it will be used instead of this option.
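To show that the three accepted forms describe the same list, here is a small illustrative parser. It is not the aggregator's own parsing code; parseSpecialFields is a hypothetical helper written only for this example.

```javascript
// Hypothetical helper: normalizes any of the three list forms into an array of field names.
function parseSpecialFields(value) {
  var text = String(value).trim();
  if (text.charAt(0) === "[") {
    return JSON.parse(text); // JSON list form
  }
  // Comma-separated and whitespace-separated forms (mixed separators are tolerated here).
  return text.split(/[\s,]+/).filter(function (name) { return name.length > 0; });
}

var fromCommas = parseSpecialFields("unless_pos_gt_1_ss,when_pos_lt_3_ss");
var fromSpaces = parseSpecialFields("unless_pos_gt_1_ss when_pos_lt_3_ss");
var fromJson = parseSpecialFields('["unless_pos_gt_1_ss", "when_pos_lt_3_ss"]');
```

All three calls yield the same two field names.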

The default processing logic is as follows:

  • skip any fields declared in specialFields;

  • skip the event ID field (id);

  • if the field value is a Number, then sum up all values as a Double;

  • if field name ends with '_s' or '_dt', retain only the first value and discard all other values (these dynamic fields are single-value only);

  • otherwise add all values as-is to the result.
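The rules above can be sketched over plain objects as follows. This is an illustration of the listed rules applied in the listed order, not the aggregator's implementation; applyDefaultLogic is a hypothetical name, and events and results are modeled as plain objects rather than SolrDocuments.

```javascript
function has(obj, name) {
  return Object.prototype.hasOwnProperty.call(obj, name);
}

// Folds one event into the running result, following the listed rules in order.
function applyDefaultLogic(result, event, specialFields) {
  Object.keys(event).forEach(function (name) {
    if (specialFields.indexOf(name) !== -1) return; // skip fields declared in specialFields
    if (name === "id") return;                      // skip the event ID field
    var values = Array.isArray(event[name]) ? event[name] : [event[name]];
    values.forEach(function (value) {
      if (typeof value === "number") {
        // Numbers are summed (JavaScript numbers are already doubles).
        result[name] = (typeof result[name] === "number" ? result[name] : 0) + value;
      } else if (/(_s|_dt)$/.test(name)) {
        // Single-valued dynamic fields keep only the first value seen.
        if (!has(result, name)) result[name] = value;
      } else {
        // Everything else is appended as-is.
        if (!Array.isArray(result[name])) result[name] = has(result, name) ? [result[name]] : [];
        result[name].push(value);
      }
    });
  });
  return result;
}

var aggregated = {};
var special = ["skip_ss"];
applyDefaultLogic(aggregated, { id: "e1", count_i: 2, query_s: "ipad", tags_ss: ["a"], skip_ss: ["x"] }, special);
applyDefaultLogic(aggregated, { id: "e2", count_i: 3, query_s: "iphone", tags_ss: ["b", "c"], skip_ss: ["y"] }, special);
```

After both events, count_i holds the sum, query_s keeps its first value, tags_ss collects every value, and neither id nor skip_ss appears in the result.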

halfLife

This option defines a time period, in milliseconds, for the half-life decay formula. This formula is used when determining boost values for clicked documents: documents that have not been clicked for a longer period of time will not receive as high a boost as documents that have been clicked more recently.

The default value is equivalent to 30 days (i.e., 2,592,000,000 ms).
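The exact formula is not spelled out here. The sketch below assumes the standard exponential half-life form, in which a weight is halved for every halfLife milliseconds of age; decayedWeight and its parameters are hypothetical names.

```javascript
var DEFAULT_HALF_LIFE_MS = 30 * 24 * 60 * 60 * 1000; // 30 days = 2,592,000,000 ms

// Assumed form: weight * 0.5 ^ (age / halfLife), with age clamped at zero.
function decayedWeight(weight, eventTimeMs, referenceTimeMs, halfLifeMs) {
  var age = Math.max(0, referenceTimeMs - eventTimeMs);
  return weight * Math.pow(0.5, age / halfLifeMs);
}

var now = 1500000000000; // arbitrary reference time in epoch milliseconds
var fresh = decayedWeight(1.0, now, now, DEFAULT_HALF_LIFE_MS);
var oneHalfLife = decayedWeight(1.0, now - DEFAULT_HALF_LIFE_MS, now, DEFAULT_HALF_LIFE_MS);
var twoHalfLives = decayedWeight(1.0, now - 2 * DEFAULT_HALF_LIFE_MS, now, DEFAULT_HALF_LIFE_MS);
```

Under this assumption, a click that is one half-life old counts half as much as a click made at the reference time, and a click two half-lives old counts a quarter as much.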

weightScript

A script defined with this option is used when weighting the current event. The script must evaluate to a numeric value. The following additional objects are available to it:

  • event: the current event for aggregation. This is a SolrDocument.

  • result: the aggregated result so far. This will also contain the original tuple fields. This is a SolrDocument.

  • eventFlag: the flag that indicates if the event is the result of a previous aggregation ("aggr") or is a new event ("event").

  • eventTime: the timestamp of the event.

  • eventWeight: the initial event weight, expressed as a float.

  • defaultWeight: the default weight (if the script fails or the eventWeight is not entered properly), expressed as a float.

  • position: the click position. The position is 0-based; it is 0 if no position is available. The data is retrieved from the 'params_position_s' field of the event.
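A weighting rule built from these objects might look like the following sketch. The rule itself is hypothetical: it falls back to defaultWeight for an unusable eventWeight, leaves previously aggregated events untouched, and gives new clicks at deeper positions a logarithmic boost. It is wrapped in a function only so it can run outside the aggregator; in a real weightScript the last expression would be the value.

```javascript
// Hypothetical weighting rule expressed as a function of the documented objects.
function computeWeight(eventFlag, eventWeight, defaultWeight, position) {
  if (typeof eventWeight !== "number" || !isFinite(eventWeight)) {
    return defaultWeight;                // unusable weight: fall back to the default
  }
  if (eventFlag === "aggr") {
    return eventWeight;                  // already aggregated: keep the weight as-is
  }
  // New event: position 0 keeps the weight, deeper positions get a modest boost.
  return eventWeight * (1 + Math.log(1 + position));
}

var topClick = computeWeight("event", 1.0, 0.1, 0);
var deepClick = computeWeight("event", 1.0, 0.1, 4);
var previouslyAggregated = computeWeight("aggr", 2.5, 0.1, 4);
var badWeight = computeWeight("event", NaN, 0.1, 2);
```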