Signals Aggregator API

The Signals Aggregator API is used to aggregate signal events, which allows faster querying for recommendations. To use recommendations, signals need to be recorded and then aggregated.

When signals are enabled for a collection, two system-level collections are created. The first is named collection_signals, where collection is the name of the sibling collection; signal events are indexed to this collection. The second is named collection_signals_aggr and is the default location for aggregated signal events. See Signals API for more information on how to index signal events.
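
The naming convention above can be sketched in a few lines of Python. The helper name signal_collections is hypothetical and exists only for illustration; it simply applies the _signals and _signals_aggr suffixes described in this section.

```python
from typing import Tuple


def signal_collections(collection: str) -> Tuple[str, str]:
    """Return (raw signals collection, default aggregated signals collection).

    Hypothetical helper: it only applies the suffixes described in the text.
    """
    raw = f"{collection}_signals"
    aggregated = f"{raw}_aggr"
    return raw, aggregated


raw_name, aggr_name = signal_collections("demo")
print(raw_name)   # demo_signals
print(aggr_name)  # demo_signals_aggr
```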

The aggregation process creates tuples for the fields selected when creating the aggregator job. A default tuple is applied if none is specified.
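
As a conceptual illustration of grouping by tuple, the sketch below counts raw events per unique combination of the default grouping fields (doc_id_s, query_s, filters_s). It is not the server's implementation; the function group_events and the sample events are hypothetical.

```python
from collections import Counter

# Default tuple used when no grouping fields are specified.
DEFAULT_GROUPING_FIELDS = ("doc_id_s", "query_s", "filters_s")


def group_events(events, grouping_fields=DEFAULT_GROUPING_FIELDS):
    """Count raw events per unique tuple of grouping-field values."""
    return Counter(
        tuple(event.get(field) for field in grouping_fields) for event in events
    )


# Hypothetical raw click events, for illustration only.
events = [
    {"doc_id_s": "doc1", "query_s": "laptop", "filters_s": "brand:acme"},
    {"doc_id_s": "doc1", "query_s": "laptop", "filters_s": "brand:acme"},
    {"doc_id_s": "doc2", "query_s": "laptop", "filters_s": "brand:acme"},
]

counts = group_events(events)
for key, count in counts.items():
    print(key, count)
```

Two events share the same tuple and collapse into one group, while the third forms a group of its own.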

The aggregation process can remove the raw signals if desired, or keep them for other aggregation jobs.

Signals Aggregator Definition Properties

Parameter Description

id
Optional

A unique identifier for this aggregator job.

groupingFields
Optional

The fields that define unique tuples. The fields list is defined as a JSON array, with commas between each field name.

If a set of fields is not defined, then a default tuple 'doc_id_s','query_s','filters_s' will be used.

signalTypes
Optional

The types of signals to aggregate. The type list is defined as a JSON array, with commas between each type.

The types must be existing types used for events in your signals collection.

aggregator
Optional

The name of the aggregator implementation.

If it is not defined, this defaults to click, which is an implementation optimized for aggregating signals based on user clicks. Aggregated records from this implementation include a 'weight_d' field, which can be used to boost clicked documents.

If you are not aggregating user click events, you can choose simple. This implementation does not add a 'weight_d' field to each record.

A third option, special, is described in more detail on the Aggregator Scripting page.

selectQuery
Optional

Any query to identify signal events.

timeRange
Optional

A valid range query to select events to aggregate.

sort
Optional

Specifies ordering of raw signal events within an aggregation.

The default ordering is by event id ("id asc"). It can be set to use other fields with standard Solr sort expressions, e.g. "timestamp_dt asc". Multiple criteria are separated by commas, e.g. "type_s asc,timestamp_dt desc".

Note: sorting by "id asc" is always appended as the last sort criterion in order to break ties.

outputPipeline
Optional

The name of a pipeline to use for processing aggregated events.

outputCollection
Optional

The collection in which to store the aggregated events.

rollupPipeline
Optional

The pipeline to use for rollups.

rollupAggregator
Optional

The name of the aggregator implementation to use for rollups.

sourceRemove
Optional

If true, then signal events that have been aggregated will be removed from the index.

The default is false.

sourceCatchup
Optional

If true, the original time range of the aggregation will be modified to span only the period since the last successful aggregation.

The default is false.

outputRollup
Optional

If true (the default), an additional aggregation step is executed after the source data aggregation to roll up the new aggregates with the old aggregates that exist in the output collection for the same aggregation type.

aggregates
Optional

A list of aggregation functions. Since it’s possible to pass side-effects from one function to a later function in the list, the functions should be declared in the desired order of execution.

The available aggregator functions are described in more detail in the section Aggregator Functions.

params
Optional

The params property allows you to define aggregation job parameters.

The most common use of this property is to define JavaScript scripts to customize the aggregator behavior. See the section Aggregator Scripting for more details.
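
The tie-breaking rule described for the sort property can be illustrated as follows. This is only a sketch of the documented behavior, not the server's code; the function name effective_sort is hypothetical.

```python
def effective_sort(sort=None):
    """Return the sort expression with "id asc" appended as the final criterion.

    Hypothetical helper that mirrors the documented rule: the tie-breaker
    is always added last, whether or not a sort was supplied.
    """
    criteria = [part.strip() for part in (sort or "").split(",") if part.strip()]
    criteria.append("id asc")
    return ",".join(criteria)


print(effective_sort())                                # id asc
print(effective_sort("timestamp_dt asc"))              # timestamp_dt asc,id asc
print(effective_sort("type_s asc,timestamp_dt desc"))  # type_s asc,timestamp_dt desc,id asc
```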

Note that for large aggregation definitions, you can save the desired properties in a JSON-formatted file and upload it with cURL’s -d parameter, using the @ prefix to read the request body from the file (for example, -d @aggregation.json).
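
One way to prepare such a file is sketched below. The file name aggregation.json and its location in the system temporary directory are assumptions made for illustration; the definition itself reuses the properties from the first example in this section.

```python
import json
import os
import tempfile

# Aggregator definition using properties documented above.
definition = {
    "id": "1",
    "signalTypes": ["click"],
    "aggregates": [
        {"type": "count", "sourceFields": ["id"], "targetField": "count_d"}
    ],
}

# Hypothetical file name and location, chosen only for this sketch.
path = os.path.join(tempfile.gettempdir(), "aggregation.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(definition, handle, indent=2)

# Print the cURL command that would upload the file; "@" makes -d read from a file.
print(
    "curl -u user:pass -X POST -H 'Content-Type: application/json' "
    f"-d @{path} http://localhost:8764/api/apollo/aggregator/aggregations"
)
```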

No output is returned when creating or updating an aggregator job.

When a job is listed, the properties returned are the same as the possible properties when defining a job.

Examples

Create an aggregator job for the click type of signals, with an aggregate function that provides counts by the id field:

REQUEST

curl -u user:pass -X POST -H 'Content-Type: application/json' -d '{"id":"1", "signalTypes":["click"], "aggregates":[{"type":"count", "sourceFields":["id"], "targetField": "count_d"}]}' http://localhost:8764/api/apollo/aggregator/aggregations

RESPONSE

None.
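
For readers scripting against the API, the same request can be assembled with Python's standard library. This is a minimal sketch, assuming the host, port, and credentials from the cURL example; build_create_request is a hypothetical helper, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8764/api/apollo"  # taken from the cURL example


def build_create_request(definition, user, password):
    """Build (but do not send) the POST request that creates an aggregator job."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{BASE_URL}/aggregator/aggregations",
        data=json.dumps(definition).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",  # equivalent of curl -u user:pass
        },
    )


request = build_create_request(
    {
        "id": "1",
        "signalTypes": ["click"],
        "aggregates": [
            {"type": "count", "sourceFields": ["id"], "targetField": "count_d"}
        ],
    },
    "user",
    "pass",
)
print(request.get_method(), request.full_url)
# To send it against a running server: urllib.request.urlopen(request)
```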

Update the properties for aggregator job '1', including all the original properties plus the ones we want to add or change:

REQUEST

curl -u user:pass -X PUT -H 'Content-Type: application/json' -d '{"signalTypes":["click"], "timeRange":"[NOW/-1 TO NOW]", "aggregates":[{"type":"count", "sourceFields":["id"], "targetField": "count_d"}]}' http://localhost:8764/api/apollo/aggregator/aggregations/1

RESPONSE

None.
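
Because an update must carry the original properties as well as the changes, it can help to derive the PUT body from the existing definition. The sketch below assumes a hypothetical helper, build_update_body; it drops id from the body only to mirror the example above, where the id appears in the URL instead.

```python
import json


def build_update_body(current, changes):
    """Merge changes into the current definition to form a full PUT body.

    The "id" key is omitted to match the documented example, in which the
    job id is part of the URL rather than the request body.
    """
    body = {key: value for key, value in current.items() if key != "id"}
    body.update(changes)
    return body


current_definition = {
    "id": "1",
    "signalTypes": ["click"],
    "aggregates": [
        {"type": "count", "sourceFields": ["id"], "targetField": "count_d"}
    ],
}

update_body = build_update_body(current_definition, {"timeRange": "[NOW/-1 TO NOW]"})
print(json.dumps(update_body))
```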

List the properties for aggregator job '1':

REQUEST

curl -u user:pass http://localhost:8764/api/apollo/aggregator/aggregations/1

RESPONSE

{
  "id" : "1",
  "groupingFields" : [ ],
  "signalTypes" : [ "click" ],
  "timeRange" : "[NOW/-1 TO NOW]",
  "sourceRemove" : false,
  "sourceCatchup" : false,
  "outputRollup" : false,
  "aggregates" : [ {
    "type" : "count",
    "sourceFields" : [ "id" ],
    "targetField" : "count_d",
    "params" : { }
  } ],
  "params" : { }
}

Start job '1' on the 'demo_signals' collection:

REQUEST

curl -u user:pass -X POST http://localhost:8764/api/apollo/aggregator/jobs/demo_signals/1

RESPONSE

The following output has been truncated to omit the aggregation job definition and only shows the other job properties that are returned on start.

{
  "signals" : {
    "types" : [ "click" ],
    "stats" : { }
  },
  "state" : "running",
  "job_id" : "4d69ec73358b41d38caf1eb3b378809e",
  "aggregation_time_date" : "2014-09-11T16:39:58.347Z",
  "aggregation" : {
    "id" : "r1",
    "groupingFields" : [ "doc_id_s", "query_s", "filters_s" ],
    "signalTypes" : [ "click" ],
    "selectQuery" : "*:*",
...
  "output_collection" : "bestbuy_signals_aggr",
  "NOW" : 1410453598347,
  "NOW_date" : "2014-09-11T16:39:58.347Z",
  "collection" : "bestbuy_signals",
  "aggregation_time" : 1410453598347,
  "compound_id" : "bestbuy_signals:r1"
}
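
The start URL and the compound_id in the response both combine a collection name with a job id. The following sketch shows how these strings appear to be formed, based only on the examples in this section; the helper names are hypothetical.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8764/api/apollo"  # taken from the cURL examples


def job_start_url(collection, job_id):
    """URL to POST to in order to start an aggregator job on a collection."""
    return f"{BASE_URL}/aggregator/jobs/{quote(collection, safe='')}/{quote(job_id, safe='')}"


def compound_id(collection, job_id):
    """Identifier in the 'collection:job' form seen in the example responses."""
    return f"{collection}:{job_id}"


def history_item_url(collection, job_id):
    """URL for the history of one job, keyed by its compound id."""
    return f"{BASE_URL}/history/aggregator/items/{quote(compound_id(collection, job_id), safe=':')}"


print(job_start_url("demo_signals", "1"))
print(compound_id("demo_signals", "1"))
print(history_item_url("demo_signals", "1"))
```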

See the list of aggregator job items:

REQUEST

curl -u user:pass http://localhost:8764/api/apollo/history/aggregator/items

RESPONSE

[ "demo_signals:1" ]

Get the history of job "demo_signals:1":

REQUEST

curl -u user:pass http://localhost:8764/api/apollo/history/aggregator/items/demo_signals:1

RESPONSE

{
  "events" : [ {
    "start" : "2014-04-16T20:45:16.582Z",
    "end" : "2014-04-16T20:45:16.781Z",
    "source" : "demo_signals:1",
    "type" : "run",
    "status" : "ok",
    "details" : {
      "signals" : {
        "click" : {
          "state" : "finished",
          "raw" : 2,
          "aggr_type_s" : "click",
          "aggr_class" : "com.lucidworks.apollo.service.aggregation.ClickSignalAggregator",
          "aggregated" : 2
        }
      },
      "state" : "finished",
      "job_id" : "467bc0db-a9c9-4b48-8080-439958818907",
      "aggregation_time_date" : "2014-04-16T20:45:16.556Z",
      "aggregation" : {
        "id" : "1",
        "fields" : [ "doc_id_s", "query_s", "filters_s" ],
        "types" : [ "click" ],
        "select" : "*:*",
        "range" : "[* TO NOW]",
        "remove" : false,
        "rolling" : false,
        "params" : { },
        "anyAggr" : false
      },
      "NOW" : 1397681116556,
      "commit" : "done",
      "NOW_date" : "2014-04-16T20:45:16.556Z",
      "collection" : "demo_signals",
      "aggregation_time" : 1397681116556,
      "compound_id" : "demo_signals:1"
    },
    "error" : null
  } ]
}
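
A history response like the one above can be condensed into a short per-run summary. The sketch below is one possible approach, assuming the response has already been parsed from JSON; summarize_history is a hypothetical helper, and the sample dictionary is a trimmed copy of the response shown above.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # timestamp format used in the response


def summarize_history(history):
    """Return one summary row per event: status, duration, and signal counts."""
    rows = []
    for event in history["events"]:
        start = datetime.strptime(event["start"], TIME_FORMAT)
        end = datetime.strptime(event["end"], TIME_FORMAT)
        signals = event["details"]["signals"]
        rows.append(
            {
                "source": event["source"],
                "status": event["status"],
                "duration_ms": (end - start) // timedelta(milliseconds=1),
                "raw": sum(stats["raw"] for stats in signals.values()),
                "aggregated": sum(stats["aggregated"] for stats in signals.values()),
            }
        )
    return rows


# Trimmed copy of the example response above.
history = {
    "events": [
        {
            "start": "2014-04-16T20:45:16.582Z",
            "end": "2014-04-16T20:45:16.781Z",
            "source": "demo_signals:1",
            "type": "run",
            "status": "ok",
            "details": {
                "signals": {
                    "click": {"state": "finished", "raw": 2, "aggregated": 2}
                }
            },
            "error": None,
        }
    ]
}

summary = summarize_history(history)
for row in summary:
    print(row)
```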