Aggregations

Signals are most useful when they are aggregated into a set of summaries that can be used to enrich the search experience through recommendations and boosting. When signals are enabled for a "primary" collection, a <primarycollectionname>_signals collection and a <primarycollectionname>_signals_aggr collection are created automatically.

Note
Aggregations must be configured from the collection that contains the signals data (usually <primarycollectionname>_signals), not the primary (<primarycollectionname>) collection. You can find the _signals collection by navigating to and expanding your original collection to display its system collections.

The Example Aggregation with Sample Data walks through the signals example distributed with Fusion.

Aggregation Pipelines

Aggregated events are indexed through a default pipeline named "aggr_rollup". This pipeline contains a single stage, a Solr Indexer stage, which indexes the aggregated events.

You can create your own custom index pipeline to process aggregated events differently if you choose.

Aggregation Functions

The section Aggregator Functions documents the available set of aggregation functions.

Custom aggregation functions can be defined via a JavaScript stage. The options described in Aggregator Scripting provide more detail on the objects available for scripts.

Aggregation properties

The aggregation process is specified by an aggregation type consisting of the following list of properties:

Name              Description
id                Aggregation ID
groupingFields    List of signal field names
signalTypes       List of signal types
aggregator        Symbolic name of the aggregator implementation
selectQuery       Query string, default *:*
sort              Ordering of aggregated signals
timeRange         String specifying time range, e.g., [* TO NOW]
outputPipeline    Pipeline ID for processing aggregated events
outputCollection  Output collection name
rollupPipeline    Rollup pipeline ID
rollupAggregator  Name of the aggregator implementation used for rollups
sourceRemove      Boolean, default is false
sourceCatchup     Boolean, default is true
outputRollup      Boolean, default is true
aggregates        List of aggregation functions
params            Arbitrary parameters to be used by specific aggregator implementations

EventMinerAggregator

This aggregator type produces synthetic co-occurrence documents for predefined fields, based on co-occurring data in events in the same session.

This aggregator assumes that events have the following fields:

  • timestamp_tdt - event timestamp

  • user_id_s - user ID

  • query_s - (optional) query that is related to this event

  • doc_id_s - (optional) document ID that is related to this event

A session is then defined as a series of events created by a given user_id_s with timestamps falling within a session timeout limit (which is configurable via aggregator parameters).
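To make the session definition concrete, here is a minimal Python sketch of sessionization. It is not the aggregator's actual implementation. It assumes that the timeout is measured from the first event of a session and that timestamps are UTC strings such as 2015-03-01T10:00:00Z; the function names parse_ts and sessionize are hypothetical.

```python
from datetime import datetime


def parse_ts(value):
    # Assumed timestamp format: UTC, second precision, e.g. "2015-03-01T10:00:00Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def sessionize(events, max_session_time=300):
    """Split events into per-user sessions.

    Assumption: a new session starts when an event is more than
    max_session_time seconds after the first event of the user's
    current session.
    """
    sessions = []
    current = {}  # user_id_s -> (session start time, list of events)
    ordered = sorted(
        events, key=lambda e: (e["user_id_s"], parse_ts(e["timestamp_tdt"]))
    )
    for event in ordered:
        user = event["user_id_s"]
        ts = parse_ts(event["timestamp_tdt"])
        session = current.get(user)
        if session is None or (ts - session[0]).total_seconds() > max_session_time:
            session = (ts, [])
            current[user] = session
            sessions.append(session[1])
        session[1].append(event)
    return sessions


events = [
    {"user_id_s": "u1", "timestamp_tdt": "2015-03-01T10:00:00Z", "query_s": "tennis"},
    {"user_id_s": "u1", "timestamp_tdt": "2015-03-01T10:04:00Z", "doc_id_s": "2548405"},
    {"user_id_s": "u1", "timestamp_tdt": "2015-03-01T10:06:00Z", "doc_id_s": "2938114"},
    {"user_id_s": "u2", "timestamp_tdt": "2015-03-01T10:01:00Z", "query_s": "golf"},
]
sessions = sessionize(events, max_session_time=300)
```

With a 300-second limit, the third event for u1 falls outside the first session and opens a new one, so the four events end up in three sessions.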

Example document produced by this aggregator:

{
  "id": "29d2c95e48154a4ebdc03158b0dd7875-25170",
  "entity_type_s": "doc_id_s",
  "entity_id_s": "2548405",
  "co_occurring_docIds_ss": [
    "2938114"
  ],
  "co_occurring_docIds_counts_is": [2],
  "in_queries_ss": [
    "tennis",
    "nicktoons kinect",
    "virtua tennis 4"
  ],
  "in_queries_counts_is": [3, 1, 1],
  "in_session_ids_ss": [
    "f2de89f97e1638614795d3b03c50d5b1",
    "eae6242c6b58526d2a039d4bd95a85b6",
    "d87d264a528fd7ba162c20209bd3ca8a"
  ],
  "in_session_ids_counts_is": [1, 1, 1],
  "in_user_ids_ss": [
    "4c79f9ebcf7d50ba5d25a3fca0343929ff05c822",
    "2943dea5408b6f947bbd42aa28f9d8006bfb366a",
    "3fa43872d2275d5e6463ff2de1d95dca51d0003f"
  ],
  "in_user_ids_counts_is": [1, 1, 1]
}

This example illustrates the following:

  • This is a synthetic co-occurrence document for an entity of type doc_id_s, with entity ID "2548405" (that is, a document with this ID). Other similar documents are produced for sessions, queries and users as the primary entity ID.

  • The in_queries_ss field contains queries that led to this document (in this case, since events were click-throughs, these are the queries that resulted in clicks on a link to this document).

  • The co_occurring_docIds_ss field shows documents that users also clicked within the same search session (thus implicitly indicating that they are related to this one).

  • The in_user_ids_ss and in_session_ids_ss fields contain the user IDs and session IDs, respectively, from which the click events for this document originated.

This document can be viewed as a row in an N x N co-occurrence matrix, with the dimensions being (in this case) doc_id_s, user_id_s, and query_s. Alternatively, it can be viewed as a vertex of a given type (e.g., doc_id_s) in a graph, with the fields representing edges to other vertices of different types.
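As an illustration of the graph view, the following Python sketch turns one co-occurrence document into a list of weighted edges. The mapping from field names to vertex types is an assumption made for this example (in particular the "session_id" label), and the edges function is hypothetical rather than part of Fusion.

```python
# Assumed mapping from field-name prefix to the type of the vertex it points at.
FIELD_TO_VERTEX_TYPE = {
    "co_occurring_docIds": "doc_id_s",
    "in_queries": "query_s",
    "in_user_ids": "user_id_s",
    "in_session_ids": "session_id",  # hypothetical label for session vertices
}


def edges(doc):
    """Return (source vertex, target vertex, weight) tuples for one document."""
    source = (doc["entity_type_s"], doc["entity_id_s"])
    result = []
    for prefix, vertex_type in FIELD_TO_VERTEX_TYPE.items():
        values = doc.get(prefix + "_ss", [])
        counts = doc.get(prefix + "_counts_is", [])
        # Values and counts are parallel arrays: counts[i] belongs to values[i].
        for value, count in zip(values, counts):
            result.append((source, (vertex_type, value), count))
    return result


doc = {
    "entity_type_s": "doc_id_s",
    "entity_id_s": "2548405",
    "co_occurring_docIds_ss": ["2938114"],
    "co_occurring_docIds_counts_is": [2],
    "in_queries_ss": ["tennis", "nicktoons kinect"],
    "in_queries_counts_is": [3, 1],
}
doc_edges = edges(doc)
```

Each tuple links the document vertex to a related document or query, with the count serving as the edge weight.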

The EventMinerAggregator accumulates this co-occurrence data from events in sessions for each entity type, and outputs co-occurrence documents whenever its internal LRU cache becomes full. It also flushes all remaining entries from the cache at the end of a job.

This helps the aggregator to limit the total memory budget, while keeping a reasonably long context of co-occurring entities.
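The cache behavior can be sketched in Python as follows. This is an illustration under assumptions, not the aggregator's code: it assumes that a full cache evicts and emits its least recently used entry one at a time, and the class name CoOccurrenceCache and its emit callback are hypothetical.

```python
from collections import OrderedDict


class CoOccurrenceCache:
    """Bounded LRU cache of per-entity co-occurrence counts."""

    def __init__(self, max_cache_size, emit):
        self.max_cache_size = max_cache_size
        self.emit = emit  # called with (key, counts) for every evicted entry
        self.entries = OrderedDict()  # (entity type, entity ID) -> {field: {value: count}}

    def add(self, entity_type, entity_id, field, value):
        key = (entity_type, entity_id)
        if key in self.entries:
            # Mark the entry as most recently used.
            self.entries.move_to_end(key)
        else:
            if len(self.entries) >= self.max_cache_size:
                # Cache is full: output the least recently used entry.
                old_key, old_counts = self.entries.popitem(last=False)
                self.emit(old_key, old_counts)
            self.entries[key] = {}
        counts = self.entries[key].setdefault(field, {})
        counts[value] = counts.get(value, 0) + 1

    def flush(self):
        # End of job: output everything that is still cached.
        while self.entries:
            key, counts = self.entries.popitem(last=False)
            self.emit(key, counts)


emitted = []
cache = CoOccurrenceCache(2, lambda key, counts: emitted.append((key, counts)))
cache.add("doc_id_s", "A", "in_queries_ss", "tennis")
cache.add("doc_id_s", "B", "in_queries_ss", "golf")
cache.add("doc_id_s", "A", "in_queries_ss", "tennis")
cache.add("doc_id_s", "C", "in_queries_ss", "chess")  # evicts B
cache.add("doc_id_s", "B", "in_queries_ss", "golf")   # evicts A, B starts over
cache.flush()
```

In this run entity B is emitted twice, once on eviction and once at the final flush, which is how a small cache leads to several partial documents for the same entity.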

The size of this internal cache is adjustable via a configuration parameter. A side effect of this approach is that multiple partial co-occurrence documents may be produced if related entities occur in a longer context than the LRU cache is able to handle. Consumers of the aggregated data should be prepared to roll up these partial documents.
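A consumer might roll up partial documents along the following lines. This Python sketch assumes that each "<name>_ss" field is paired with a parallel "<name>_counts_is" field, as in the example document, and that counts for the same value can simply be summed; the roll_up function is hypothetical, and it does not carry over the original "id" values.

```python
from collections import Counter


def roll_up(partial_docs):
    """Merge partial co-occurrence documents that describe the same entity."""
    merged = {}  # (entity type, entity ID) -> {field: Counter of value counts}
    for doc in partial_docs:
        key = (doc["entity_type_s"], doc["entity_id_s"])
        fields = merged.setdefault(key, {})
        for field, values in doc.items():
            if not field.endswith("_ss"):
                continue
            counts_field = field[: -len("_ss")] + "_counts_is"
            # Assumption: a missing counts field means one occurrence per value.
            counts = doc.get(counts_field, [1] * len(values))
            counter = fields.setdefault(field, Counter())
            for value, count in zip(values, counts):
                counter[value] += count

    rolled = []
    for (entity_type, entity_id), fields in merged.items():
        doc = {"entity_type_s": entity_type, "entity_id_s": entity_id}
        for field, counter in fields.items():
            ranked = counter.most_common()  # highest counts first
            doc[field] = [value for value, _ in ranked]
            doc[field[: -len("_ss")] + "_counts_is"] = [count for _, count in ranked]
        rolled.append(doc)
    return rolled


partials = [
    {
        "entity_type_s": "doc_id_s",
        "entity_id_s": "2548405",
        "in_queries_ss": ["tennis", "virtua tennis 4"],
        "in_queries_counts_is": [3, 1],
    },
    {
        "entity_type_s": "doc_id_s",
        "entity_id_s": "2548405",
        "in_queries_ss": ["virtua tennis 4", "nicktoons kinect"],
        "in_queries_counts_is": [2, 1],
    },
]
rolled = roll_up(partials)
```

The two partial documents collapse into one, with the counts for "virtua tennis 4" added together.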

Aggregation job configuration

Set groupingFields to user_id_s only. Optionally, set the "sort" parameter to timestamp_tdt asc so that the sessionization process works most efficiently. Sorting by timestamp requires more work on the Solr side, so it may be omitted, with the possible side effect that additional partial documents are created.

EventMinerAggregator configuration parameters

These parameters are passed in the "params" property of the aggregator configuration:

  • maxSessionTime: Optional integer, default value: 300. This specifies the maximum duration (in seconds) of a session, that is, a series of actions by the same user within a given time period. Keep this value fairly small, because events that occur close together in time are more likely to be related than those further apart.

  • maxElementsPerField: Optional integer, default value: 10. Configures the maximum number of elements to store in each field of the aggregated documents.

  • maxCacheSize: Optional integer, default value: 25000. This controls the maximum size of the internal LRU (least recently used) cache. Depending on the volume of data being processed, smaller values result in more partial documents being created, while larger values lead to higher memory usage during the aggregation run.

Example Configuration

{
    "id": "event-miner-aggregation",
    "groupingFields": [
        "user_id_s"
    ],
    "signalTypes": [ ],
    "selectQuery": "*:*",
    "sort": "timestamp_tdt asc",
    "timeRange": "[* TO NOW]",
    "aggregator": "em",
    "sourceRemove": false,
    "sourceCatchup": false,
    "outputRollup": false,
    "aggregates": [ ],
    "params": {
        "maxSessionTime": 600,
        "maxElementsPerField": 20,
        "maxCacheSize": 10000
    }
}

Notes:

  • The input documents will be grouped together based on their user_id_s values.

  • The selectQuery is set to *:*, which matches all values in all fields, so every record is included in the initial set of input records for processing.

  • The timeRange specifies all records with a timestamp up to NOW, where the latter will be set by the system to the time the aggregation job starts.

  • The value for aggregator is set to em, which is a short label for the "EventMinerAggregator".

  • The settings in the params section are described above under EventMinerAggregator configuration parameters.
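If such configurations are generated by a script, a small helper can fill in the documented defaults and produce valid JSON. The Python sketch below is only an illustration: the event_miner_config function is hypothetical, the check that parameter values are positive integers is an assumption, and how the resulting JSON is submitted to Fusion is outside its scope.

```python
import json

# Default values as documented for the EventMinerAggregator parameters.
DEFAULT_PARAMS = {
    "maxSessionTime": 300,
    "maxElementsPerField": 10,
    "maxCacheSize": 25000,
}


def event_miner_config(aggregation_id, sort_by_time=True, **params):
    """Build an EventMinerAggregator job configuration as a dict."""
    unknown = set(params) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(sorted(unknown)))
    for name, value in params.items():
        # Assumption: all three parameters must be positive integers.
        if not isinstance(value, int) or value <= 0:
            raise ValueError(name + " must be a positive integer")
    config = {
        "id": aggregation_id,
        "groupingFields": ["user_id_s"],
        "signalTypes": [],
        "selectQuery": "*:*",
        "timeRange": "[* TO NOW]",
        "aggregator": "em",
        "sourceRemove": False,
        "sourceCatchup": False,
        "outputRollup": False,
        "aggregates": [],
        "params": {**DEFAULT_PARAMS, **params},
    }
    if sort_by_time:
        # Optional: helps sessionization, at the cost of extra work in Solr.
        config["sort"] = "timestamp_tdt asc"
    return config


config = event_miner_config(
    "event-miner-aggregation",
    maxSessionTime=600,
    maxElementsPerField=20,
    maxCacheSize=10000,
)
config_json = json.dumps(config, indent=4)
```

Serializing with json.dumps avoids hand-editing mistakes such as a missing comma between parameters.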