Signals

A signal is a recorded event related to one or more documents in a collection. Signals can record any kind of event that is useful to your organization. Click signals are the most common type, because clicking is the most common action a user takes with an item. You can also define other signal types, such as "addToCart" and "purchase".

Using a sufficiently large collection of signals, Fusion can automatically generate recommendations such as these:

  • Based on the user’s search query, which items are most likely to interest them?

  • Based on the user’s similarity to other users, which additional items are likely to interest them?

Signals are indexed in a secondary collection which is linked to the primary collection by the naming convention <primarycollectionname>_signals. So, if your main collection is named products, the associated signals collection is named products_signals. The signals collection is created automatically when signals are enabled for the primary collection. Signals are enabled by default whenever a new collection is created.
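
The naming convention can be expressed as a small helper. This is only an illustrative sketch: the function name is hypothetical, and the _signals_aggr suffix is taken from the collection names mentioned elsewhere on this page.

```python
def signals_collection_names(primary: str) -> dict:
    """Derive the names of the signals collections linked to a primary collection."""
    return {
        "signals": f"{primary}_signals",            # raw signals
        "signals_aggr": f"{primary}_signals_aggr",  # aggregated signals
    }


print(signals_collection_names("products"))
```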

Signals are indexed just like ordinary documents. The signals collection can be searched like any other collection, for example by using the Query Workbench with the signals collection selected.

App Insights provides visualizations and reports with which to analyze your signals. App Insights mainly uses raw signals, but also uses some aggregated signals. Currently only the signal types Request, Response and Click are supported within the App Insights dashboards.

Note
The signals schema changed in Fusion 4.0. See the descriptions of signals types and structure below.

Enabling and disabling signals

You can enable and disable signals using the Fusion UI or the REST API.

Tip
When you disable signals, the aggregation jobs are deleted, but the _signals and _signals_aggr collections are not; your legacy signal data remains intact.

Using the UI

When you create a collection using the Fusion UI, signals are enabled and a signals collection is created by default. You can also enable and disable signals for existing collections using the Collections Manager.

Enable signals for a collection
  1. In the Fusion workspace, navigate to Collections > Collections Manager.

  2. Hover over the primary collection for which you want to enable signals.

  3. Click Configure to open the drop-down menu.

  4. Click Enable Signals.

    The Enable Signals window appears, with a list of collections and jobs that are created when you enable signals.

  5. Click Enable Signals.

Disable signals for a collection
  1. In the Fusion workspace, navigate to Collections > Collections Manager.

  2. Hover over the primary collection for which you want to disable signals.

  3. Click Configure to open the drop-down menu.

  4. Click Disable Signals.

    The Disable Signals window appears, with a list of the jobs that were created when signals were enabled and that are deleted when you disable signals.

  5. Click Disable Signals.

    Your _signals and _signals_aggr collections remain intact so that you can access your legacy signals data.

Using the Collection Features API

Using the API, the /collections/{collection}/features/{feature} endpoint enables or disables signals for any collection:

Check whether signals are enabled for a collection:

curl -u user:pass http://localhost:8764/api/collections/<collection-name>/features/signals

Enable signals for a collection:

curl -u user:pass -X PUT -H "Content-type: application/json" -d '{"enabled" : true}' http://localhost:8764/api/collections/<collection-name>/features/signals

Disable signals for a collection:

curl -u user:pass -X PUT -H "Content-type: application/json" -d '{"enabled" : false}' http://localhost:8764/api/collections/<collection-name>/features/signals
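
The same calls could be scripted from Python. The sketch below only builds the PUT request without sending it; the helper name is hypothetical, the host and port are copied from the curl examples, and authentication (the -u user:pass part of the curl commands) is deliberately left out.

```python
import json
import urllib.request

FUSION_API = "http://localhost:8764/api"  # host and port taken from the curl examples


def signals_feature_request(collection: str, enabled: bool) -> urllib.request.Request:
    """Build (but do not send) the PUT request that toggles the signals feature."""
    url = f"{FUSION_API}/collections/{collection}/features/signals"
    body = json.dumps({"enabled": enabled}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = signals_feature_request("products", True)
print(req.get_method(), req.full_url)
```

To actually send the request, you would add credentials and pass the object to urllib.request.urlopen.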

Signals data flow

This diagram shows the flow of signals data from the search app through Fusion AI. The numbered steps are explained below.

Signals data flow

  1. The search app sends a query to a Fusion query pipeline.

    The query request should include user ID and session query parameters to identify the user.

  2. Optionally, the Fusion query pipeline queries the _signals_aggr collection to get boosts for the main query based on aggregated click data.

  3. The search app also sends a request signal to the Fusion /signals endpoint.

    The primary intent of a request signal is to capture the raw user query and contextual information about the user’s current activity in the app, such as the user agent and the page where they generated the query. The request signal does not contain any information about the results returned by Solr, because it is created before the query is processed.

  4. Once Solr returns the response to Fusion, the SearchLogger component indexes the complete request/response data into the _signals collection as a response signal using the _signals_ingest pipeline. Therefore, the response signal captures all results from Fusion as they relate to the original query.

    Note
    This is a departure from pre-4.0 versions of Fusion where query impressions were logged in a separate _logs collection. Query activity is no longer indexed into the _logs collection. All response signals use the fusion_query_id (see below) as the unique document ID in Solr.
  5. When the user clicks a link in the search results, the search app sends a click event to the Fusion signals endpoint (which invokes the _signals_ingest pipeline behind the scenes).

    The click signal must include a field named fusion_query_id in the params object of the raw click signal. The fusion_query_id field is returned in the query response (from step 1) in a response header named x-fusion-query-id. This allows Fusion to associate a click signal with the response signal generated in step 4. The fusion_query_id is also used by Fusion to associate click signals with experiments. For experiments to work, each click signal must contain the corresponding fusion_query_id that produced the document/item that was clicked.

  6. The _signals_ingest pipeline enriches signals before indexing into the _signals collection.

    This enrichment includes field mapping, geolocation resolution, and, through the Update Related Document index stage, setting the has_clicks flag to "true" on a request signal when the first click signal for that request is encountered.

  7. Fusion’s App Insights queries the _signals collection through a Fusion query pipeline to generate query analytics reports from raw signals.

    Note that the App Insights app uses Fusion security for authentication.

  8. Behind the scenes, the SQL aggregation framework aggregates click signals to compute a weight for each query + doc_id + filters group.

    The resulting metrics are saved to the _signals_aggr collection to generate boosts on queries to the main collection (step 2 above).

  9. Recommendations also use aggregated documents in the _signals_aggr collection to build a collaborative filtering-based recommender model.
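
The link between steps 1 and 5 can be sketched from the search app's point of view. The helper below is hypothetical: it reads the x-fusion-query-id header from a query response and copies it into the params object of a new click signal. Field names follow the click signal example on this page; the function name and the choice of which fields to pass in are assumptions.

```python
import time
import uuid


def click_signal_from_response(response_headers, doc_id, query, user_id, session):
    """Build a raw click signal linked to the query that produced the clicked result."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    headers = {name.lower(): value for name, value in response_headers.items()}
    fusion_query_id = headers.get("x-fusion-query-id")
    if fusion_query_id is None:
        raise ValueError("query response has no x-fusion-query-id header")
    return {
        "id": str(uuid.uuid4()),
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "type": "click",
        "params": {
            "fusion_query_id": fusion_query_id,
            "doc_id": doc_id,
            "query": query,
            "user_id": user_id,
            "session": session,
        },
    }


click = click_signal_from_response(
    {"X-Fusion-Query-Id": "ABkaEA11"}, "9502308", "tiger woods", "admin", "b3a15101"
)
print(click["params"]["fusion_query_id"])
```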

Signals types and structure

Signals can be broadly categorized as implicit or explicit. When signals are enabled, Fusion produces several built-in signal types by default, all of which are implicit signals. You can also create custom signal types, including explicit signals. Be sure to verify that your signals include all of the important fields for best results. It’s also useful to rank your signal types in terms of how strongly each type indicates a user’s interest in an item.

Implicit signals vs explicit signals

Signals can reveal a user’s level of interest in an item in two main ways:

  • Implicit

    The user shows interest by engaging with the item/document through clicks, searches, and so on. Since this type of interaction requires no additional effort on the user’s part, these types of signals tend to be plentiful. They can be used to infer a measurable value of interest in order to build an accurate recommender system.

  • Explicit

    An explicit signal is created when a user intentionally assigns a clear, measurable value to an item, such as by giving it a rating. This value can be used to rank items, for example. Since this requires the user to invest extra time to provide the information, the number of ratings tends to be small compared to the total number of users interacting with the item.

You can create recommendations based on implicit signals out of the box. For recommenders based on explicit signals, contact your Lucidworks Professional Services representative.

Built-in signal types

There are three built-in signal types:

Request signals

A request signal is generated by a front-end search app and captures the raw user query and other contextual information about a user and their journey through the search app. A request signal should have the following fields:

[
  {
    "id":"288fe4f7-6680-403e-8d18-27647cdd9989",
    "timestamp":1518717749409,
    "type":"request",
    "params":{
      "user_id":"admin",
      "session":"ef4e00cd-91bb-45b4-be80-e81f9f9c5b27",
      "query":"USER QUERY HERE",
      "app_id":"SEARCH APP ID",
      "ip_address":"0:0:0:0:0:0:0:1",
      "host":"Lucids-MacBook-Pro-5.local",
      "filter":[
        "field1/value",
        ...
      ],
      "filter_field":[
        "field1"
      ]
    }
  }
]
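
A search app might assemble such a signal with a helper like the one below. This is a sketch under assumptions: the function name is hypothetical, ip_address and host are omitted for brevity, and deriving filter_field from the part of each filter before the "/" is inferred from the examples on this page rather than from a documented rule. The optional argument is meant for the extra App Insights fields, which also belong inside params.

```python
import json
import time
import uuid


def build_request_signal(query, user_id, session, app_id, filters=(), optional=None):
    """Assemble a raw request signal; optional fields are merged into params."""
    params = {
        "user_id": user_id,
        "session": session,
        "query": query,
        "app_id": app_id,
    }
    if filters:
        # Filters use the "field/value" form; filter_field lists the distinct field names.
        params["filter"] = list(filters)
        params["filter_field"] = sorted({f.split("/", 1)[0] for f in filters})
    params.update(optional or {})
    return {
        "id": str(uuid.uuid4()),
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "type": "request",
        "params": params,
    }


signal = build_request_signal(
    "ipad", "admin", "ef4e00cd-91bb-45b4-be80-e81f9f9c5b27", "SEARCH APP ID",
    filters=["type/Game"],
    optional={"page_title": "Fusion Search", "path": "/search"},
)
print(json.dumps([signal], indent=2))  # the examples wrap signals in a JSON array
```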

Additional optional fields are used by App Insights. In the raw signal, optional fields should be inside the params object. Optional fields are as follows:

"page_title":"Fusion Search",
"path":"/search",
"browser_type":"Browser",
"browser_version":"64.0.3282.140",
"browser_name":"Chrome",
"referrer":"http://localhost:8080/",
"ctx_prev_uri":"/",
"ctx_prev_query":"",
"ctx_prev_path":"/",
"os_manufacturer":"Apple Inc.",
"os_name":"Mac OS X",
"os_id":"778",
"os_device":"Computer",
"os_group":"Mac OS X"

Response signals

Response signals are automatically generated by a query pipeline when the signals feature is enabled for a collection.

Note
Front-end search applications should not send response signals to Fusion directly, as those would conflict with the auto-generated signals.

A response signal has the following explicit fields, plus any additional query parameters sent by the search application for a query:

Field Name | Description | Example
---------- | ----------- | -------
id | The x-fusion-query-id generated by the query pipeline; used for associating click signals with queries in experiments and aggregation jobs | TwWCn3Dz
type | Signal type | response
response_type | Used by Insights to determine whether this query had results or was empty | results or empty
session | User session ID; the search app should pass the session ID in the query params for a query | UUID
query | The actual query string sent to Solr from Fusion | ipad
query_orig_s | The incoming query from the search app before it is enriched by the query pipeline | ipad
query_id | A hash generated from the session, query, and filters fields; used as a rollup key in Insights to group activity by a specific query | SHA1 hash
filters_s | Filter queries sent to Solr; the Fusion SearchLogger component combines multiple fq parameters into a single value delimited by " $ " | {!tag=format}format:(vhs) $ {!tag=type}type:(movie)
filter | Reformatted filter queries for use by App Insights | field1/value
user_id | User ID; the search app should pass the user_id in the query params | admin
doc_ids_s | A comma-delimited list of document IDs returned for the page of results; this field is used by Fusion Spark jobs, such as the ground truth job, to perform click/skip analysis | 123,456,789
pipeline_id | Fusion query pipeline that processed this query | _system
collection | Fusion collection | my_collection
qtime | Query time from Solr, in milliseconds | 10
rows | Number of rows requested for this query | 10
hits | Total number of documents matching the query | 10000
totaltime | Total processing time of this query, in milliseconds; includes Solr qtime and Fusion query processing time | 15
timestamp_tdt | Timestamp when the query request was received by Fusion | 2018-02-15T18:17:42.560Z
res_offset | Offset of results; this field is used by experiment metrics to calculate MRR | 0
params.* | Any other query param sent from the search app to Fusion that was not already mapped to a declared field | params.defType_s=edismax

Fusion’s experiment framework relies heavily on response signals and on the linking between response and click signals using the fusion_query_id.

Click signals

Click signals are sent from the search app to Fusion. All click signals should include a fusion_query_id field pulled from the query response header x-fusion-query-id. In addition, click signals should include the following fields:

[
  {
    "id":"SOME UUID HERE",
    "timestamp":1518725351750,
    "type":"click",
    "params":{
      "fusion_query_id":"ABkaEA11",
      "user_id":"admin",
      "session":"b3a15101-9e30-4e28-8a23-d1f663c2ee06",
      "query":"tiger woods",
      "ctype":"result",
      "res_offset":0,
      "filter":[
        "type/Game"
      ],
      "ip_address":"0:0:0:0:0:0:0:1",
      "host":"Lucids-MacBook-Pro-5.local",
      "doc_id":"9502308",
      "app_id":"SEARCH APP ID",
      "res_pos":1,
      "filter_field":[
        "type"
      ]
    }
  }
]

Additional optional fields are used by App Insights. In the raw signal, optional fields should be inside the params object. Optional fields are as follows:

"browser_type":"Browser",
"browser_version":"64.0.3282.140",
"browser_name":"Chrome",
"referrer":"http://localhost:8080/",
"ctx_prev_uri":"/",
"ctx_prev_query":"",
"ctx_prev_path":"/",
"os_manufacturer":"Apple Inc.",
"os_name":"Mac OS X",
"os_id":"778",
"os_device":"Computer",
"os_group":"Mac OS X",
"url":"http://localhost:8080/#/product/9502308",
"label":"Tiger Woods PGA Tour 09 All-Play - Nintendo Wii"

Custom signal types

The signal type parameter can also take arbitrary values for custom signal types. For example, you can create special signals for purchase events, cart addition/subtraction events, "favorite" or "like" events, customer service events, and so on.

To collect custom signals, configure your front-end search application to send signals to Fusion using a custom value for the type field. Custom signals should also include the fields described below in order to get the best results from aggregation and recommendation jobs.

To use custom signals in recommendations, you must add them to the value of the signalTypeWeights parameter in the configuration for the _user_item_preferences_aggregation job and the _user_query_history_aggregation job.

Custom signals can be analyzed in App Insights just like pre-defined signal types.

Important fields for signals

The jobs that aggregate signals and generate recommendations work best when all of the following fields are present in your signals:

Field Name | Example Value | Description
---------- | ------------- | -----------
count_i | 1 | Number of times an interaction event occurred with this item
doc_id | NMDDV | Product ID or item ID
id | 68f66808-6bfc-4d73-95f7-8a558529160b | The signal ID. If no ID is supplied, one is automatically generated.
query | xwearabletech | A query string from the user
session_id | 91aa66d11af44b6c90ccef44d055cf9a | ID of the session in which the user generated the signal
type | quick_view_click | Type of interaction the user had with the platform (the signal type)
user_id | 11506893 | ID of the user during the session
timestamp_tdt | | Timestamp of the signal

Some signal types, including custom signal types, may include additional fields.

Signal field count analysis

Lucidworks recommends performing signal field count analysis to determine whether any of the fields above are missing from some of your signals.

The table below shows how to query for specific fields using the Query Workbench in order to compare the number of results for each field with the total number of documents in the signals collection. In the examples in the third column, some fields appear in all 33,477,919 signals documents, while others appear in fewer documents.

Field name | Query | Example number of documents
---------- | ----- | ---------------------------
ALL | *:* | 33,477,919
count_i | count_i:[* TO *] | 11,101,165
doc_id | doc_id:[* TO *] | 23,216,297
id | id:[* TO *] | 33,477,919
query | query:[* TO *] | 19,724,598
session_id | session_id:[* TO *] | 11,101,165
type | type:[* TO *] | 33,477,919
user_id | user_id:[* TO *] | 26,117,399
timestamp_tdt | timestamp_tdt:[* TO *] | 26,117,399

You can also get the number of signals documents that contain all of the required fields by using the following query:

+count_i:[* TO *] +doc_id:[* TO *] +id:[* TO *] +query:[* TO *] +type:[* TO *] +user_id:[* TO *] +timestamp_tdt:[* TO *] +session_id:[* TO *]
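
Once you have the counts, turning them into coverage percentages makes the gaps easier to see. The sketch below is illustrative only: the helper names are hypothetical, and the counts are a subset of the example numbers from the table above, not real measurements.

```python
def field_presence_query(field: str) -> str:
    """Solr range query matching documents that have any value for the field."""
    return f"{field}:[* TO *]"


def coverage(counts: dict, total: int) -> dict:
    """Fraction of signals documents that contain each field."""
    return {field: count / total for field, count in counts.items()}


# Example counts copied from the table above.
total = 33_477_919
counts = {
    "count_i": 11_101_165,
    "doc_id": 23_216_297,
    "query": 19_724_598,
    "user_id": 26_117_399,
}

for field, share in coverage(counts, total).items():
    print(f"{field_presence_query(field):28} {share:6.1%}")
```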

The query_id field

For each incoming signal, Fusion calculates a value for the query_id field, which App Insights uses to create group-by-query reports like the one shown below:

Facet filters applied report

Note
The query_id field should not be confused with the fusion_query_id, which is a unique ID for each query processed by a Fusion query pipeline, or with query_s, which is the query string.

To calculate the value, Fusion creates a hash based on session, query, and filter fields, then saves it into the query_id field.
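
To make the idea concrete, here is a minimal sketch of such a rollup key. It is not Fusion's implementation: SHA-1 is suggested only by the "SHA1 hash" example in the response signal table, and the separator and the sorting of filters are assumptions made for this illustration.

```python
import hashlib


def compute_query_id(session: str, query: str, filters: list) -> str:
    """Illustrative rollup key: SHA-1 over session, query, and sorted filters."""
    # Assumption: the "|" separator and the sorting are choices made for this
    # sketch; Fusion's actual normalization may differ.
    parts = [session, query] + sorted(filters)
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


print(compute_query_id("ef4e00cd", "ipad", ["format/VHS", "type/Movie"]))
```

Because the session is part of the key, the same query and filters issued in two different sessions produce two different values.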

The filter field can either be passed in by the search app, or computed by the SignalFormatterStage (the first stage in the _signals_ingest pipeline) using the raw filter queries. For instance, on a response signal that is generated by a query pipeline, the following fq query params get translated into the multi-valued filter field:

  • Raw query parameters:

    fq={!tag=format}format:(VHS)&fq={!tag=type}type:(Movie)

  • filters_s field (created by the SearchLogger component):

    {!tag=format}format:(vhs) $ {!tag=type}type:(movie)

  • filter field:

    "filter":["format/VHS", "type/Movie"]

App Insights uses the filter field to generate various reports.
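
If your search app computes the filter field itself, the translation from raw fq parameters could look roughly like this. The regular expression is an assumption that covers only the simple field:(value) form shown in the example above; it is not how the SignalFormatterStage is implemented, and the function name is hypothetical.

```python
import re
from urllib.parse import parse_qs

# Optional {!...} local params, then field:(value) or field:value.
FQ_PATTERN = re.compile(r"(?:\{![^}]*\})?(\w+):\(?(.*?)\)?")


def filters_from_query_string(query_string: str) -> list:
    """Translate fq parameters into "field/value" entries."""
    filters = []
    for fq in parse_qs(query_string).get("fq", []):
        match = FQ_PATTERN.fullmatch(fq)
        if match:  # skip filter queries this simple pattern cannot parse
            field, value = match.groups()
            filters.append(f"{field}/{value}")
    return filters


raw = "fq={!tag=format}format:(VHS)&fq={!tag=type}type:(Movie)"
print(filters_from_query_string(raw))
```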

Signal type ranking

When you have defined some custom signal types, it is useful to rank them according to how strongly they indicate a user’s interest in an item. You do not need to exclude any signal types from the main signals collection, but you can exclude some from signal aggregations in order to focus on the most important signal types when generating recommendations.

How to get the list of signal types
  1. In the Fusion UI, select your signals collection.

  2. Open the Query Workbench by navigating to Query > Query Workbench.

  3. Click Add a field facet.

  4. Select the type field.

    The list of signal types appears in the facet panel:

    Signal types
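
The same list can be requested with standard Solr facet parameters instead of the UI. The sketch below only builds the query string; the function name is hypothetical, and which endpoint you send the parameters to (for example, a query pipeline on the signals collection) depends on your deployment.

```python
from urllib.parse import urlencode


def signal_type_facet_params(limit: int = 100) -> str:
    """Query string that facets on the type field without returning documents."""
    params = [
        ("q", "*:*"),            # match every signal
        ("rows", 0),             # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "type"),
        ("facet.limit", limit),
    ]
    return urlencode(params)


print(signal_type_facet_params())
```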

The default signals index pipeline

When indexing signals, Fusion uses a default index pipeline named _signals_ingest unless you specify a different index pipeline.

The _signals_ingest index pipeline has five stages:

  1. Format Signals stage

  2. Field Mapping stage

  3. GeoIP Lookup stage

  4. Solr Indexer stage

  5. Update has_clicks flag stage

    The Update has_clicks flag stage is an instance of the Update Related Document stage that updates the has_clicks flag to "true" on an existing request signal after the first click signal is processed for the request.

    Update Related Documents stage configuration

    The update stage works as follows:

    1. The stage acts only when a click signal is encountered (type == click).

    2. Look at the incoming click signal for a field named request_id_s, which gets set by the Format Signals stage using a distributed cache of recently processed request signals.

      If the request_id_s field is set, then send a real-time GET (RTG) query to Solr to find a request signal with ID equal to the value of the request_id_s field on the click signal. The RTG query also filters on has_clicks==false, which avoids duplicate atomic updates on the same request signal in Solr. Real-time GET is used to avoid timing issues between when a request signal is sent to Solr and when it is committed. This prevents missed updates when clicks occur soon after the search app sends the initial request signal.

    3. If the click signal does not have the request_id_s field set, then do a normal Solr lookup for the request signal using: +query_id:"${query_id}" +type:request +has_clicks:false. A click signal may not have a request_id_s if there is a cache miss in the distributed cache used by the Format Signals stage.

    4. If the stage performs a normal query, there may be multiple request signals that have the same query_id. This is because the query_id is based on session + query + filter, so if a user sends the same query + filter more than once during the same session, there will be multiple request signals with the same query_id value. Thus, the stage sorts the matches and updates only the latest request signal.

    5. If a related document is found (in this case a request signal), then the stage updates the has_clicks field to true and performs an atomic update in Solr.

    This stage performs its work in a background thread, so it does not impact the indexing performance of the click signal.
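
The decision logic of the steps above can be sketched against an in-memory list of signals. This is a simplified model, not the stage's code: the function names are hypothetical, plain dictionaries stand in for Solr documents, and sorting by a timestamp field to find the latest request signal is an assumption.

```python
def find_request_to_update(click: dict, signals: list):
    """Return the request signal whose has_clicks flag should be set, or None."""
    if click.get("type") != "click":
        return None
    # Only request signals that have not been flagged yet are candidates.
    candidates = [
        s for s in signals
        if s.get("type") == "request" and not s.get("has_clicks", False)
    ]
    request_id = click.get("request_id_s")
    if request_id is not None:
        # Direct lookup by ID (the stage uses a real-time GET for this).
        return next((s for s in candidates if s["id"] == request_id), None)
    # Fallback: match on query_id and take the most recent request signal.
    matches = [s for s in candidates if s.get("query_id") == click.get("query_id")]
    return max(matches, key=lambda s: s["timestamp"], default=None)


def apply_click(click: dict, signals: list) -> bool:
    """Set has_clicks on the related request signal; report whether one was updated."""
    target = find_request_to_update(click, signals)
    if target is None:
        return False
    target["has_clicks"] = True
    return True


store = [
    {"id": "r1", "type": "request", "query_id": "q", "timestamp": 1, "has_clicks": False},
    {"id": "r2", "type": "request", "query_id": "q", "timestamp": 2, "has_clicks": False},
]
print(apply_click({"type": "click", "query_id": "q"}, store))
```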

Deleting old signals

Signals are not automatically deleted by default, and over time they occupy an increasing amount of storage space.

To avoid running out of storage space as a result of your growing collection of signals, you must decide on a signals retention policy, then configure and schedule a REST Call job that regularly deletes old signals.

The duration for which signals should be kept depends on your use case and your organization’s policies. For example, in some cases signals could be deleted after 30 days, while in other cases they may remain useful for a year, or even forever.

How to configure a REST Call job to delete old signals
  1. Navigate to Collections > Jobs.

  2. Click Add and select REST Call.

    The REST Call job configuration panel appears.

  3. Enter an arbitrary ID for this job, such as "Delete-old-signals".

  4. Enter the following endpoint URI, substituting the name of your signals collection for signalsCollectionName:

    solr://signalsCollectionName/update
  5. In the Call Method field, select "post".

  6. Under Query Parameters, enter the property name "wt" with the property value "json".

  7. In the Request entity (as string) field, enter the following, adjusting the date range to match your retention policy:

    <root><delete><query>timestamp_tdt:[* TO NOW-14DAYS]</query></delete><commit /></root>

    See Working with Dates for details about date formatting.

    Your job configuration should look similar to this:

    Signals delete job configuration

Tip
You can configure a schedule for this job at System > Scheduler.
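
If you manage several retention periods, you could generate the request entity rather than typing it by hand. The helper below is a hypothetical sketch that expresses the retention period in days, since DAYS is a unit accepted by Solr date math; it only builds the string and does not call Fusion.

```python
from xml.sax.saxutils import escape


def delete_old_signals_body(retention_days: int) -> str:
    """Build the delete-by-query request entity for signals older than the cutoff."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    query = f"timestamp_tdt:[* TO NOW-{retention_days}DAYS]"
    # escape() keeps the body valid XML if the query ever contains &, <, or >.
    return f"<root><delete><query>{escape(query)}</query></delete><commit /></root>"


print(delete_old_signals_body(30))
```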