A signal is a recorded event related to one or more documents in a collection. Signals can record any kind of event that is useful to your organization. Click signals are the most common type of signals as this is the most common action a user takes with an item. In addition, other signal types can be defined, such as "addToCart", "purchase", and so on.
Using a sufficiently large collection of signals, Fusion can automatically generate recommendations such as these:
Based on the user’s search query, which items are most likely to interest them?
Based on the user’s similarity to other users, which additional items are likely to interest them?
Signals are indexed in a secondary collection which is linked to the primary collection by the naming convention
<primarycollectionname>_signals. So, if your main collection is named
products, the associated signals collection is named
products_signals. The signals collection is created automatically when signals are enabled for the primary collection. Signals are enabled by default whenever a new collection is created.
Signals are indexed just like ordinary documents. The signals collection can be searched like any other collection, for example by using the Query Workbench with the signals collection selected.
App Insights provides visualizations and reports with which to analyze your signals. App Insights mainly uses raw signals, but also uses some aggregated signals. Currently only the signal types Request, Response and Click are supported within the App Insights dashboards.
See signals types and structure for more information.
Fusion uses SQL to produce these signal summaries. SQL is a familiar query language that is well suited to data aggregation. Fusion’s SQL Aggregation Engine is powerful and flexible.
Aggregations are created automatically whenever you enable signals or recommendations. This topic explains how to create or modify aggregations individually. You can do this using the Fusion UI or the Jobs API. For more information, see Creating Aggregations.
Most aggregation jobs run with the catch-up flag set to
true, which means that Fusion only computes aggregations for new signals that have arrived since the last time the job was run, and up to and including
ref_time, which is usually the run time of the current job. Fusion must "roll up" the newly aggregated rows into any existing aggregated rows in the
If you are using SQL to do aggregation but have not supplied a custom rollup SQL, Fusion generates a basic rollup SQL script automatically by consulting the schema of the aggregated documents. If your rollup logic is complex, you can provide a custom rollup SQL script.
Fusion’s basic rollup is a SUM of numeric types (long, integer, double, and float) and can support
time_decay for the weight field. If you are not using
time_decay in your weight calculations, then the weight is calculated using
weight_d. If you do include
time_decay with your weight calculations, then the weight is calculated as a combination of timestamp,
The basic rollup SQL can be grouped by
FILTERS. The last GROUP BY field in the main SQL is used so it will ultimately group any newly aggregated rows with existing rows.
This is an example of a rollup query:
SELECT query_s, doc_id_s, time_decay(1, timestamp_tdt, "30 days", ref_time, weight_d) AS weight_d , SUM(aggr_count_i) AS aggr_count_i FROM `commerce_signals_aggr` GROUP BY query_s, doc_id_s
When Fusion rolls up new data into an aggregation, time-range filtering lets you ensure that Fusion does not aggregate the same data over and over again.
Fusion applies a time-range filter when loading rows from Solr, before executing the aggregation SQL statement. In other words, the SQL executes over rows that are already filtered by the appropriate time range for the aggregation job.
Notice that the examples Perform the Default SQL Aggregation and Use Different Weights Based on Signal Types do not include a time-range filter. Fusion computes the time-range filter automatically as follows:
If the catch-up flag is set to
true, Fusion uses the last time the job was run and
ref_time(which you typically set to the current time). This is equivalent to the WHERE clause
WHERE time > last_run_time AND time <= ref_time.
If the catch-up flag is not set to
true, Fusion uses a filter with
ref_time(and no start time). This is equivalent to the WHERE clause
WHERE time <= ref_time.
The built-in time logic should suffice for most use cases. You can set the time range filter to TO and specify a WHERE clause filter to achieve more complex time based filtering.
Time range in Fusion is equivalent to date range in Solr.
Example values for time range:
[* TO NOW]- all past events
[2000-11-01 TO 2020-12-01]– specify by exact date
[* TO 2020-12-01]– from the first data point to the end of the day specified
[2000 TO 2020]- from the start of a year to the end of another year
A Spark SQL aggregation query can use any of the functions provided by Spark SQL. Use these functions to perform complex aggregations or to enrich aggregation results.
Fusion automatically uses a default
time_decay function to compute and apply appropriate weights to aggregation groups during aggregation. Larger weights are assigned to more recent events. This reduces the impact of less-recent signals. Intuitively, older signals (and the user behavior they represent) should count less than newer signals.
If the default
time_decay function does not meet your needs, you can modify it. The
time_decay function is implemented as a UserDefinedAggregateFunction (UDAF).
This is the UDAF signature of the default
time_decay(count: Long, timestamp: Timestamp, halfLife: String (calendar interval), ref_time: Timestamp, weight_d: Double)
One small difference between the prior and current behavior is worth mentioning in passing:
Prior to Release 4.0, the
decay_sumaggregator function used the difference between the
aggregationTime(the time at which the aggregation job is run) and the event time to calculate exponentially decayed numerical values.
In Release 4.0,
time_decayis a similar function. In
ref_timeis used instead of
aggregationTime. You can set
aggregationTimeto some other time than the run time of the aggregation job.
In practice, you will probably want to use
Your function call can also use this abbreviated UDAF signature, that omits
time_decay(count: Long, timestamp: Timestamp)
In this case, Fusion fills in these values for the omitted parameters:
Users indexing custom signals with weight_d specified should ensure that the "default" value matches the weight_d parameter used by
To match the results of legacy aggregation, either use the abbreviated function signature or supply these values for the mentioned parameters:
Number of occurrences of the event. Typically, the increment is 1, though there is no reason it could not be some other number. In most cases, you simply pass
The date-and-time for the event. This time is the beginning of the interval used to calculate the time-based decay factor.
Half life for the exponential decay that Fusion calculates. It is some interval of time, for example,
Reference time used to compute the age of an event for the time-decay computation. It is usually the time when the aggregation job runs (
Initial weight for an event if not specified in the signal, prior to the decay calculation. This value is typically not present in the signal data. If some signals do contain
You can use SQL to compute
This is an example of how Fusion calculates the age of a signal:
Imagine a SQL aggregation job that runs at
Tuesday, July 11, 2017 1:00:00 AM (1499734800).
For a signal with the timestamp
Tuesday, July 11, 2017 12:00:00 AM (1499731200), the age of the signal in relation to the reference time is 1 hour.
The "cold start" problem means it is hard to personalize the search experience when insufficient signals have been aggregated. For example, it is hard to offer recommendations to users who have never visited before, or for queries that have never been issued before, or for items that have been recently introduced into the system.
Fusion provides solutions for this problem using its query pipelines. A query pipeline that includes stages for blocking, boosting, or recommending based on signals can also include stages that provide fallbacks. In the case where there is not enough data to provide specialized blocking, boosting, or recommendations, the pipeline can return a simpler set of search results using Solr’s normal relevancy calculation.
A common solution to the cold start problem is to sort or boost on a certain field to provide pseudo-recommendations when more specific recommendations are not available. For example, you can sort on the
sales_rank field to recommend the most popular products, or boost on the
date_added field to recommend the newest items.