Built-in SQL Aggregation Jobs Using Cloud Storage Buckets
Fusion aggregation jobs can read signal data stored in cloud storage buckets as .parquet and .orc files. For Google Cloud Storage, the configuration below uses these placeholders:

Placeholder | Description |
---|---|
<key name> | Name of the Solr GCS service account key. |
<key file path> | Path to the Solr GCS service account key. |
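If the service account key is provided to the cluster as a Kubernetes secret (the Spark settings later in this section mount such a secret), creating it might look like the sketch below. This is standard kubectl; the placeholder names come from the table above, so substitute your own values.

```bash
# Create a Kubernetes secret from the GCS service account key file.
kubectl create secret generic <key name> --from-file=<key file path>
```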
Specify the key name and key file path in values.yaml. Then set these parameters on the aggregation job:

Parameter Name | Example Value | Notes |
---|---|---|
SOURCE COLLECTION | gs://<path_to_data>/*.parquet | URI path that contains the desired signal data files. This example returns all .parquet files in the specified directory using the gs scheme to access Google Cloud Storage (GCS). |
DATA FORMAT | parquet | File type of the input files. Valid values include parquet and orc. |
Also set these Spark configuration properties:

Parameter Name | Example Value | Notes |
---|---|---|
spark.kubernetes.driver.secrets.{secret-name} | /mnt/gcp-secrets | Replace {secret-name} with the secret name obtained during configuration. Example: example-serviceaccount-key. |
spark.kubernetes.executor.secrets.{secret-name} | /mnt/gcp-secrets | Replace {secret-name} with the secret name obtained during configuration. Example: example-serviceaccount-key. |
spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS | /mnt/gcp-secrets/{secret-name}.json | Path to the .json file used to create the {secret-name} secret. Example: example-serviceaccount-key.json. |
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS | /mnt/gcp-secrets/{secret-name}.json | Path to the .json file used to create the {secret-name} secret. Example: example-serviceaccount-key.json. |
spark.hadoop.google.cloud.auth.service.account.json.keyfile | /mnt/gcp-secrets/{secret-name}.json | Path to the .json file used to create the {secret-name} secret. Example: example-serviceaccount-key.json. |
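Taken together, and using the example secret name example-serviceaccount-key from the table, the five Spark settings might look like the following key=value sketch. In Fusion these are entered as individual Spark configuration properties; the key=value form here is only for readability.

```
spark.kubernetes.driver.secrets.example-serviceaccount-key=/mnt/gcp-secrets
spark.kubernetes.executor.secrets.example-serviceaccount-key=/mnt/gcp-secrets
spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=/mnt/gcp-secrets/example-serviceaccount-key.json
spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/mnt/gcp-secrets/example-serviceaccount-key.json
spark.hadoop.google.cloud.auth.service.account.json.keyfile=/mnt/gcp-secrets/example-serviceaccount-key.json
```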
For AWS S3, configure the credentials in values.yaml. Then set these parameters on the aggregation job:

Parameter Name | Example Value | Notes |
---|---|---|
SOURCE COLLECTION | s3a://<path_to_data>/*.parquet | URI path to the desired signal data files. Returns all Parquet files in the directory. The s3a scheme is used for AWS S3 access. |
DATA FORMAT | parquet | File type of the input data. Other supported value: orc. |
Also set these Spark configuration properties:

Parameter Name | Example Value | Notes |
---|---|---|
spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID | {aws-secret-key} | The aws-secret:key obtained during configuration. |
spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY | {aws-secret-secret} | The aws-secret:secret obtained during configuration. |
spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID | {aws-secret-key} | The aws-secret:key obtained during configuration. |
spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY | {aws-secret-secret} | The aws-secret:secret obtained during configuration. |
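The aws-secret:key and aws-secret:secret references above imply a Kubernetes secret named aws-secret containing keys named key and secret. If you create that secret yourself, a minimal sketch with standard kubectl looks like this; substitute your own credentials.

```bash
# Create a secret holding the AWS access key ID and secret access key.
kubectl create secret generic aws-secret \
  --from-literal=key='<AWS_ACCESS_KEY_ID>' \
  --from-literal=secret='<AWS_SECRET_ACCESS_KEY>'
```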
For Azure blob storage, mount a core-site.xml file into the job-launcher pod at /app/spark-dist/conf. Then set these parameters on the aggregation job:

Parameter Name | Example Value | Notes |
---|---|---|
SOURCE COLLECTION | wasbs://<path_to_data>/*.parquet | URI path that contains the desired signal data files. This example returns all Parquet files in the directory. The wasbs scheme is used for Azure blob storage. |
DATA FORMAT | parquet | File type of the input data. Other valid value: orc. |
Also set these Spark configuration properties:

Parameter Name | Example Value | Notes |
---|---|---|
spark.hadoop.fs.wasbs.impl | org.apache.hadoop.fs.azure.NativeAzureFileSystem | Makes the wasbs file system implementation available inside the Spark job. |
spark.hadoop.fs.azure.account.key.{storage-account-name}.blob.core.windows.net | {access-key-value} | Obtain the values for {storage-account-name} and {access-key-value} from the Azure UI. |
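Because Spark passes spark.hadoop.* properties through to the Hadoop configuration, the same two settings can also be supplied in the mounted core-site.xml without the spark.hadoop. prefix. A sketch, using a hypothetical storage account named mystorageaccount:

```xml
<configuration>
  <!-- Hypothetical example: replace mystorageaccount and the key value with your own. -->
  <property>
    <name>fs.wasbs.impl</name>
    <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
  </property>
  <property>
    <name>fs.azure.account.key.mystorageaccount.blob.core.windows.net</name>
    <value>{access-key-value}</value>
  </property>
</configuration>
```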
When signals are enabled for a collection, Fusion creates the COLLECTION_NAME_signals and COLLECTION_NAME_signals_aggr collections, plus the aggregation jobs described below. You can view these jobs at Collections > Jobs.
Job | Default input collection | Default output collection | Default schedule |
---|---|---|---|
COLLECTION_NAME_click_signals_aggregation | COLLECTION_NAME_signals | COLLECTION_NAME_signals_aggr | Every 15 minutes |
COLLECTION_NAME_session_rollup | COLLECTION_NAME_signals | COLLECTION_NAME_signals | Every 15 minutes |
The following jobs run on a daily schedule by default:

Job | Default input collection | Default output collection | Default schedule |
---|---|---|---|
COLLECTION_NAME_user_item_prefs_agg | COLLECTION_NAME_signals | COLLECTION_NAME_recs_aggr | Once per day |
COLLECTION_NAME_user_query_history_agg | COLLECTION_NAME_signals | COLLECTION_NAME_signals_aggr | Once per day |
The COLLECTION_NAME_click_signals_aggregation job computes a time-decayed weight for each document, query, and filters group in the signals collection. Fusion computes the weight for each group using an exponential time decay on signal count (30-day half-life) and a weighted sum based on the signal type. This approach gives more weight to a signal that represents a user purchasing an item than to a user merely clicking on an item.
 | query | count_i | type | timestamp_tdt | user_id | doc_id | session_id | fusion_query_id |
---|---|---|---|---|---|---|---|---|
Required signals fields: | required | required | required | required | | required | | See note below. |
Set the signal types and their weights with the signalTypeWeights SQL parameter in the Fusion Admin UI. Fusion passes the signalTypeWeights parameter into a WHERE IN clause to filter signals by the specified types (click, cart, purchase), and also passes it into the weighted_sum SQL function. Notice that Fusion displays only the SQL parameters, not the actual SQL, for this job. This simplifies configuration because, in most cases, you only need to change the parameters rather than the SQL itself. However, if you need to change the SQL for this job, you can edit it under the Advanced toggle on the form.
You can configure the COLLECTION_NAME_click_signals_aggregation job to use a Parquet file in a cloud storage bucket as the source of raw signals instead of a Fusion signals collection, as described above.

The COLLECTION_NAME_session_rollup job aggregates related user activity into a session signal that contains activity count, duration, and keywords (based on user search terms).
The Fusion App Insights application uses this job to show reports about user sessions.
Use the elapsedSecsSinceLastActivity
and elapsedSecsSinceSessionStart
parameters to determine when a user session is considered to be complete. You can edit the SQL using the Advanced toggle.
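The rollup itself is easiest to picture as a grouping over each user's session. The sketch below is standard Spark SQL and is not the job's actual SQL, which also uses elapsedSecsSinceLastActivity and elapsedSecsSinceSessionStart to decide when a session is complete; the output field names here are hypothetical.

```sql
-- Illustrative sketch only; not the job's real SQL.
SELECT user_id, session_id,
       COUNT(*) AS activity_count,                               -- signals in the session
       UNIX_TIMESTAMP(MAX(timestamp_tdt))
         - UNIX_TIMESTAMP(MIN(timestamp_tdt)) AS duration_secs,  -- session duration in seconds
       COLLECT_SET(query) AS keywords                            -- keywords from the user's search terms
FROM COLLECTION_NAME_signals
GROUP BY user_id, session_id
```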
The COLLECTION_NAME_session_rollup
job uses signals as the input collection and output collection. Unlike other aggregation jobs that write aggregated documents to the COLLECTION_NAME_signals_aggr
collection, the COLLECTION_NAME_session_rollup
job creates session signals and saves them to the COLLECTION_NAME_signals
collection.
The COLLECTION_NAME_user_item_preferences_aggregation job computes an aggregated weight for each user/item combination found in the signals collection. The weight for each group is computed using an exponential time decay on signal count (30-day half-life) and a weighted sum based on the signal type.
 | query | count_i | type | timestamp_tdt | user_id | doc_id | session_id | fusion_query_id |
---|---|---|---|---|---|---|---|---|
Required signals fields: | | ✅ | ✅ | ✅ | ✅ | ✅ | | |
Use the signalTypeWeights SQL parameter to set the correct signal types and weights for your dataset. Its value is a comma-delimited list of signal types, each with a stakeholder-defined level of importance. Think of this numeric value as a weight that indicates which type of signal matters most for determining a user’s interest in an item. For example, click:1.0, cart:100000.0 is too wide a spread. Values of click:1.0 and cart:50.0 would be a reasonable setting, indicating that a cart signal is 50 times more important than a click for measuring a user’s interest in an item.
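Written as the actual parameter value, that reasonable setting is a single comma-delimited string:

```
click:1.0, cart:50.0
```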
The COLLECTION_NAME_item_recommendations job, which trains a recommender model, is scheduled to run after this job completes. Consequently, you should run this aggregation only once or twice a day, because training a recommender model is a complex, long-running job that requires significant resources from your Fusion cluster.
The COLLECTION_NAME_user_query_history_aggregation job computes an aggregated weight for each user/query combination found in the signals collection. The weight for each group is computed using an exponential time decay on signal count (30-day half-life) and a weighted sum based on the signal type. Use the signalTypeWeights parameter to set the correct signal types and weights for your dataset. You can use the results of this job to boost queries for a user based on their past query activity.
 | query | count_i | type | timestamp_tdt | user_id | doc_id | session_id | fusion_query_id |
---|---|---|---|---|---|---|---|---|
Required signals fields: | ✅ | ✅ | ✅ | ✅ | ✅ | | | |