Enable cloud signal storage
Storing signals in the cloud reduces the amount of data stored on a Solr cluster. Signals data files are periodically compacted into larger files to save storage space, improve performance, and make it easier to manage the files.
This article explains how to set up cloud signal storage in Google Cloud Platform (GCP) or Amazon Web Services (AWS).
For more information on cloud signal storage in Fusion, see Cloud signal storage.
There are known issues with using cloud signal storage. Before you begin, review the known issues and consider whether enabling cloud signal storage will negatively impact your Fusion environment.
Initial setup
To enable cloud signal storage, start with a new deployment. Enable cloud signal storage by altering the fusion-indexing service in your custom values YAML file:

fusion-indexing:
  cloudSignals:
    enabled: true
    kafkaSvcUrl: {KAFKA_URL}
    topic: fusion.system.cloud-signals
Use the value {KAFKA_URL}, as written in the preceding example. This value is set by the Fusion deployment scripts.
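Optionally, after deployment you can confirm that the cloud signals topic was created in Kafka. The following is a minimal sketch, assuming a Kubernetes-based deployment whose Kafka pods include the standard Kafka CLI tools; the namespace, pod name, script name, and broker port are placeholders for your environment, not values defined by Fusion.

    # List the Kafka topics from inside a Kafka pod and check for the
    # cloud signals topic. Adjust the bootstrap server address to match
    # the Kafka service used by your deployment.
    kubectl exec -n NAMESPACE KAFKA_POD -- \
      kafka-topics.sh --bootstrap-server localhost:9092 --list \
      | grep fusion.system.cloud-signals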
After performing the initial setup, continue the setup specific to your cloud storage provider. Cloud signal storage is supported with Google Cloud Storage and Amazon S3.
Can I use multiple storage methods?
Fusion supports one storage location for signals data. You cannot store signals in multiple places.
Google Cloud Storage
- Add the following values to your custom values YAML file:

    cloud-signals:
      enabledStorage:
        - gcs
      gcs:
        outputDir: "OUTPUT_DIRECTORY"
        cloudSecret: SERVICE_ACCOUNT_SECRET
      compactor:
        intervalS: 7200
        collectionExecutors: 1
        partitionExecutors: 1
        rowLimitPerFile: "1000000"
        resources: {}
        gcs:
          rootPath: COMPACTOR_ROOTPATH_DIRECTORY
          outputDir: COMPACTOR_OUTPUT_DIRECTORY
- Replace the placeholder values with your environment values. See the following table for details:

    Placeholder value | Description
    ---|---
    OUTPUT_DIRECTORY | The location used for uncompacted signals data files. For example, gs://smartdata-datasets/streaming/compactor/in.
    SERVICE_ACCOUNT_SECRET | Your service account’s authentication secret. For more information on generating a service account secret, see Creating service account credentials.
    COMPACTOR_ROOTPATH_DIRECTORY | The same value that you used for OUTPUT_DIRECTORY.
    COMPACTOR_OUTPUT_DIRECTORY | The location used for compacted signals data files. For example, gs://smartdata-datasets/streaming/compactor/out.
- Configure how often you want the compactor to run and the file size for compacted signals. See the following table for configuration options:

    Option | Example value | Description
    ---|---|---
    intervalS | 7200 | The interval between compactor operations, in seconds. For example, 7200 runs the compactor every two hours.
    rowLimitPerFile | 1000000 | The maximum number of rows of data written to each compacted signals data file.

- Deploy Fusion using the custom values YAML file. See the sketch after this procedure for example commands.
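The following is a minimal sketch of the secret creation and deployment commands, assuming a Helm-based Fusion deployment on Kubernetes. The namespace, release name, chart reference, and the key file name inside the secret are assumptions; substitute the values your deployment already uses.

    # Create the Kubernetes secret that holds the GCP service account key.
    # The secret name must match the cloudSecret value in the custom values
    # YAML file; the key file name inside the secret (sa.json) is an assumption.
    kubectl create secret generic SERVICE_ACCOUNT_SECRET \
      --namespace NAMESPACE \
      --from-file=sa.json=/path/to/service-account-key.json

    # Deploy (or upgrade) Fusion with the custom values YAML file.
    helm upgrade --install RELEASE_NAME lucidworks/fusion \
      --namespace NAMESPACE \
      --values custom-values.yaml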
Amazon S3
- Add the following values to your custom values YAML file:

    cloud-signals:
      enabledStorage:
        - s3
      s3:
        outputDir: "OUTPUT_DIRECTORY"
        cloudSecret: SERVICE_ACCOUNT_SECRET
        region: SERVICE_REGION
      compactor:
        intervalS: 7200
        collectionExecutors: 1
        partitionExecutors: 1
        rowLimitPerFile: 1000000
        resources: {}
        s3:
          secret: SERVICE_ACCOUNT_SECRET
          keyIdFieldName: key
          secretKeyFieldName: secret
          region: SERVICE_REGION
          rootPath: COMPACTOR_ROOTPATH_DIRECTORY
          outputDir: COMPACTOR_OUTPUT_DIRECTORY
- Replace the placeholder values with your environment values. See the following table for details:

    Placeholder value | Description
    ---|---
    OUTPUT_DIRECTORY | The location used for uncompacted signals data files. For example, s3a://smartdata-datasets/streaming/compactor/in.
    SERVICE_ACCOUNT_SECRET | Your service account’s authentication secret. For more information on generating a service account secret, see AWS Secrets Manager.
    SERVICE_REGION | Your S3 service region. For more information, see AWS service endpoints.
    COMPACTOR_ROOTPATH_DIRECTORY | The value that you used for OUTPUT_DIRECTORY, with the prefix changed from s3a to s3. For example, s3://smartdata-datasets/streaming/compactor/in.
    COMPACTOR_OUTPUT_DIRECTORY | The location used for compacted signals data files. For example, s3://smartdata-datasets/streaming/compactor/out.
- Configure how often you want the compactor to run and the file size for compacted signals. See the following table for configuration options:

    Option | Example value | Description
    ---|---|---
    intervalS | 7200 | The interval between compactor operations, in seconds. For example, 7200 runs the compactor every two hours.
    rowLimitPerFile | 1000000 | The maximum number of rows of data written to each compacted signals data file.

- Deploy Fusion using the custom values YAML file. See the sketch after this procedure for an example of creating the credentials secret.
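The following is a minimal sketch of creating the S3 credentials secret, assuming a Kubernetes-based deployment. The namespace is a placeholder; the field names key and secret correspond to the keyIdFieldName and secretKeyFieldName values in the configuration above.

    # Create the Kubernetes secret that holds the AWS credentials.
    # The secret name must match the cloudSecret and secret values in the
    # custom values YAML file, and the field names must match
    # keyIdFieldName (key) and secretKeyFieldName (secret).
    kubectl create secret generic SERVICE_ACCOUNT_SECRET \
      --namespace NAMESPACE \
      --from-literal=key=YOUR_AWS_ACCESS_KEY_ID \
      --from-literal=secret=YOUR_AWS_SECRET_ACCESS_KEY

    # Then deploy Fusion with the custom values YAML file, as in the
    # Google Cloud Storage example.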
Result
If you successfully enabled cloud signal storage, your deployment has two new pods:
Pod name | Description
---|---
 | The cloud signals consumer pod. If you are using Google Cloud Storage, the pod name contains
 | The cloud signals compactor pod. If you are using Google Cloud Storage, the pod name contains
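To confirm that both pods were created, list the pods in your Fusion namespace. This is a minimal sketch; the namespace is a placeholder for your environment.

    # List the pods in the Fusion namespace and look for the cloud signals
    # consumer and compactor pods.
    kubectl get pods --namespace NAMESPACE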
Application setup
- Create a new Fusion application and index some data.
- Update the click signals aggregation job:
  - Navigate to Collections > Jobs and select the click signals aggregation job. By default, this job is named APP_NAME_click_signals_aggregation.
  - Change the Source Collection value to the path of your cloud storage location. For example, gs://smartdata-datasets/streaming/compactor/in. You might see a message that states, "This collection does not exist." This is expected with cloud signal collections. Ensure your path is correct and disregard this message.
  - Change the Data Format value to parquet.
  - Click Save to save the changes to the job.
- Send signals to your application. For testing, you can send signals manually.
  - To send signals using the API:

        curl -u USERNAME:PASSWORD -X POST \
          --url 'https://FUSION_HOST.com/api/apps/APP_NAME/signals/APP_NAME?async=false&commit=true' \
          --header 'Content-Type: application/json' \
          --header 'cache-control: no-cache' \
          --data '[
            {
              "type": "SIGNAL_TYPE",
              "params": {
                "docId": "DOCUMENT_ID",
                "count": "NUMBER_OF_SIGNALS",
                "collection": "APP_NAME",
                "query": "*:*",
                "filterQueries": []
              }
            }
          ]'

    Replace placeholder values, such as APP_NAME, with your environment values. The SIGNAL_TYPE value can be any signal type, such as click, response, cart, or a custom signal type.
  - To send signals in the Fusion UI from the Query Workbench:
    - Navigate to Querying > Query Workbench.
    - Click Format Results.
    - Enable the Send click signals option.
    - Click Save.
    - Click a result title in the Query Workbench to send a signal.
- Verify your signals were captured in your cloud storage. See the sketch after this procedure for example commands.
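The following is a minimal sketch of checking for signals data files from the command line, using the example paths from this article. It assumes the Google Cloud SDK (gsutil) or the AWS CLI is installed and authenticated; substitute your own bucket paths.

    # Google Cloud Storage: list uncompacted and compacted signals data files.
    gsutil ls gs://smartdata-datasets/streaming/compactor/in
    gsutil ls gs://smartdata-datasets/streaming/compactor/out

    # Amazon S3: list uncompacted and compacted signals data files.
    aws s3 ls s3://smartdata-datasets/streaming/compactor/in/ --recursive
    aws s3 ls s3://smartdata-datasets/streaming/compactor/out/ --recursive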
Why is my signals collection empty?
The signals collection holds signals stored in a Solr cluster. If you’re using cloud signal storage, signals are stored in the cloud instead, and this collection will be empty.