Cloud signal storage

By default, Fusion stores signals in Solr. Alternatively, you can choose to store signals in Google Cloud Storage or Amazon S3.

Storing signals in the cloud reduces the amount of data stored on a Solr cluster. Signals data files are periodically compacted into larger files to save storage space, improve performance, and make it easier to manage the files.

Cloud signal storage can only be enabled at deployment time. To enable it for a new Fusion deployment, configure this feature in the deployment options.

All signal types are supported, including custom signal types. The default signal schema can be customized.

Lucidworks offers free training to help you get started.

The Course for Cloud Signal Storage focuses on how to use cloud signal storage in Fusion to enhance your data storage efficiency.

Visit the LucidAcademy to see the full training catalog.

Can I migrate existing signals to cloud signal storage?

Currently, there is no migration path from default signal storage in Solr to cloud signal storage. This improvement is planned for a future release. To use cloud signal storage, start with a new deployment.

Data flow

Cloud signal storage flow

Kafka: Both internally and externally originated signals flow through the signals indexing pipeline. In contrast to default signal storage, when cloud signal storage is enabled, signals are not sent to Solr. Instead, they are sent to the fusion.system.cloud-signals Kafka topic. Kafka is the messaging system that Fusion cloud signals uses to buffer signals before they are written to cloud storage.

Signals consumer: A signals consumer, which runs within a Kubernetes pod, polls the Kafka topic frequently and writes signals data files in Parquet format to a stable, temporary cloud storage location. Due to the write frequency, this results in a large number of small files.

Signals compactor: A signals compactor, which also runs within a Kubernetes pod, periodically compacts the signals data files written by the signals consumer into a smaller number of larger files. In contrast to aggregation, compaction does not change the data itself.

The compacted signals data files are stored separately from the uncompacted signals data files. The compaction frequency is configurable, to meet the needs of your use case.
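
The difference between compaction and aggregation can be illustrated with a minimal sketch. This is not Fusion's compactor; it only models the idea, with in-memory lists standing in for Parquet files. The function name `compact` and the `target_rows` parameter are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, object]


def compact(small_files: List[List[Record]], target_rows: int) -> List[List[Record]]:
    """Merge many small batches into fewer batches of up to target_rows rows."""
    if target_rows < 1:
        raise ValueError("target_rows must be at least 1")
    compacted: List[List[Record]] = []
    current: List[Record] = []
    for batch in small_files:
        for record in batch:
            current.append(record)  # records pass through unchanged
            if len(current) == target_rows:
                compacted.append(current)
                current = []
    if current:
        compacted.append(current)  # final, possibly smaller, file
    return compacted


# Six small "files" of two signals each, as a frequently writing consumer might produce.
small = [[{"id": f"s{i}-{j}", "type": "click"} for j in range(2)] for i in range(6)]
large = compact(small, target_rows=5)
```

Here twelve records spread over six inputs end up in three outputs, and every record is identical before and after. An aggregation, by contrast, would combine records into new derived rows.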

Spark SQL aggregations: Fusion Distributed Compute schedules and runs Spark jobs, which operate over the compacted Parquet partitions written by the signals compactor. The output is written to Solr for use at query time for boosting and recommendations.

Cloud signal schema

Signals are stored in Apache Parquet file format, which is designed to handle flat, column-oriented data.

How do I read the signals data files?

One recommended method for viewing cloud signals data is with Apache Spark.
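
As a rough sketch, reading the files with PySpark might look like the following. The bucket name and path prefix are hypothetical, since the actual storage layout depends on your deployment, and the Spark session is assumed to already have the Google Cloud Storage or S3A connector and credentials configured. The `type`, `year`, `month`, and `day` columns come from the default schema shown later in this section.

```python
from typing import Optional


def signals_uri(scheme: str, bucket: str, prefix: str) -> str:
    """Build a storage URI such as gs://bucket/prefix or s3a://bucket/prefix."""
    if scheme not in ("gs", "s3a"):
        raise ValueError(f"unsupported scheme: {scheme}")
    return f"{scheme}://{bucket}/{prefix.strip('/')}"


def load_signals(uri: str, signal_type: Optional[str] = None):
    """Read signals Parquet files into a Spark DataFrame, optionally filtered by type."""
    # Imported lazily so that signals_uri() works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("read-cloud-signals").getOrCreate()
    df = spark.read.parquet(uri)
    if signal_type is not None:
        df = df.filter(F.col("type") == signal_type)
    return df


# Example usage (hypothetical bucket and prefix):
#   df = load_signals(signals_uri("gs", "my-signals-bucket", "signals/compacted"), "click")
#   df.printSchema()
#   df.groupBy("year", "month", "day").count().show()
```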

You can view the current Parquet schema configuration with the following command:

kubectl get cm FUSION_NAMESPACE-cloud-signals-config -o yaml -n FUSION_NAMESPACE

The output of this command resembles the following:

apiVersion: v1
data:
  consumerProperties: |
    "auto.offset.reset": earliest
    "enable.auto.commit": false
    "max.partition.fetch.bytes": 5000000
    "max.poll.records": 10000
  jsonToAvroMapping: |
    msg_id: id
    query_t: query
  schema: |
    id:
      type: string
      required: true
    collection:
      type: string
      required: true
    type:
      type: string
      required: true
    year:
      type: int
      required: true
    month:
      type: int
      required: true
    day:
      type: int
      required: true
    query:
      type: string
      required: true
    doc_id:
      type: string
      default: ""
    filters_s:
      type: string
      defaultValue: ""
    count_i:
      type: int
      default: 1
    timestamp_tdt:
      type: string
      required: true
    fusion_query_id:
      type: string
      default: ""
    user_id:
      type: string
      default: ""
    session:
      type: string
      default: ""
  sinkProperties: |
    fs.gs.batch.threads: 4
    fs.gs.max.requests.per.batch: 4
    fs.gs.outputstream.upload.chunk.size: 8388608
    fs.s3a.buffer.dir: /tmp/buffers
    fs.s3a.fast.upload.buffer: disk
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: joelb-cloud-signals
    meta.helm.sh/release-namespace: joelb
  creationTimestamp: "2023-09-06T15:48:44Z"
  labels:
    app.kubernetes.io/instance: joelb-cloud-signals
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cloud-signals
    app.kubernetes.io/version: "5"
    helm.sh/chart: cloud-signals-0.3.1
  name: joelb-cloud-signals-config
  namespace: joelb
  resourceVersion: "2430163461"
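
To make the roles of the jsonToAvroMapping and schema blocks more concrete, here is a minimal sketch of how such a configuration could be applied to an incoming signal. It is an illustration under stated assumptions, not the consumer's actual logic: it assumes the mapping renames the left-hand JSON field to the right-hand schema field, that fields absent from the schema are dropped, and that a missing optional field takes its default. The function name `to_row` is hypothetical, and only a subset of the fields is shown.

```python
from typing import Any, Dict

# Subset of the schema block from the ConfigMap output above.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "id": {"type": "string", "required": True},
    "type": {"type": "string", "required": True},
    "query": {"type": "string", "required": True},
    "doc_id": {"type": "string", "default": ""},
    "count_i": {"type": "int", "default": 1},
}

# jsonToAvroMapping, read here as "incoming JSON field -> schema field" (an assumption).
FIELD_MAPPING = {"msg_id": "id", "query_t": "query"}

PY_TYPES = {"string": str, "int": int}


def to_row(signal: Dict[str, Any]) -> Dict[str, Any]:
    """Rename mapped fields, enforce required fields, and fill in defaults."""
    renamed = {FIELD_MAPPING.get(key, key): value for key, value in signal.items()}
    row: Dict[str, Any] = {}
    for name, spec in SCHEMA.items():
        if name in renamed:
            value = renamed[name]
        elif spec.get("required"):
            raise ValueError(f"missing required field: {name}")
        else:
            value = spec.get("default")
        if value is not None and not isinstance(value, PY_TYPES[spec["type"]]):
            raise TypeError(f"field {name} should be of type {spec['type']}")
        row[name] = value
    return row


# "extra" is not in the schema, so this sketch drops it.
row = to_row({"msg_id": "abc", "type": "click", "query_t": "shoes", "extra": 1})
```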

To configure the schema:

Update the fields defined in the schema block.
Save the configuration file in YAML format.
Reapply the schema to the namespace.
Restart the cloud signals consumer pod. The updated schema does not take effect until the pod restarts.
Update the click signals aggregation job so that mergeSchema is set to true.
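
The mergeSchema setting matters because files written before a schema change lack the newly added columns. In Spark, the Parquet `mergeSchema` read option unions the columns found across files and fills missing values with nulls. The sketch below models that behavior in plain Python rather than calling Spark; the field `device_s` is a hypothetical custom field, and the function names are illustrative.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def merge_schemas(files: List[List[Row]]) -> List[str]:
    """Return the union of column names across all files, in first-seen order."""
    columns: List[str] = []
    for rows in files:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return columns


def read_merged(files: List[List[Row]]) -> List[Row]:
    """Read every row against the merged schema, using None for absent columns."""
    columns = merge_schemas(files)
    return [{name: row.get(name) for name in columns} for rows in files for row in rows]


# One file written before the schema change, one written after it.
old_file = [{"id": "a1", "query": "shoes"}]
new_file = [{"id": "b2", "query": "boots", "device_s": "mobile"}]
merged = read_merged([old_file, new_file])
```

Without schema merging, a reader may take the schema from a single file and either miss the new column or fail on the mismatch, which is why the aggregation job needs the setting after a schema change.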

Known issues

Some features in Fusion rely on signals data stored in Solr, not cloud signal storage. The following features are not functional when cloud signal storage is enabled:

Experiments analytics
Predictive Merchandiser analytics
Login signals

Other features not listed here may be impacted.