Fusion 5.12

    Cloud signal storage

    By default, Fusion stores signals in Solr. Alternatively, you can choose to store signals in Google Cloud Storage or Amazon S3.

    Storing signals in the cloud reduces the amount of data stored on a Solr cluster. Signals data files are periodically compacted into larger files to save storage space, improve performance, and make it easier to manage the files.

    Cloud signal storage is enabled at the time of deployment. To use it in a new Fusion deployment, configure the feature in the deployment options. For more information, see Enable cloud signal storage.

    All signal types are supported, including custom signal types. The default signal schema can be customized.

    Lucidworks offers free training to help you get started with Fusion. Check out the Cloud Signal Storage course, which focuses on how to use cloud signal storage in Fusion to enhance your data storage efficiency:

    Cloud Signal Storage

    Visit the LucidAcademy to see the full training catalog.

    Can I migrate existing signals to cloud signal storage?

    Currently, there is no migration path from default signal storage in Solr to cloud signal storage. This improvement is planned for a future release. To use cloud signal storage, start with a new deployment.

    Data flow

    [Figure: Cloud signal storage flow]

    Kafka

    Both internally and externally originated signals flow through the signals indexing pipeline. Unlike default signal storage, cloud signal storage does not send signals to Solr. Instead, signals are sent to the fusion.system.cloud-signals Kafka topic. Kafka is the messaging system that cloud signal storage uses to buffer signals before they are written to cloud storage.

    Signals consumer

    A signals consumer, which runs within a Kubernetes pod, polls the Kafka topic frequently and writes signals data files in Parquet format to a stable, temporary cloud storage location. Because the consumer writes so often, it produces a large number of small files.
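To illustrate why frequent polling yields many small files, here is a minimal Python simulation. It is not Fusion's actual consumer. It assumes a consumer that reads up to `max_poll_records` messages per poll and writes each batch as one file. The in-memory list standing in for the Kafka topic and the function name `drain_topic` are hypothetical.

```python
from typing import Dict, List

Signal = Dict[str, object]


def drain_topic(topic: List[Signal], max_poll_records: int) -> List[List[Signal]]:
    """Simulate a consumer that writes one data file per poll.

    `topic` is an in-memory stand-in for the Kafka topic, and each returned
    inner list stands in for one small data file.
    """
    if max_poll_records <= 0:
        raise ValueError("max_poll_records must be positive")
    files: List[List[Signal]] = []
    offset = 0  # position of the next unread message
    while offset < len(topic):
        batch = topic[offset:offset + max_poll_records]  # one poll
        files.append(batch)                              # one file per poll
        offset += len(batch)  # advance only after the batch is "written"
    return files


signals = [{"id": str(i), "type": "click"} for i in range(25)]
small_files = drain_topic(signals, max_poll_records=10)
print([len(f) for f in small_files])  # [10, 10, 5]
```

In this simulation, a smaller poll size or a more frequent poll produces more and smaller files, which is the situation the compaction step addresses.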

    Signals compactor

    A signals compactor, which also runs within a Kubernetes pod, periodically merges the signals data files written by the signals consumer into a smaller number of larger files. In contrast to aggregation, compaction does not change the data itself.

    The compacted signals data files are stored separately from the uncompacted signals data files. The compaction frequency is configurable, to meet the needs of your use case.
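The difference between compaction and aggregation can be sketched in a few lines of Python. This is a conceptual illustration only, assuming that compaction amounts to regrouping rows into files of up to `target_rows` rows. The function name `compact`, the `target_rows` parameter, and the in-memory lists standing in for Parquet files are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def compact(small_files: List[List[Row]], target_rows: int) -> List[List[Row]]:
    """Merge many small row batches into fewer batches of up to target_rows rows.

    Rows pass through unchanged and in order; only the file boundaries move.
    """
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    compacted: List[List[Row]] = []
    current: List[Row] = []
    for small_file in small_files:
        for row in small_file:
            current.append(row)
            if len(current) == target_rows:
                compacted.append(current)
                current = []
    if current:  # flush the final, partially filled file
        compacted.append(current)
    return compacted


small_files = [[{"id": f"{f}-{r}"} for r in range(3)] for f in range(10)]
large_files = compact(small_files, target_rows=12)

rows_before = [row for f in small_files for row in f]
rows_after = [row for f in large_files for row in f]
print(len(small_files), "->", len(large_files))  # 10 -> 3
print(rows_before == rows_after)                 # True
```

The second printed line is the point of the sketch: the file count drops, but every row is still present and unmodified, whereas an aggregation would replace the rows with computed summaries.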

    Spark SQL aggregations

    Fusion Distributed Compute schedules and runs Spark jobs, which operate over the compacted Parquet partitions written by the signals compactor. The output is written to Solr for use at query time for boosting and recommendations.

    Cloud signal schema

    Signals are stored in Apache Parquet file format, which is designed to handle flat, column-oriented data.

    How do I read the signals data files?

    One recommended method for viewing cloud signals data is with Apache Spark.
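As a rough sketch of that approach, the following PySpark code loads Parquet signal files and filters them on the `year`, `month`, `day`, and `type` fields from the default schema. It assumes PySpark is installed and that the Spark session is already configured with the connector and credentials for your storage provider. The helper names `build_filter` and `read_signals` are hypothetical, and the storage path is whatever location your deployment writes to.

```python
import re
from typing import Optional


def build_filter(year: int, month: int, day: Optional[int] = None,
                 signal_type: Optional[str] = None) -> str:
    """Build a Spark SQL filter over the year, month, day, and type columns."""
    clauses = [f"`year` = {int(year)}", f"`month` = {int(month)}"]
    if day is not None:
        clauses.append(f"`day` = {int(day)}")
    if signal_type is not None:
        # Accept only simple names so the value is safe to embed in SQL.
        if not re.fullmatch(r"[\w-]+", signal_type):
            raise ValueError(f"unsupported signal type: {signal_type!r}")
        clauses.append(f"`type` = '{signal_type}'")
    return " AND ".join(clauses)


def read_signals(path: str, condition: str):
    """Load Parquet signal files from `path` and return the matching rows."""
    # Imported here so build_filter can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-cloud-signals").getOrCreate()
    return (
        spark.read
        .option("mergeSchema", "true")  # reconcile files written under different schemas
        .parquet(path)
        .where(condition)
    )


print(build_filter(2023, 9, day=6, signal_type="click"))
# `year` = 2023 AND `month` = 9 AND `day` = 6 AND `type` = 'click'
```

A call such as `read_signals(path, build_filter(2023, 9, signal_type="click")).show(10)` would then print the first matching rows, where `path` is the `gs://` or `s3a://` location of your signals data files.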

    You can view the current Parquet schema configuration with the following command:

    kubectl get cm FUSION_NAMESPACE-cloud-signals-config -o yaml -n FUSION_NAMESPACE

    The output of this command resembles the following:

    apiVersion: v1
    data:
      consumerProperties: |
        "auto.offset.reset": earliest
        "enable.auto.commit": false
        "max.partition.fetch.bytes": 5000000
        "max.poll.records": 10000
      jsonToAvroMapping: |
        msg_id: id
        query_t: query
      schema: |
        id:
          type: string
          required: true
        collection:
          type: string
          required: true
        type:
          type: string
          required: true
        year:
          type: int
          required: true
        month:
          type: int
          required: true
        day:
          type: int
          required: true
        query:
          type: string
          required: true
        doc_id:
          type: string
          default: ""
        filters_s:
          type: string
          defaultValue: ""
        count_i:
          type: int
          default: 1
        timestamp_tdt:
          type: string
          required: true
        fusion_query_id:
          type: string
          default: ""
        user_id:
          type: string
          default: ""
        session:
          type: string
          default: ""
      sinkProperties: |
        fs.gs.batch.threads: 4
        fs.gs.max.requests.per.batch: 4
        fs.gs.outputstream.upload.chunk.size: 8388608
        fs.s3a.buffer.dir: /tmp/buffers
        fs.s3a.fast.upload.buffer: disk
    kind: ConfigMap
    metadata:
      annotations:
        meta.helm.sh/release-name: joelb-cloud-signals
        meta.helm.sh/release-namespace: joelb
      creationTimestamp: "2023-09-06T15:48:44Z"
      labels:
        app.kubernetes.io/instance: joelb-cloud-signals
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: cloud-signals
        app.kubernetes.io/version: "5"
        helm.sh/chart: cloud-signals-0.3.1
      name: joelb-cloud-signals-config
      namespace: joelb
      resourceVersion: "2430163461"
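One plausible reading of the `jsonToAvroMapping` and `schema` blocks is sketched below in Python. It is an interpretation of the configuration, not the consumer's implementation. It assumes that the mapping renames incoming JSON fields to schema field names (for example `msg_id` to `id`), that `required: true` fields must be present, that other fields fall back to their `default`, and that fields absent from the schema are dropped. The function name `to_record` is hypothetical.

```python
from typing import Any, Dict

# Subset of the ConfigMap above, transcribed by hand for illustration.
JSON_TO_AVRO_MAPPING = {"msg_id": "id", "query_t": "query"}
SCHEMA: Dict[str, Dict[str, Any]] = {
    "id": {"type": "string", "required": True},
    "query": {"type": "string", "required": True},
    "doc_id": {"type": "string", "default": ""},
    "count_i": {"type": "int", "default": 1},
}


def to_record(signal: Dict[str, Any]) -> Dict[str, Any]:
    """Rename fields, enforce required ones, and fill in defaults."""
    # Assumed direction: incoming JSON name -> schema field name.
    renamed = {JSON_TO_AVRO_MAPPING.get(k, k): v for k, v in signal.items()}
    record: Dict[str, Any] = {}
    for name, spec in SCHEMA.items():
        if name in renamed:
            record[name] = renamed[name]
        elif spec.get("required"):
            raise ValueError(f"missing required field: {name}")
        else:
            record[name] = spec.get("default")
    return record  # fields not named in SCHEMA are dropped (assumption)


print(to_record({"msg_id": "abc", "query_t": "shoes"}))
# {'id': 'abc', 'query': 'shoes', 'doc_id': '', 'count_i': 1}
```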

    To configure the schema:

    1. Update the fields defined in the schema block.

    2. Save the configuration file in YAML format.

    3. Reapply the schema to the namespace.

    4. Restart the cloud signals consumer pod. The updated schema does not take effect until the pod restarts.

    5. Update the click signals aggregation job so that mergeSchema is set to true.
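Steps 1 through 4 can be sketched as a small Python helper that assembles the corresponding kubectl commands. This is a sketch under stated assumptions rather than an official procedure. It assumes the ConfigMap is reapplied with `kubectl apply -f` and that the consumer pod is managed by a controller that recreates it after `kubectl delete pod`. The namespace, file name, and pod name shown are placeholders, and the function names are hypothetical.

```python
import subprocess
from typing import List


def export_configmap_cmd(namespace: str) -> List[str]:
    """Command that prints the cloud signals ConfigMap as YAML."""
    return ["kubectl", "get", "cm", f"{namespace}-cloud-signals-config",
            "-o", "yaml", "-n", namespace]


def apply_configmap_cmd(namespace: str, yaml_path: str) -> List[str]:
    """Command that reapplies the edited ConfigMap file."""
    return ["kubectl", "apply", "-f", yaml_path, "-n", namespace]


def restart_pod_cmd(namespace: str, pod_name: str) -> List[str]:
    """Command that deletes the pod so its controller recreates it (assumption)."""
    return ["kubectl", "delete", "pod", pod_name, "-n", namespace]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Print the command, or execute it when dry_run is False."""
    if dry_run:
        print(" ".join(cmd))
        return ""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# Placeholder values; replace them with the names from your own cluster.
namespace = "FUSION_NAMESPACE"
run(export_configmap_cmd(namespace))                      # save this output, then edit the schema block
run(apply_configmap_cmd(namespace, "cloud-signals-config.yaml"))
run(restart_pod_cmd(namespace, "CONSUMER_POD_NAME"))
```

With `dry_run` left at its default, the helper only prints the commands, which makes it possible to review them before running anything against a cluster.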

    Known issues

    Some Fusion features rely on signals data stored in Solr rather than in cloud signal storage. The following features do not work when cloud signal storage is enabled:

    • Experiments analytics

    • Predictive Merchandiser analytics

    • Login signals

    Other features not listed here may also be affected.