Cloud signal storage
By default, Fusion stores signals in Solr. Alternatively, you can choose to store signals in Google Cloud Storage or Amazon S3.
Storing signals in the cloud reduces the amount of data stored on a Solr cluster. Signals data files are periodically compacted into larger files to save storage space, improve performance, and make it easier to manage the files.
Cloud signal storage is enabled at the time of deployment. To enable cloud signal storage for a new Fusion deployment, you must configure this feature in the deployment options. For more information, see Enable cloud signal storage.
All signal types are supported, including custom signal types. The default signal schema can be customized.
Lucidworks offers free training to help you get started with Fusion. Check out the Cloud Signal Storage course, which focuses on how to use cloud signal storage in Fusion to enhance your data storage efficiency. Visit the LucidAcademy to see the full training catalog.
Can I migrate existing signals to cloud signal storage?
Currently, there is no migration path from default signal storage in Solr to cloud signal storage. This improvement is planned for a future release. To use cloud signal storage, start with a new deployment.
Data flow
- Kafka
  Both internally and externally originated signals flow through the signals indexing pipeline. Unlike default signal storage, when cloud signal storage is enabled, signals are not sent to Solr. Instead, they are sent to the fusion.system.cloud-signals Kafka topic. Kafka is the messaging system that cloud signal storage uses to buffer signals before they are written to cloud storage.
- Signals consumer
  A signals consumer, which runs within a Kubernetes pod, polls the Kafka topic frequently and writes signals data files in Parquet format to a stable, temporary cloud storage location. Due to the write frequency, this results in a large number of small files.
- Signals compactor
  A signals compactor, which also runs within a Kubernetes pod, periodically compacts the signals data files written by the signals consumer into a smaller number of larger files. In contrast to aggregation, compaction does not change the data itself.
  The compacted signals data files are stored separately from the uncompacted signals data files. The compaction frequency is configurable to meet the needs of your use case.
- Spark SQL aggregations
  Fusion Distributed Compute schedules and runs Spark jobs, which operate over the compacted Parquet partitions written by the signals compactor. The output is written to Solr for use at query time for boosting and recommendations.
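The difference between compaction and aggregation can be illustrated with a minimal, in-memory Python sketch. This is not Fusion's implementation: the function names compact and aggregate_clicks and the sample records are hypothetical, and plain Python lists stand in for Parquet files.

```python
from collections import Counter


def compact(small_files, max_rows_per_file):
    """Merge many small 'files' (lists of rows) into fewer, larger ones.

    Rows are copied unchanged and in order; only the file boundaries move.
    """
    all_rows = [row for f in small_files for row in f]
    return [
        all_rows[i:i + max_rows_per_file]
        for i in range(0, len(all_rows), max_rows_per_file)
    ]


def aggregate_clicks(files):
    """Aggregation, by contrast, derives new data: click counts per (query, doc_id)."""
    counts = Counter()
    for f in files:
        for row in f:
            if row["type"] == "click":
                counts[(row["query"], row["doc_id"])] += row["count_i"]
    return dict(counts)


# Hypothetical small files, as a frequently polling consumer might write them.
small_files = [
    [{"type": "click", "query": "laptop", "doc_id": "d1", "count_i": 1}],
    [{"type": "click", "query": "laptop", "doc_id": "d1", "count_i": 1}],
    [{"type": "request", "query": "phone", "doc_id": "", "count_i": 1}],
    [{"type": "click", "query": "phone", "doc_id": "d7", "count_i": 1}],
]

compacted = compact(small_files, max_rows_per_file=3)
original_rows = [row for f in small_files for row in f]
compacted_rows = [row for f in compacted for row in f]

print(len(small_files), "files ->", len(compacted), "files")
print("rows unchanged:", original_rows == compacted_rows)
print(aggregate_clicks(compacted))
```

Compaction only moves file boundaries, so the rows read back are identical before and after. The aggregation step produces a different, derived data set, which in Fusion is what gets written to Solr.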
Cloud signal schema
Signals are stored in Apache Parquet file format, which is designed to handle flat, column-oriented data.
How do I read the signals data files?
One recommended method for viewing cloud signals data is with Apache Spark.
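As a minimal sketch of reading signals with Spark, the following PySpark code loads Parquet files and optionally filters them by signal type. It assumes that the pyspark package is installed and that Spark is configured with the connector and credentials for your storage provider. The helper names load_signals and main and the path placeholder are hypothetical; the type and query columns come from the default schema.

```python
def load_signals(spark, path, signal_type=None):
    """Read signals Parquet files into a DataFrame, optionally filtered by type."""
    df = spark.read.parquet(path)
    if signal_type is not None:
        # "type" is a required field in the default cloud signal schema.
        df = df.filter(df["type"] == signal_type)
    return df


def main():
    # Imported here so that load_signals can be defined without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-cloud-signals").getOrCreate()

    # Placeholder: replace with the location of your signals data files,
    # for example a gs:// or s3a:// URI.
    signals_path = "gs://YOUR_BUCKET/PATH_TO_SIGNALS"

    clicks = load_signals(spark, signals_path, signal_type="click")
    clicks.printSchema()
    # Show the number of click signals per query.
    clicks.groupBy("query").count().show(10, truncate=False)
```

Call main() from an environment where Spark can reach your bucket, such as a script passed to spark-submit.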
You can view the current Parquet schema configuration with the following command:
kubectl get cm FUSION_NAMESPACE-cloud-signals-config -o yaml -n FUSION_NAMESPACE
The output of this command resembles the following:
apiVersion: v1
data:
  consumerProperties: |
    "auto.offset.reset": earliest
    "enable.auto.commit": false
    "max.partition.fetch.bytes": 5000000
    "max.poll.records": 10000
  jsonToAvroMapping: |
    msg_id: id
    query_t: query
  schema: |
    id:
      type: string
      required: true
    collection:
      type: string
      required: true
    type:
      type: string
      required: true
    year:
      type: int
      required: true
    month:
      type: int
      required: true
    day:
      type: int
      required: true
    query:
      type: string
      required: true
    doc_id:
      type: string
      default: ""
    filters_s:
      type: string
      defaultValue: ""
    count_i:
      type: int
      default: 1
    timestamp_tdt:
      type: string
      required: true
    fusion_query_id:
      type: string
      default: ""
    user_id:
      type: string
      default: ""
    session:
      type: string
      default: ""
  sinkProperties: |
    fs.gs.batch.threads: 4
    fs.gs.max.requests.per.batch: 4
    fs.gs.outputstream.upload.chunk.size: 8388608
    fs.s3a.buffer.dir: /tmp/buffers
    fs.s3a.fast.upload.buffer: disk
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: joelb-cloud-signals
    meta.helm.sh/release-namespace: joelb
  creationTimestamp: "2023-09-06T15:48:44Z"
  labels:
    app.kubernetes.io/instance: joelb-cloud-signals
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: cloud-signals
    app.kubernetes.io/version: "5"
    helm.sh/chart: cloud-signals-0.3.1
  name: joelb-cloud-signals-config
  namespace: joelb
  resourceVersion: "2430163461"
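To make the roles of the jsonToAvroMapping and schema blocks more concrete, here is a small Python sketch of how an incoming JSON signal might be mapped onto the schema. The semantics are assumptions for illustration only, not confirmed Fusion behavior: the mapping is treated as a rename from incoming JSON field names to schema field names, required fields must be present, missing optional fields receive their default, and fields not in the schema are dropped. The function name to_row is hypothetical, and only a subset of the schema shown above is used.

```python
# Subset of the jsonToAvroMapping and schema blocks from the ConfigMap output.
JSON_TO_AVRO = {"msg_id": "id", "query_t": "query"}
SCHEMA = {
    "id": {"type": "string", "required": True},
    "type": {"type": "string", "required": True},
    "query": {"type": "string", "required": True},
    "doc_id": {"type": "string", "default": ""},
    "count_i": {"type": "int", "default": 1},
}
PY_TYPES = {"string": str, "int": int}


def to_row(signal):
    """Map a JSON signal onto the schema (assumed semantics, for illustration)."""
    # Rename incoming fields; names without a mapping are kept as they are.
    renamed = {JSON_TO_AVRO.get(k, k): v for k, v in signal.items()}
    row = {}
    for field, spec in SCHEMA.items():
        if field in renamed:
            value = renamed[field]
        elif spec.get("required"):
            raise ValueError(f"missing required field: {field}")
        else:
            value = spec.get("default")
        if value is not None and not isinstance(value, PY_TYPES[spec["type"]]):
            raise TypeError(f"{field} should be of type {spec['type']}")
        row[field] = value
    return row


row = to_row(
    {"msg_id": "abc-123", "type": "click", "query_t": "laptop", "doc_id": "d1"}
)
print(row)
```

In this sketch, msg_id and query_t arrive under their JSON names and are stored as id and query, while count_i is absent from the signal and falls back to its default of 1.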
To configure the schema:
- Update the fields defined in the schema block.
- Save the configuration file in YAML format.
- Reapply the schema to the namespace.
- Restart the cloud signals consumer pod.
  You must restart the cloud signals consumer pod for the updated schema to take effect.
- Update the click signals aggregation job so that mergeSchema is set to true.
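After a schema change, older signals data files lack fields that newer files contain. Spark's mergeSchema option for Parquet reconciles compatible schemas across files when reading. The following pure-Python sketch only illustrates the idea and is not how Spark implements it. The function names and the device_s field are hypothetical, and lists of dictionaries stand in for Parquet files.

```python
def merged_columns(files):
    """Return the union of column names across all files, in first-seen order."""
    columns = []
    for rows in files:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return columns


def read_with_merged_schema(files):
    """Return every row with the merged column set; absent values become None."""
    columns = merged_columns(files)
    return [{c: row.get(c) for c in columns} for rows in files for row in rows]


# A file written before the schema change and one written after it.
old_file = [{"id": "1", "query": "laptop"}]
new_file = [{"id": "2", "query": "phone", "device_s": "mobile"}]

rows = read_with_merged_schema([old_file, new_file])
for r in rows:
    print(r)
```

Rows from the older file carry an empty value for the new field, so a job that reads both generations of files sees one consistent set of columns.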
Known issues
Some features in Fusion rely on signals data stored in Solr, not in cloud signal storage. The following features do not function when cloud signal storage is enabled:
- Experiments analytics
- Predictive Merchandiser analytics
- Login signals
Other features not listed here may also be affected.