Collections
Your data is organized into collections. When you create an app, Fusion automatically creates a collection with the same name. You can create additional collections in any app. A primary collection contains the data that your users will search. Every primary collection is associated with a set of auxiliary collections that contain related data, such as signals, aggregations, and more. Under the hood, a Fusion collection is a distributed index in Solr, defined by a named configuration stored in ZooKeeper, with these properties:
- Number of shards: Documents are distributed across this number of partitions.
- Document routing strategy: How documents are assigned to shards.
- Replication factor: How many copies of each document are kept in the collection.
- Replica placement strategy: Where to place replicas in the cluster.
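As a rough illustration of how these properties map onto Solr, the sketch below builds a Solr Collections API CREATE request URL. The parameter names (numShards, replicationFactor, router.name, collection.configName) are standard Solr parameters; the base URL, collection name, and values are hypothetical, and replica placement is left to Solr's defaults. In a Fusion deployment you would normally let Fusion create the collection rather than calling Solr directly.

```python
from urllib.parse import urlencode


def build_create_collection_url(solr_base_url, name, num_shards, replication_factor,
                                router_name="compositeId", config_name=None):
    """Return a Solr Collections API CREATE URL for the given collection layout."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,                   # number of shards (partitions)
        "replicationFactor": replication_factor,   # copies of each document
        "router.name": router_name,                # document routing strategy
    }
    if config_name is not None:
        # Named configuration stored in ZooKeeper.
        params["collection.configName"] = config_name
    return f"{solr_base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
print(build_create_collection_url("http://localhost:8983/solr", "products", 2, 3,
                                  config_name="products"))
```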
Automated Solr backups
You can schedule backups of Solr collections, store the backups for a configurable period of time, and restore these backups into a specified Fusion cluster when needed. The following guide uses Google Kubernetes Engine (GKE) for examples and assumes you used the setup scripts in the fusion-cloud-native repository to install Fusion.
Solr backups using cloud provider storage options
To store backups with a cloud provider, configure the backup repository in the solr.xml configuration file and ensure that the appropriate library or module for the provider is included in the Solr classpath. This step is necessary for the repository implementation to resolve correctly at runtime.
For example, when configuring GCSBackupRepository to store backups in Google Cloud Storage (GCS), you must include the corresponding library for the provider in the Solr classpath. You also need to add a section to the solr.xml file that specifies the target bucket where backups will be stored.
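The following sketch generates a backup section of the kind described above, so you can see its general shape before editing solr.xml by hand. It assumes the repository class name org.apache.solr.gcs.GCSBackupRepository and the gcsBucket and location parameters as documented in the Solr Reference Guide; confirm both against the Solr version bundled with your Fusion release. The repository name and bucket name are hypothetical.

```python
import xml.etree.ElementTree as ET


def gcs_backup_section(bucket, repo_name="gcs_backup", location="/"):
    """Render a <backup> section for solr.xml that targets a GCS bucket."""
    backup = ET.Element("backup")
    repo = ET.SubElement(backup, "repository", {
        "name": repo_name,                                   # hypothetical repository name
        "class": "org.apache.solr.gcs.GCSBackupRepository",  # assumed class name, verify for your Solr version
        "default": "false",
    })
    for key, value in (("gcsBucket", bucket), ("location", location)):
        item = ET.SubElement(repo, "str", {"name": key})
        item.text = value
    return ET.tostring(backup, encoding="unicode")


# Hypothetical bucket name, for illustration only.
print(gcs_backup_section("my-backup-bucket"))
```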
Solr backups using Persistent Volume Claim
This option stores backups on a shared ReadWriteMany volume in Kubernetes. Most cloud providers offer a simple way of creating a shared filestore and exposing it as a PersistentVolumeClaim within Kubernetes to mount into the Solr pods. The setup_f5_PROVIDER.sh scripts in the fusion-cloud-native repository include an option to provision these.
The backup action of the script is invoked by a Kubernetes CronJob to run the backup schedule. The backups are saved to a configurable directory with an automatically generated name: <collection_name>-<timestamp_in_some_format>.
A separate CronJob is responsible for cleanup and retention of backups. Cleanup can be disabled if not needed. Setting a series of retention periods automatically removes backups as they become outdated.
For example, a cluster that backs up a collection every 3 hours could specify a retention policy that:
- Keeps all backups for a single day.
- Keeps a single daily backup for a week.
- Keeps a single weekly backup for a month.
- Keeps a single monthly backup for 6 months.
- Deletes all backups that are older than 6 months.
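The tiered policy above can be sketched as a small selection function. This is an illustration of the idea, not the cleanup job's actual logic: it assumes a month is 30 days, six months is 180 days, and that the newest backup in each day, week, or month is the one to keep.

```python
from datetime import datetime, timedelta

# Tiers are checked in order. Each pairs a maximum age with a function that
# maps a timestamp to a bucket key; None means "keep every backup in this tier".
TIERS = [
    (timedelta(days=1), None),
    (timedelta(days=7), lambda t: ("day", t.date())),
    (timedelta(days=30), lambda t: ("week", t.isocalendar()[0], t.isocalendar()[1])),
    (timedelta(days=180), lambda t: ("month", t.year, t.month)),
]


def select_backups_to_keep(timestamps, now):
    """Return the set of backup timestamps that the tiered policy retains."""
    keep = set()
    seen_buckets = set()
    for ts in sorted(timestamps, reverse=True):  # newest first
        age = now - ts
        for max_age, bucket_of in TIERS:
            if age <= max_age:
                if bucket_of is None:
                    keep.add(ts)
                else:
                    bucket = bucket_of(ts)
                    if bucket not in seen_buckets:  # first (newest) backup in this bucket
                        seen_buckets.add(bucket)
                        keep.add(ts)
                break
        # Backups that match no tier are older than six months and are not kept.
    return keep


# Hypothetical schedule: one backup every 3 hours over the last 10 days.
now = datetime(2024, 1, 31)
backups = [now - timedelta(hours=3 * i) for i in range(80)]
print(len(select_backups_to_keep(backups, now)), "of", len(backups), "backups kept")
```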
The retention periods are defined in the configmap for this service.
Restoring a collection is a manual step: use kubectl run to invoke the Solr RESTORE action, pointing to the collection and the name of the backup being restored.
Install using a PVC with GKE
The solr-backup-runner requires that a ReadWriteMany volume is mounted onto all solr and backup-runner pods so they all back up to the same filesystem.
The easiest way to install on GKE is by using a GCP Filestore as the ReadWriteMany volume.
- Create the Filestore.
- Fetch the IP of the Filestore.
- Create a Persistent Volume in Kubernetes that is backed by this volume.
- Create a Persistent Volume Claim in the namespace that Solr is running in.
- Add the following values to your existing (or a new) Helm values file.
- Upgrade the release. Solr backups are now enabled.
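As a sketch of the Persistent Volume and Persistent Volume Claim steps, the snippet below builds both manifests as JSON, which kubectl apply -f accepts. It assumes the Filestore is mounted over NFS; the resource names, IP address, share name, namespace, and size are hypothetical placeholders. Creating the Filestore and setting the Helm values are not covered here.

```python
import json


def filestore_volume_manifests(filestore_ip, share_name, namespace,
                               pv_name="fusion-solr-backup-pv",
                               pvc_name="fusion-solr-backup-claim",
                               size="1Ti"):
    """Build PersistentVolume and PersistentVolumeClaim manifests for an NFS share."""
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": pv_name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            "nfs": {"server": filestore_ip, "path": f"/{share_name}"},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": pvc_name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": "",   # empty: bind to the pre-created volume, no dynamic provisioning
            "volumeName": pv_name,    # bind explicitly to the volume defined above
            "resources": {"requests": {"storage": size}},
        },
    }
    return pv, pvc


# Hypothetical values, for illustration only.
pv, pvc = filestore_volume_manifests("10.0.0.2", "solrbackups", "fusion")
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [pv, pvc]}, indent=2))
```

You could save the printed JSON to a file and apply it with kubectl apply -f before upgrading the release.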
Auxiliary Collections
Every primary collection is associated with a set of auxiliary collections that contain related data, such as signals, aggregations, and more. Some auxiliary collections are created for every primary collection. Others are created only for the app’s default collection, one per app. Auxiliary collections are described below:
Collection | Description | Quantity |
---|---|---|
APP_NAME_job_reports | Output from Fusion experiments, Ranking Metrics jobs, and Head/Tail Analysis jobs. | 1 per app |
APP_NAME_query_rewrite | A collection of documents to use for rewriting queries, optimized for high-volume traffic. These documents originate from the APP_NAME_query_rewrite_staging collection. | 1 per app |
APP_NAME_query_rewrite_staging | A collection of documents created by the Rules Editor or by certain Fusion jobs, not optimized for production traffic. Documents move from this collection to the APP_NAME_query_rewrite collection. | 1 per app |
COLLECTION_NAME_signals | A collection of search query logs and signals. | 1 per collection |
COLLECTION_NAME_signals_aggr | A collection for aggregated signals. | 1 per collection |
APP_NAME_user_prefs | A collection of data to support App Studio’s social features, such as user-generated tags, bookmarks, comments, ratings, and so on. | 1 per app |
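To make the naming pattern concrete, the sketch below lists the auxiliary collection names you would expect for one app and its primary collections. It only mirrors the table above and assumes signals are enabled for every collection; the app and collection names are hypothetical.

```python
# Suffixes taken from the table above.
PER_APP_SUFFIXES = ["job_reports", "query_rewrite", "query_rewrite_staging", "user_prefs"]
PER_COLLECTION_SUFFIXES = ["signals", "signals_aggr"]


def auxiliary_collections(app_name, collection_names):
    """List the auxiliary collection names for an app and its primary collections."""
    names = [f"{app_name}_{suffix}" for suffix in PER_APP_SUFFIXES]  # one set per app
    for collection in collection_names:
        # One set per primary collection.
        names.extend(f"{collection}_{suffix}" for suffix in PER_COLLECTION_SUFFIXES)
    return names


# Hypothetical app "shop" with primary collections "shop" and "blog".
for name in auxiliary_collections("shop", ["shop", "blog"]):
    print(name)
```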
- Datasources
- Pipelines
- Profiles
- Signals and aggregations
- Analytics dashboards
System Collections
Fusion automatically creates some collections that are used for internal purposes and shared across all apps:
- system_autocomplete stores the content that the Fusion UI displays when you use the search bar.
- system_blobs stores blobs in Solr. This is used to store model files for the NLP components and other binary files used by Fusion components.
- system_history keeps a record of configuration changes, start and stop times for services and experiments, and more.
- system_jobs_history keeps a record of Fusion jobs, including start/stop times and status.
- system_messages is used by Fusion’s messaging services.
Collection Configuration Properties
Collections have three properties that you can configure only when you are creating a collection using the Collections API.
Property | Description | Default behavior |
---|---|---|
signals* | The signals property determines whether to create auxiliary collections with suffixes _signals and _signals_aggr. | When you create a collection in the Fusion UI, signals defaults to true. When you create a collection using the Fusion API, this property defaults to false. |
searchLogs | The searchLogs property determines whether to create an auxiliary search query logs collection with suffix _logs. | When you create a collection in the Fusion UI, this property defaults to true. When you create a collection using the Fusion API, this property defaults to false. |
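The difference in defaults is easy to overlook, so here is a small model of it. The function names are hypothetical and the code does not call Fusion; it only encodes the rules in the table above: both properties are true for collections created in the UI, and false for collections created through the API unless the request sets them.

```python
def effective_options(created_via, requested=None):
    """Resolve signals and searchLogs for a new collection.

    created_via: "ui" or "api". requested: optional overrides, honored only
    for API creation, because the properties can be set only through the API.
    """
    if created_via == "ui":
        return {"signals": True, "searchLogs": True}
    if created_via == "api":
        options = {"signals": False, "searchLogs": False}
        options.update(requested or {})
        return options
    raise ValueError("created_via must be 'ui' or 'api'")


def auxiliary_suffixes(options):
    """Return the suffixes of the auxiliary collections these options create."""
    suffixes = []
    if options["signals"]:
        suffixes += ["_signals", "_signals_aggr"]
    if options["searchLogs"]:
        suffixes.append("_logs")
    return suffixes


print(auxiliary_suffixes(effective_options("ui")))
print(auxiliary_suffixes(effective_options("api")))
print(auxiliary_suffixes(effective_options("api", {"signals": True})))
```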
Using profiles to associate collections with pipelines
Index pipelines and query pipelines are not connected to a specific collection by default. Index profiles and query profiles are configurations that create consistent endpoints for indexing and querying, each with a specific pipeline and collection.
- Index Profiles work with index pipelines for getting content into the system.
- Query Profiles work with query pipelines for user queries.
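One way to picture a profile is as a named pairing of a pipeline and a collection that clients address by the profile's ID. The sketch below is a conceptual model only, with hypothetical class, field, and profile names; it is not Fusion's data model or API. It shows why the endpoint stays stable: you can point the profile at a different pipeline without changing the ID that clients use.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Profile:
    profile_id: str   # the stable name that clients address
    pipeline: str     # index or query pipeline to run
    collection: str   # collection the pipeline reads from or writes to


def resolve(profiles, profile_id):
    """Look up which pipeline and collection a profile currently points to."""
    profile = profiles[profile_id]
    return profile.pipeline, profile.collection


# Hypothetical query profile for a "products" collection.
profiles = {"products-search": Profile("products-search", "products-v1", "products")}
print(resolve(profiles, "products-search"))

# Swap the pipeline behind the same profile ID; clients keep using the same endpoint.
profiles["products-search"] = replace(profiles["products-search"], pipeline="products-v2")
print(resolve(profiles, "products-search"))
```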
Field Editor UI
The Fusion UI includes a space under Collections to edit Fields. Descriptions for these fields can be found in the Field Type Definitions section of the Solr Reference Guide associated with your Fusion release. Field options displayed in the UI include:
- Dynamic checkbox (cannot change via UI)
- Field Name (cannot change via UI)
- Field Type (a preset value is shown that can be changed using edit mode)
- Checkboxes for Indexed, Stored, Multivalued, Required
- Text field to enter a Default Value
- Copy Fields uses the plus sign to add rows (static fields can copy to raw_content or text; dynamic fields can copy to any raw_content/text or any other dynamic field)
- Advanced toggles checkboxes for Doc Values, Omit Norms, Omit Positions, Omit term freq and positions, Term Vectors, Term Positions, Term Offsets
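The options in the UI correspond to Solr field properties, which can also be expressed as Solr Schema API commands. The sketch below builds add-field and add-copy-field payloads using standard Solr property names (indexed, stored, multiValued, required, default, docValues, omitNorms, and so on). The field name and field type are hypothetical, and the snippet only prints the JSON rather than sending it to Solr.

```python
import json


def add_field_command(name, field_type, indexed=True, stored=True,
                      multi_valued=False, required=False, default=None, advanced=None):
    """Build a Schema API add-field command from the basic field options."""
    field = {
        "name": name,
        "type": field_type,
        "indexed": indexed,
        "stored": stored,
        "multiValued": multi_valued,
        "required": required,
    }
    if default is not None:
        field["default"] = default
    # Advanced options, for example docValues, omitNorms, termVectors.
    field.update(advanced or {})
    return {"add-field": field}


def add_copy_field_command(source, destinations):
    """Build a Schema API add-copy-field command."""
    return {"add-copy-field": {"source": source, "dest": list(destinations)}}


# Hypothetical field "title_t" of type "text_general", copied to "text".
print(json.dumps(add_field_command("title_t", "text_general",
                                   advanced={"docValues": False, "termVectors": True}), indent=2))
print(json.dumps(add_copy_field_command("title_t", ["text"]), indent=2))
```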