High Availability and Disaster Recovery

High availability

High system availability is sustained by sufficient levels of redundancy in the processing stack. Factors to consider when determining system availability include:

Redundant resource costs
System performance
System synchronization methods and the level of synchronization each method provides
Computational costs
Network latency

Redundancy levels available are:

Data Redundancy. Create multiple copies of the data on a single storage device to increase the probability of at least one copy being available at all times.
Storage Redundancy. Distribute the data across multiple physical storage devices to ensure data is available if one of the physical storage devices is not accessible.
Process/Pod Redundancy. Create multiple copies of each service/pod to reduce the risk of a service becoming unavailable if an individual copy fails for some reason.
Process/Pod Distribution. Distribute duplicate pods across different nodes in a cluster to protect against the loss of service due to a failed node in the cluster.
Node Redundancy. Create multiple nodes in the cluster within a single Availability Zone to protect against the loss of any single node. For this option:
- The level of node over-provisioning is a function of both the probability of a node failing, and the average time required to provision and return a new node to service after a failure.
- Sufficient redundancy means that if one node fails, the remaining nodes have sufficient computing capacity to manage the load while a new node is provisioned.
Availability Zone Redundancy. Distribute resources across multiple zones to allow the system to tolerate the loss of an entire zone without compromising the availability of the system as a whole. While this is an option, Lucidworks does not use this because it does not consider this best practice.
Cluster Redundancy. - Create multiple copies at the cluster level within a single data center to ensure that there is no loss of service even if an entire cluster is lost.
Data Center Redundancy. Distribute clusters across multiple data centers in a region. If availability to a specific data center is lost, the system is still available in other data centers.
Region Redundancy. Distribute clusters across multiple regions, for example, east and west. If a major outage occurs and availability to a specific region is lost, the system is still available in the other regions.

Query runtime options

The following options describe Fusion configurations.

ImportantThe options are listed in order, from the least highly-available and lower disaster-prepared state, to the optimal highly-available and highest disaster-prepared state.

Single cluster located in on-premise client hardware. This provides the lowest level of availability and disaster recovery if an outage or data corruption occurs.
Single cluster deployed in the cloud. This provides the lowest cloud-based level of availability and disaster recovery if an outage or data corruption occurs.
Active-Passive disaster recovery. Redundant clusters exist, but only one is running at a given time.
Green-Blue-Active-Passive disaster recovery. Redundant clusters exist, but only one is available to service user requests.
Active-Active disaster recovery. Redundant clusters exist that contain identical live infrastructure and all clusters are actively servicing user requests. For more information about the configuration to synchronize indexes between data clusters, see Active - Active.
Green-Blue Active-Active disaster recovery. Redundant clusters exist that contain duplicate Kubernetes-deployed environments and all clusters are actively servicing user requests.
Fully Active-Active Traffic Shaped disaster recovery. - Redundant clusters exist and all clusters are actively servicing user requests, with traffic directed to a specific cluster according to geography or user affinity.

Fusion High Availability architecture types

Architecture types depend on the number of nodes in a cluster.

Nodes in cluster	Services on node
All services on 3+ nodes	All services run on each node.
All services on 2+1 nodes	● All services run on each node. ● Requires an additional node to run only Zookeeper.
Deploy services separately on 10 nodes	● 3 nodes for only search services (admin-ui, api, Zookeeper) ● 2 nodes for only Solr service (search engine) ● 2 nodes for signals services (Spark master and worker) ● 2 nodes for connectors services

Fusion Disaster Recovery deployment architecture types

Disaster Recovery uses independent Fusion clusters on both data centers. The key tasks are to synchronize:

data center configurations
the actual data (indexes)

Strategy to sync data center configuration

Extract configurations from lower environment and move to higher environment. During this process we ensure configurations apply to high environments for both data centers.

Setups to sync indexes between data centers

Active - Active

Hot - Hot: The data centers connect to a common node for good network bandwidth.
1. Fusion clusters are on both data centers without Solr.
2. Deploy Solr service on the common node.
3. Both Fusion clusters connect to Solr node to GET/PUT catalog or signals records.
4. Solr index has its own backup and restore procedure.
Hot - Warm: There is no common data node between the data centers.
1. Deploy Fusion full stack clusters on both data centers.
2. Use Fusion connectors or PBL job to transfer catalog and signals from one datacenter to another.
3. Solr index is already replicated on both data centers.
4. During data transfer, it is possible that data is not fully synchronized.

Active - Passive

Hot - Cold
1. Fusion has 1-2 nodes on the secondary datacenter.
2. Configurations are periodically copied over to the secondary data center.
3. Connectors job is set up to copy indexes.
4. Could also take periodic Solr Active cluster backups and cache them temporarily so the most recent backup is always available to restore into the Cold cluster when it starts.

Data transfer options

Fusion Parallel Bulk Loader (PBL) job

For the Fusion Parallel Bulk Loader (PBL) job:

Pros include:
- Easy to configure Fusion Spark job.
- Blazing fast batch transfer with scheduling and monitoring options available.
- Configurable to filter out documents.
Cons include:
- None, because this option does not need to run Spark services.

Solr streaming expressions

Here is an example of Solr streaming expressions:

daemon(id="uniqueId",
    runInterval="5000",
    terminate="false",
    commit(Target,
 batchSize=5000,
 update(Target,
     batchSize=500,
     topic(
  Source-Checkpoint-Coll,
  Source,
                zkHost="localhost:9983/solr",
                q="*:*",
                fq="mimeType_s:application/pdf",
    fl="id,*",
                sort="id asc",
                qt="/export"
      )
         )
     )
)

Pros include:
- Keeps track of what has been read to date and only sends what is new.
- Open source and easy to configure.
- Does not depend on shard or replica count, and will tolerate differences on either end of the transaction.
- Tolerates source-side outages and picks up where it left off.
Cons include:
- Can only send stored fields.
- Not parallelizable like Spark jobs.
- Does not tolerate destination-side outages. When the target is not available, data transfers can drop because the source-side high-water-mark may move before the target-side failure is detected.

Automated Solr backups

You can schedule backups of Solr collections, store the backups for a configurable period of time, and restore these backups into a specified Fusion cluster when needed. The following guide uses Google Kubernetes Engine (GKE) for examples and assumes you used the setup scripts in the fusion-cloud-native repository to install Fusion.

Solr backups using cloud provider storage options

The standard approach of using a provider-specific Persistent Volume Claim (PVC) for storing collection backups ensures consistency in configuration. However, this method does not leverage the unique storage features offered by each cloud provider. For instance, Google Cloud provides Google Cloud Storage, which includes additional features such as access control management, various storage tiers, and other capabilities that are not available when using a PVC. To take advantage of these features, Solr instances running within Fusion require additional provider-specific information.Refer to the Solr documentation for detailed information on the repositories available for configuring collection backups. Each repository type comes with specific configuration options and features. Generally, you will need to integrate the provider-specific configuration into the solr.xml configuration file and ensure that the appropriate library or module for the provider is included in the Solr classpath. This step is necessary for the repository implementation to resolve correctly at runtime.For example, when configuring GCSBackupRepository to store backups in Google Cloud Storage (GCS), it is essential to include the corresponding library for the provider in the Solr classpath. Additionally, you will need to add a section to the solr.xml file similar to the XML example below to specify the target bucket where backups will be stored:

<backup>
  <repository name="gcs_backup" class="org.apache.solr.gcs.GCSBackupRepository" default="false">
    <str name="gcsBucket">solrBackups</str>
    <str name="gcsCredentialPath">/local/path/to/credential/file</str>
    <str name="location">/default/gcs/backup/location</str>
    <int name="gcsClientMaxRetries">5</int>
    <int name="gcsClientHttpInitialRetryDelayMillis">1500</int>
    <double name="gcsClientHttpRetryDelayMultiplier">1.5</double>
    <int name="gcsClientHttpMaxRetryDelayMillis">10000</int>
  </repository>
</backup>

After configuring the backup provider, you can utilize the standard Solr backup and restore APIs to create new backups or restore from existing ones. Instead of writing to a PVC, backups will be stored in the storage solution specific to the provider.

Solr backups using Persistent Volume Claim

Backups are taken using the Solr collection BACKUP command. This requires that each Solr node has access to a shared volume or a ReadWriteMany volume in Kubernetes. Most cloud providers offer a simple way of creating a shared filestore and exposing it as a PersistentVolumeClaim within Kubernetes to mount into the Solr pods. An option is added to the setup_f5_PROVIDER.sh scripts in the fusion-cloud-native repository to provision these.The backup action of the script is invoked by a Kubernetes CronJob to run the backup schedule. The backups are saved to a configurable directory with an automatically generated name: <collection_name>-<timestamp_in_some_format>.A separate CronJob is responsible for cleanup and retention of backups. Cleanup can be disabled if not needed. Setting a series of retention periods can automatically remove backups as they become outdated.For example, a cluster that backs up a collection every 3 hours could specify a retention policy that:

Keeps all backups for a single day.
Keeps a single daily backup for a week.
Keeps a single weekly backup for a month.
Keeps a single monthly backup for 6 months.
Deletes all backups that are older than 6 months.

All times are configurable as part of the configmap for this service.The process for restoring a collection is a manual step involving kubectl run to invoke the Solr RESTORE action pointing to the collection and the name of the backup being restored.

These instructions are for GKE only. For other platforms, backup and restoration involves copying the collection to the cloud and using Parallel Bulk Loader.

Install using a PVC with GKE

The solr-backup-runner requires that a ReadWriteMany volume is mounted onto all solr and backup-runner pods so they all back up to the same filesystem.The easiest way to install on GKE is by using a GCP Filestore as the ReadWriteMany volume.

Create the Filestore.

gcloud --project "${GCLOUD_PROJECT}" filestore instances create "${NFS_NAME}"  --tier=STANDARD --file-share=name="solrbackups,capacity=${SOLR_BACKUP_NFS_GB}GB" --zone="${GCLOUD_ZONE}" --network=name="${network_name}"

Fetch the IP of the Filestore.

NFS_IP="$(gcloud filestore instances describe "${NFS_NAME}" --project="${GCLOUD_PROJECT}" --zone="${GCLOUD_ZONE}" --format="value(networks.ipAddresses[0])")"

Create a Persistent Volume in Kubernetes that is backed by this volume.

cat <<EOF | kubectl -n "${NAMESPACE}" apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
 name: ${NAMESPACE}-solr-backups
 annotations:
   pv.beta.kubernetes.io/gid: "8983"
spec:
 capacity:
   storage: ${SOLR_BACKUP_NFS_GB}G
 accessModes:
   - ReadWriteMany
 nfs:
   path: /solrbackups
   server: ${NFS_IP}
EOF

Create a Persistent Volume Claim in the namespace that Solr is running in.

cat <<EOF | kubectl -n "${NAMESPACE}" apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: fusion-solr-backup-claim
spec:
 volumeName: ${NAMESPACE}-solr-backups
 accessModes:
   - ReadWriteMany
 storageClassName: ""
 resources:
   requests:
     storage: ${SOLR_BACKUP_NFS_GB}G
EOF

Add the following values to your existing (or a new) Helm values file.

solr-backup-runner:
 enabled: true
 sharedPersistentVolumeName: fusion-solr-backup-claim
solr:
 additionalInitContainers:
   - name: chown-backup-directory
     securityContext:
       runAsUser: 0
     image: busybox:latest
     command: ['/bin/sh', '-c', "owner=$(stat -c '%u' /mnt/solr-backups);  if [ ! \"${owner}\" = \"8983\" ]; then chown -R 8983:8983 /mnt/solr-backups; fi "]
     volumeMounts:
       - mountPath: /mnt/solr-backups
         name: solr-backups
 additionalVolumes:
   - name: solr-backups
     persistentVolumeClaim:
       claimName: fusion-solr-backup-claim
 additionalVolumeMounts:
   - name: solr-backups
     mountPath: "/mnt/solr-backups"

Upgrade the release. Solr backups are now enabled.

Get Started

Introduction to Fusion

Getting Data In

Getting Data Out

Operations

Reference

Developer Docs

Neural Hybrid Search

Release Notes