Config-sync Microservice

Config-sync is a stateless microservice used to integrate Fusion’s configuration management with GitHub.

Features:

  • Capture changes to Fusion config in centralized source control

  • Use familiar tools and workflows, including visual diff, Pull Requests, WebHooks, and email notifications

  • Enable a simple review process before promoting changes to another environment, such as stage to production

  • Prevent manual changes in production environments; all changes are applied as code, with API calls driven by changes to a Git branch

  • Extensible: Integrate with other tools like Spinnaker, Argo CD, or Jenkins (apply config changes as part of a CI pipeline)

  • Ability to rollback to a previous version using Git revert

So why not just store all config in Git to begin with and do away with Zookeeper?

While replacing ZK as the centralized config store with Git/GitHub seems appealing on the surface, it’s not practical for several reasons:

  • GitHub rate limits! If the limits were exceeded for whatever reason, you would not be able to save config changes in Fusion, which obviously would not work

  • Need a fast, centralized store for all services to contact at any time; accessing GitHub is just not fast enough

  • Watchers: Zookeeper can notify many watchers of a change concurrently whereas we’d need to implement some other kind of distributed notification mechanism for changes made to GitHub; this is doable but at some point we’d probably just end up using ZK for this anyway.

Example Workflow

The following diagram depicts the basic workflow for propagating config changes from a staging environment to a production environment. In this scenario, Fusion operators need to designate a primary source environment where they make changes via the API or Fusion Admin UI (staging, on the left side of the diagram); there can be multiple target environments (production, on the right side), but only one primary source drives the update process. Typically, you only need to deploy a single pod in a cluster. Pods are stateless and simply clone the remote repo to initialize a local Git repo during startup. All the "state" is stored in GitHub.

config sync workflow
  1. The service supports two modes: publisher and subscriber. At step 1, a Fusion Admin user sends updates for Fusion objects to the Staging cluster (publisher) via the Admin UI or API (for example, to edit a Query Pipeline or add an App). The Gateway service routes API requests to the appropriate microservice in Fusion; CRUD operations on Fusion objects are persisted in Zookeeper with a JSON payload. For example, an update to a Query Pipeline gets sent from the Gateway to the Query Pipeline service, which persists the change in Zookeeper (stored as JSON).

  2. The config-sync microservice running in publisher mode on the Staging cluster uses a Curator ZK TreeCache to "watch" for ZNode updates. When a ZNode update event occurs, config-sync processes the update and associates the object with one or more apps being watched. If the update is not linked to an app that is being watched by the service, it is simply ignored.

  3. Updates are committed to a local Git repo (a clone of the remote). PUSH to the remote occurs on a background schedule that looks for pending commits every 5 seconds (configurable). A slight delay is needed because PUSH operations to GitHub are slow and typically take several seconds to process. In addition, PUSH operations are synchronized so that only one thread can update the repo at a time.

  4. At some time in the future, Fusion admin / ops engineers review the changes in the publisher branch (stage). When they are ready to apply the changes, they simply merge the updates into the subscriber branch (prod). This process can be automated as needed using various GitHub tools, such as reviewer notification emails.

    review changes
  5. The config-sync service running in subscriber mode in the target cluster either polls GitHub for updates every 30 seconds (the frequency is configurable but be cognizant of GitHub rate limits) or uses WebHooks to receive push notifications (preferred). Updates that come in from the PULL operation are filtered, sorted, and sent to the config-sync Pulsar topic. Using a persistent topic to queue the updates ensures that updates will be applied if a service is down temporarily. For example, if the query pipeline service is down (for whatever reason), then an update to a Query Pipeline will be retried until the service is back online. Of course, it should be a rare event that a service would be unavailable in a production environment, especially for critical services like query and indexing. Consequently, a simple retry from a persistent topic should suffice.

  6. The config-sync topic consumer (which runs in-process in the config-sync service) receives update messages, applies any substitution vars to the JSON document, and then invokes the appropriate API to perform the update.

  7. Config-sync uses the API to apply updates instead of just writing JSON directly to ZK. This helps ensure updates are validated. If an update still fails after the configured number of retries, it is sent to a Dead-Letter-Queue (DLQ) to be manually resolved by a Fusion admin.

Fusion operators can also use the config-sync service to keep multiple clusters in sync, such as in a multi-region deployment. The following diagram depicts a multi-region scenario.

multi-region deployment

The config-sync service deployed to the mrr-us-west1 cluster runs in publisher mode where it watches for config updates and commits them to the multi-region branch in GitHub. The config-sync service deployed to the mrr-us-central1 cluster runs in subscriber mode where it pulls from the multi-region branch for updates and then applies them to the app. In a real production environment where you have a staging cluster, you’ll likely run the staging cluster in publisher mode and then run the config-sync service in subscriber mode on both production clusters.

Init Containers

Config-sync depends on the Zookeeper, Pulsar, Admin, and Proxy services being online before it initializes.

Publisher Mode

In publisher mode, config-sync watches Zookeeper for changes to Fusion config objects and then saves the changes to a GitHub branch, such as stage. Changes are organized by the Fusion app. Thus, to get started, you need to set up a GitHub repository and a GitHub OAuth token.

The pub.apps setting lets you restrict the apps that are monitored by the service. For instance, there may be many apps in a Fusion cluster, but you may only want to monitor a subset for changes. The pub.git.* settings control the remote GitHub repo, branch, and path within the repo.
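
For reference, here is a minimal sketch of the publisher settings as they might appear under the config-sync key in the Helm values. The nesting follows the pub.apps example shown later in this document; the repoUrl, branch, and oauthToken key names (and the exact placement of pathInRepo and pushPendingChangesFrequencyMs under pub.git) are assumptions for illustration, so check the chart’s values file for the exact names.

config-sync:
  pub:
    # Only watch this app for changes
    apps: "dcommerce"
    git:
      # Hypothetical key names for the remote repo, branch, and OAuth token
      repoUrl: "https://github.com/example-org/fusion-config-sync-test.git"
      branch: "stage"
      oauthToken: "${GITHUB_OAUTH_TOKEN}"
      # Sub-directory within the repo (default is "/")
      pathInRepo: "/test"
      # Push pending commits every 5 seconds (the default)
      pushPendingChangesFrequencyMs: 5000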

To trigger the config-sync service to sync with a new app, such as dcommerce, you run:

curl -u $CREDS -XPUT "https://FUSION_URL/config-sync/pub/sync/dcommerce"

Once synced, any changes to the app in the Fusion admin UI (or API) will get picked up and committed to the branch in GitHub.

By default, the publisher pushes pending commits in the local repo every 5 seconds (pub.git.pushPendingChangesFrequencyMs); a PUSH operation is typically slow and can take several seconds to complete. PUSH is synchronized so only one thread sends updates to GitHub at a time. Because of this brief time interval between PUSH operations, we run the risk of an update being lost if the service fails and the local Git repo is not stored in a persistent volume.

However, during initialization, the publisher receives a ZK watch event for every object in Fusion. Thus, it’s not a major risk if the config-sync service fails before it publishes all updates as it will re-publish them to the repo when it re-initializes after failure. Moreover, the publisher supports a "sync" operation to ensure the branch in GitHub reflects the current state of an app in Fusion, thus allowing operators to "heal" the repo before they propagate changes out to a production environment.

Repo Organization

Config-sync supports a pathInRepo setting to set a sub-directory in the git repo; default is root "/". This allows sharing the same GitHub repo for different clusters. Under the "root" directory in the repo is a sub-directory for each app. Under each app, there is a directory for each object type, such as queryPipelines. The following screenshot shows an example of a GitHub repo that uses /test as the pathInRepo in the fusion-config-sync-test repo:

repo organization
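
In tree form, the same layout looks roughly like this (the app, object type, and file names are illustrative, and per-object file naming is an assumption):

/test
|__ dcommerce
     |__ queryPipelines
          |__ lab4.json
     |__ collections
     |__ indexPipelines
|__ anotherApp
     |__ queryPipelines
     |__ ...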

Config-sync uses the Fusion Links API to determine app association for Fusion objects. However, it does not track links in GitHub as it uses the directory structure to maintain app association for an object.

Watched Objects

The following table lists the object types supported by the config-sync service in the order they are processed, as well as any dependencies on other object types. Config-sync applies updates in the following order: DELETE, CREATE, UPDATE

Object Type                              | Depends On
Blobs*                                   | (none)
Collections                              | (none; changes to system collections are not propagated)
Solr config set (solrconfig.xml, schema) | (none)
Query Pipeline                           | (none)
Experiment                               | (none)
Query Profile                            | Query Pipeline, Experiment, Collection
Index Pipeline                           | (none)
Subscription (Pulsar)                    | (none)
Parser                                   | (none)
Index Profile                            | Index Pipeline, Subscription, Collection
Spark Jobs                               | (none)
Data sources (connectors)                | (none)
Tasks                                    | (none)
Jobs (schedules)                         | Spark Jobs, Data sources (connectors), Tasks

Features, search clusters, links, AppKit apps, and custom rules are not supported in 5.2. We recommend establishing a baseline of the app configuration in the publisher environment (stage) before enabling config sync with another target environment (prod). For instance, if you need recommendations, enable the recommendation feature in the publisher environment and then export the app to GitHub. You can always run a restore operation against the publisher to export the app ZIP to GitHub after making major changes. It is uncommon to have multiple search clusters in Fusion 5, but if you do, you need to configure these for each environment manually using the API.

Fusion operators can configure which types to synchronize per app using the pub.apps config setting. Types can either be excluded or included. For instance, to exclude Spark jobs for the demo app and only include query pipelines for the app1 app, you would do:

config-sync:
  pub:
    apps: "demo: -sparkJobs; app1: queryPipelines"

Blobs are stored in Solr and not Zookeeper, so there is no ZK watch event fired when a blob is added, deleted, or updated. The publisher runs a background task that queries the blob store for changes every 30 seconds (configurable via pub.blob.pollForUpdatesMs). Alternatively, Fusion operators can disable blobs and manually migrate them using the API.
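
For example, to check the blob store for changes less frequently, you might raise the polling interval (a sketch assuming the same config-sync values nesting as the earlier examples):

config-sync:
  pub:
    blob:
      # Query the blob store for changes every 60 seconds instead of the 30-second default
      pollForUpdatesMs: 60000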

You may also want to disable job schedules and tasks as it’s very common to run jobs on a different schedule in production than you do in staging.

Subscriber Mode

In subscriber mode, config-sync polls a specific branch in GitHub for updates; you can also disable background polling and rely on REST API calls to trigger a PULL operation or a WebHook (push notification from GitHub). For instance, to propagate changes from a staging environment to a production cluster, you simply merge changes from the stage branch to the prod branch and the config-sync service will apply the changes.
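
A minimal sketch of the subscriber settings, again assuming the nesting mirrors the publisher examples; the branch key name is an assumption, while sub.git.pullRemoteUpdatesMs is the setting named in the Manual PULL section below:

config-sync:
  sub:
    git:
      # Hypothetical key name for the branch this cluster subscribes to
      branch: "prod"
      # Poll the branch for updates every 15 seconds (the default); background
      # polling can also be disabled in favor of WebHooks or manual PULL calls
      pullRemoteUpdatesMs: 15000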

Changes are applied using REST API calls based on the type of object; for example, an update to a query pipeline translates into a PUT call sent to the query service (see the sketch after the list below).

Using the API to apply changes ensures:

  1. All appropriate validation occurs before the update is applied

  2. Link objects do not have to be tracked outside Fusion

  3. Fusion can trigger any dependent changes as needed

  4. Errors are handled better; if something goes wrong, the API may be able to correct itself
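
For instance, an update to the lab4 query pipeline in the dcommerce app would be applied with a call roughly like the one below; the exact API path is an assumption for illustration, and the JSON payload comes from the repo:

curl -u $CREDS -XPUT "https://FUSION_URL/api/apps/dcommerce/query-pipelines/lab4" \
  -H "Content-Type: application/json" \
  -d @dcommerce/queryPipelines/lab4.json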

What happens if the config-sync subscriber is down when an update to the subscriber branch is applied?

First, the subscriber keeps track of the latest Git SHA1 hash it has applied to the target environment (stored in ZK). Thus, if the subscriber branch changes while the service is offline, the subscriber compares the stored SHA1 with the latest commit it receives from a PULL operation during startup and applies any changes made in between.

Restore Operation

When running in subscriber mode, config-sync supports a restore operation for an app.

curl -u $CREDS -XPUT "https://FUSION_URL/config-sync/sub/restore/dcommerce"

Restore imports an app from an export ZIP stored in the repo if the app doesn’t exist; otherwise, it applies all diffs between the current state of the subscriber branch (such as prod) and the stored last-known SHA1, if one is available. If no SHA1 is available for an app, the subscriber walks the Git repo filesystem and applies updates to all objects for the app.

Disaster Recovery Support

The config-sync service (in subscriber mode) can restore all apps from a Git repo during initialization. This implies that it can be used to restore a Fusion cluster from source control in the event an entire cluster is lost.

Manual PULL

By default, the subscriber polls the branch for updates every 15 seconds (configurable using sub.git.pullRemoteUpdatesMs). However, Fusion operators can disable this background process and instead trigger a PULL using an API call. This approach is useful for automated CD-style environments (think Spinnaker) where administrators want to integrate updates with a more complex automation pipeline. To trigger a manual PULL, simply send a POST request to the /sub endpoint in the service.

curl -u $CREDS -XPOST "https://FUSION_URL/config-sync/sub"

WebHook

GitHub can send a push notification to config-sync (via the Gateway) using a WebHook; see https://developer.github.com/webhooks/ and https://jira.lucidworks.com/browse/APOLLO-25450.

Environment-specific Substitution Variables

One of the challenges of promoting config changes from one environment to another is the need to apply environment-specific config changes. For instance, the JDBC database URL would be different for staging and production. Config-sync provides a very simple mechanism for parameterizing environment-specific settings. In subscriber mode, before applying changes to the target environment, config-sync matches substitution variables from four possible variables files:

REPO_ROOT
|__ vars.json
|__ <branch>_vars.json
|__ <APP>
     |__ vars.json
     |__ <branch>_vars.json

Each variable has an optional glob path matcher (https://javapapers.com/java/glob-with-java-nio/) and a JSON Path component (https://www.baeldung.com/guide-to-jayway-jsonpath). Variables from the <branch>_vars.json file take precedence over variables from the other files. For instance, the following variable sets any stage property named secret to the value STAGE_SECRET for the query pipeline with ID "lab4":

{
  "key": "/queryPipelines/lab4:$.stages.*.secret",
  "value": "STAGE_SECRET"
}

Here’s an example of a variable that does not have a path component, which means it will be evaluated for every object processed by the subscriber before it makes the API call:

{
  "key": "$.stages.*.paramToTag",
  "value": "QUERY"
}

Any stage that has the paramToTag property will get updated to the value QUERY.

The value for substitution variables can also come from any Kubernetes secret mounted into the config-sync service pod filesystem or from an environment variable. This allows Fusion administrators to avoid storing sensitive information in plain text in GitHub.

A K8s secret is a container of key-value pairs. Each key-value pair can be mounted as a file (key) in a directory (secret name) on the config-sync pod. For background on K8s secrets, see: https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/

{
  "key": "$.stages.*.dbPassword",
  "type": "secret",
  "secretPath": "/etc/secret-volume/db-password"
}

In this example, the dbPassword setting for any stage is populated from a K8s secret mounted in the pod at /etc/secret-volume/db-password. The secret would be created like:

kubectl create secret generic db-creds --from-literal=db-password='39528$vdg7Jb'

A K8s operator would mount the secret in the pod by doing:

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: config-sync
      volumeMounts:
        # name must match the volume name below
        - name: secret-volume
          mountPath: /etc/secret-volume
  # The secret data is exposed to Containers in the Pod through a Volume.
  volumes:
    - name: secret-volume
      secret:
        secretName: db-creds

Alternatively, you can provide the secret using an env var and the secretEnv property in the vars.json file.
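
For example, a hypothetical vars.json entry that reads the value from an environment variable instead of a mounted file; the exact shape of a secretEnv entry is an assumption based on the secretPath example above:

{
  "key": "$.stages.*.dbPassword",
  "type": "secret",
  "secretEnv": "DB_PASSWORD"
}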

Ordered Updates

The subscriber orders updates to Fusion objects based on the update type (DELETE, CREATE, UPDATE) and the resource type (changes to index pipelines are applied before index profiles, for instance). For example, a Query Profile can refer to a Query Pipeline and, optionally, an Experiment; thus, operations are performed on Query Pipelines and Experiments before Query Profiles.

However, Fusion does NOT have the concept of a distributed transaction. Consequently, it is not possible to apply all updates as an atomic unit of work, which means there is a risk that some updates from a PULL will not be applied. The config-sync service mitigates this by using the Pulsar topic with a configurable retry count (pulsar.client.maxRetries, default 30) and a 10-second delay between attempts (pulsar.client.ackNegativeRedeliveryDelay).
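
A sketch of how those retry settings might be tuned, assuming they live alongside the other config-sync settings in the Helm values and that the delay is expressed in milliseconds:

config-sync:
  pulsar:
    client:
      # Retry a failed update up to 30 times (the default) before it goes to the DLQ
      maxRetries: 30
      # Wait 10 seconds between redelivery attempts (the default)
      ackNegativeRedeliveryDelay: 10000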

Rollback

The subscriber simply pulls updates from the GitHub branch. Thus, to "roll back", a Fusion admin pushes a change to the branch that sets the desired state for Fusion objects, for example by reverting the offending commit with Git (see the sketch below). In other words, the subscriber doesn’t have its own concept of rollback; it just applies changes based on the current HEAD of the branch it watches.
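
For example, to undo an unwanted change on the prod branch, an operator can revert the offending commit and push; the subscriber then applies the reverted state on its next PULL (the commit SHA below is a placeholder):

git checkout prod
git revert <bad-commit-sha>
git push origin prod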

App Deletion

Config-sync (in subscriber mode) does not delete apps! In other words, if an app that was being maintained in the GitHub branch is no longer present, config-sync does not delete the app from the target environment. This is done for safety reasons and to allow removing an app from being synced without deleting the app in Fusion. Fusion operators can manually delete apps in the target environment using the API or Fusion UI.