Upgrade Fusion 1.2 to Fusion 2.4

These instructions are valid for Fusion 1.2.3 releases through Fusion 1.2.8.

Note

Several changes have been made to Fusion configurations stored in ZooKeeper:

  • Fusion 2.1 introduced enhanced security for Fusion datasource passwords, which are stored in ZooKeeper as part of datasource and pipeline stage configuration properties.

  • Fusion 2.4 introduces changes to the configuration properties for some Fusion datasources.

To update these configurations, we have provided two Python scripts, which can be downloaded from https://github.com/LucidWorks/fusion-upgrade-scripts.

Once you have migrated all Fusion configurations from the current Fusion 1.2.x ZooKeeper service to the new Fusion 2.4 ZooKeeper service, you must run both of these scripts against the new ZooKeeper service. This procedure is covered in detail in the section Migrate ZooKeeper data.

The upgrade process leaves the current Fusion deployment in place while a new Fusion deployment is installed and configured. All of the upgrade operations copy information from the current Fusion over to the new Fusion. This provides a rollback option should the upgrade procedure encounter problems.

The current Fusion configurations must remain as-is during the upgrade process. In order to capture indexing job history, no indexing jobs should be running. If the new Fusion installation is being installed on the same server where the current Fusion installation is running, you must either run only one version at a time or change the Fusion component server ports so that every component in both versions uses a unique port.

Terminology

These instructions use the following names to refer to the directories involved in the upgrade procedure:

  • FUSION_HOME: Absolute pathname to the top-level directory of the Fusion distribution

  • FUSION-CURRENT: Name of the FUSION_HOME directory for the current Fusion version, e.g. "/opt/lucidworks/fusion-2.1.2"

  • FUSION-NEW: Name of the directory of the upgrade Fusion distribution during the upgrade process, e.g. "/opt/lucidworks/fusion-2.4.1"

  • INSTALL-DIR: Directory where the new Fusion version will be installed, e.g. "/opt/lucidworks". All scripts and commands in the upgrade instruction set are carried out from this directory.

  • FUSION-UPGRADE-SCRIPTS: Full path to the directory that contains the upgrade scripts from https://github.com/LucidWorks/fusion-upgrade-scripts.

Requirements

  • File-system permissions: the user running the upgrade scripts and commands must have read/write/execute (rwx) permissions on directory INSTALL-DIR.

  • Download but do not unpack a copy of the FUSION-NEW distribution. The compressed Fusion distribution requires approximately 1.7 GB of disk space. All supported versions are available from the Lucidworks Fusion Get Started page.

  • Disk space requirements: the INSTALL-DIR must be on a disk partition which has enough free space for the complete FUSION-NEW installation, that is, there must be at least as much free space as the size of the FUSION-CURRENT directory. On a Unix system, the following commands can be used:

    • du -sh fusion - total size of FUSION-CURRENT.

    • df -kH - amount of free space on all file-systems.

  • Download a copy of the Fusion upgrade scripts from the GitHub repository https://github.com/LucidWorks/fusion-upgrade-scripts. These upgrade scripts run under Python 2.7 and have been tested with version 2.7.10. If this version of Python isn’t available, use Python’s virtualenv tool. If you don’t have permission to install packages system-wide, you can use pip to install virtualenv into your user environment and then install the required packages inside a virtualenv.

Note
These scripts require the environment variable FUSION_OLD_HOME which should be set to the location of the current Fusion installation, i.e., the existing 1.2 or 2.1 install.
  • Upgrades from 2.1 to 2.4 use the script src/upgrade-ds-2.1-to-2.4.py. This script requires the python package kazoo which is a ZooKeeper client.

  • Upgrades from 1.2 to 2.4 use two scripts: src/upgrade-ds-1.2-to-2.4.py and bin/download_upload_ds_pipelines.py. These scripts require the Python packages kazoo, a ZooKeeper client, and requests, an HTTP library.

Procedure

Unpack FUSION-NEW

  • Current working directory must be INSTALL-DIR
    The commands in this section assume that your current working directory is INSTALL-DIR (e.g., "/opt/lucidworks"), so cd to this directory before continuing.

  • Avoid directory name conflicts between FUSION-CURRENT and FUSION-NEW
    By default, the Fusion distribution unpacks into a directory named "fusion". If the INSTALL-DIR is the directory which contains the FUSION-CURRENT directory and if the FUSION-CURRENT directory is named "fusion", then you must create a new directory with a different name into which to unpack the Fusion distribution. For example, if your INSTALL-DIR is "/opt/lucidworks" and your FUSION-CURRENT directory is "/opt/lucidworks/fusion", then you should create a directory named "fusion-new" and unpack the contents of the distribution there:

> mkdir fusion-new
> tar -C fusion-new --strip-components=1 -xf fusion-2.4.1.tar.gz

If you are working on a Windows machine, the zipfile unzips into a folder named "fusion-2.4.1" which contains a folder named "fusion". Rename folder "fusion" to "fusion-new" and move it into folder INSTALL-DIR.

Customize FUSION-NEW configuration files and run scripts

The Fusion run scripts in the FUSION_HOME/bin directory start and stop Fusion and its component services. The Fusion configuration files in FUSION_HOME/conf define environment variables used by the Fusion run scripts. The configuration and run scripts for the FUSION-NEW installation must be edited by hand; you cannot simply copy over the existing scripts from the current installation.

The Fusion configuration scripts need to be updated only if you have changed default settings. In particular, they will need to be updated for deployments that:

  • Use an external ZooKeeper cluster as Fusion’s ZooKeeper service

  • Use an external Solr cluster to manage Fusion’s system collections

  • Run on non-standard ports

  • Have been configured to run over SSL

To help identify changes made to the current installation, the FUSION-UPGRADE-SCRIPTS repository contains a directory "reference-files" with copies of the contents of these directories for all Fusion releases. To identify changes, use the Unix diff command with the -r flag; e.g., if FUSION-CURRENT is 2.1.1, then these diff commands will report the set of changed files and the changes that were made:

> diff -r FUSION-CURRENT/bin FUSION-UPGRADE-SCRIPTS/reference-files/bin-2.1.1
> diff -r FUSION-CURRENT/conf FUSION-UPGRADE-SCRIPTS/reference-files/conf-2.1.1

A copy of Fusion is installed on every node in a Fusion deployment, and the configuration settings and run scripts are customized according to the role each node plays. Therefore, if you are running a multi-node Fusion deployment, this configuration step must be carried out on each node in the cluster.

In Fusion 1.2, the FUSION_HOME/bin directory contains both the Fusion run scripts and the helper scripts which define common settings and environment variables. In Fusion 2.1, the configuration files config.sh and config.cmd were moved to the FUSION_HOME/conf directory.

Checking a 1.2 installation against the reference scripts for that release requires only a single diff command:

> diff -r FUSION-CURRENT/bin FUSION-UPGRADE-SCRIPTS/reference-files/bin-1.2.3

If either the "config.sh" or the "config.cmd" file has changed, note that the corresponding files in a Fusion 2 release are located in the FUSION_HOME/conf directory.

Copy local data stores in the directory FUSION-CURRENT/data

The directory FUSION_HOME/data contains the on-disk data stores managed directly or indirectly by Fusion services.

  • FUSION_HOME/data/connectors contains data required by Fusion connectors.

    • FUSION_HOME/data/connectors/lucid.jdbc contains third-party JDBC driver files. If your application uses a JDBC connector, you must copy this information over to every server on which this connector will run.

    • FUSION_HOME/data/connectors/crawldb contains information on the files visited during a crawl. (Preserving crawldb history may not be possible if there are multiple different servers running Fusion connector services.)

  • FUSION_HOME/data/nlp contains data used by Fusion NLP pipeline stages. If you are using Fusion’s NLP components for sentence detection, part-of-speech tagging, and named entity detection, you must copy over the model files stored under this directory.

  • FUSION_HOME/data/solr contains the backing store for Fusion’s embedded Solr (developer deployment only).

  • FUSION_HOME/data/zookeeper contains the backing store for Fusion’s embedded ZooKeeper (developer deployment only).

If FUSION-CURRENT and FUSION-NEW are installed on the same server, you can copy a subset of these directories using the Unix "cp" command, e.g.:

> cp -R FUSION-CURRENT/data/connectors/lucid.jdbc FUSION-NEW/data/connectors
> cp -R FUSION-CURRENT/data/connectors/crawldb FUSION-NEW/data/connectors
> cp -R FUSION-CURRENT/data/nlp FUSION-NEW/data/

If FUSION-CURRENT and FUSION-NEW are on different servers, use the Unix rsync utility.
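For example, the rsync equivalents of the cp commands above look like this ("user@newhost" is a placeholder for your login on the target server; -a preserves permissions and timestamps, -z compresses data in transit):

> rsync -az FUSION-CURRENT/data/connectors/lucid.jdbc user@newhost:FUSION-NEW/data/connectors/
> rsync -az FUSION-CURRENT/data/connectors/crawldb user@newhost:FUSION-NEW/data/connectors/
> rsync -az FUSION-CURRENT/data/nlp user@newhost:FUSION-NEW/data/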

Migrate ZooKeeper and Solr for single-node Fusion deployment

If you are running a single-node Fusion deployment and using both the embedded ZooKeeper and the embedded Solr that ships with this distribution, then you must copy over both the configurations and data.

To copy the ZooKeeper configuration:

> mkdir -p FUSION-NEW/data/zookeeper
> cp -R FUSION-CURRENT/solr/zoo_data/* FUSION-NEW/data/zookeeper

To check your work: compare the directories FUSION-CURRENT/solr/zoo_data/ and FUSION-NEW/data/zookeeper using the diff command. This command succeeds silently when the contents are the same.

> diff -r FUSION-CURRENT/solr/zoo_data FUSION-NEW/data/zookeeper

To copy the Solr data:

> find FUSION-CURRENT/solr -maxdepth 1 -mindepth 1 | grep -v -E "zoo*" | while read f ; do cp -R $f FUSION-NEW/data/solr/; done

If the Solr collections are very large, this may take a while.

You can use the diff command to check your work. The copy command excluded ZooKeeper config data, therefore you should see the following output:

> diff -r FUSION-CURRENT/solr FUSION-NEW/data/solr
Only in FUSION-NEW/data/solr: configsets
Only in FUSION-CURRENT/solr: zoo.cfg
Only in FUSION-CURRENT/solr: zoo_data

Migrate Fusion configurations between ZooKeeper instances

Migration consists of three steps:

  • Copy the ZooKeeper data nodes which contain Fusion configuration information from the FUSION-CURRENT ZooKeeper instance to the FUSION-NEW ZooKeeper instance

    Fusion’s utility script zkImportExport.sh is used to copy ZooKeeper data between ZooKeeper clusters. This script is included with all Fusion distributions in the top-level directory named scripts.

  • Rewrite Fusion datasource configurations

    Fusion 2.4 changed and standardized the configuration properties used by several datasources. The public GitHub repository https://github.com/LucidWorks/fusion-upgrade-scripts contains a Python script src/upgrade-ds-1.2-to-2.4.py which rewrites these properties.

  • Rewrite stored password information used by Fusion datasources and pipelines.

    Fusion 2 encrypts all passwords used by datasources and pipelines to access password-protected data repositories. The public GitHub repository https://github.com/LucidWorks/fusion-upgrade-scripts contains a Python script bin/download_upload_ds_pipelines.py used to edit the stored password information.

Copying ZooKeeper data nodes

Note
This step is not necessary if you are doing an in-place upgrade of a single-node Fusion deployment; the copy procedure for single-node Fusion ZooKeeper data (above) is sufficient.

Fusion configurations are stored in Fusion’s ZooKeeper instance under two top-level znodes:

  • Node lucid stores all application-specific configurations, including collections, datasources, pipelines, signals, aggregations, and their associated schedules, jobs, and metrics.

  • Node lucid-apollo-admin stores all access control information, including all users, groups, roles, and realms.
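As an aside, these two znodes can be browsed with the same kazoo package the upgrade scripts use. A minimal sketch, assuming kazoo is installed and a ZooKeeper service is reachable at the placeholder address localhost:9983:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:9983")  # placeholder connect string
zk.start()
print(zk.get_children("/lucid"))               # application configurations
print(zk.get_children("/lucid-apollo-admin"))  # access-control information
zk.stop()
```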

Fusion’s utility script zkImportExport.sh is used to migrate ZooKeeper data between ZooKeeper clusters. Migrating configuration information from one deployment to another requires running this script twice:

  • The first invocation runs the script in "export" mode, in order to get the set of configurations to be migrated as a JSON dump file.

  • The second invocation runs the script in "import" or "update" mode, in order to send this configuration set to the other Fusion deployment.

When running this script against a Fusion deployment, it is advisable to stop all Fusion services except for Fusion’s ZooKeeper service.

Exporting Fusion configurations from FUSION-CURRENT ZooKeeper Service

The ZooKeeper service for FUSION-CURRENT must be running. Either stop all other Fusion services or otherwise ensure that no changes to Fusion configurations take place during this procedure. If you are upgrading from a Fusion 1.2 installation which uses Fusion’s embedded Solr service and the ZooKeeper service included with that Solr installation, then starting just the Solr service will start the ZooKeeper service as well. If you are upgrading from a Fusion 2 installation, you can start just the ZooKeeper service via the script "zookeeper" in the $FUSION_HOME/bin directory.

The zkImportExport.sh script arguments are:

  • -cmd export - This is the command parameter which specifies the mode in which to run this program.

  • -zkhost <FUSION-CURRENT ZK> - The ZooKeeper connect string is the comma-separated list of server:port pairs for the FUSION-CURRENT ZooKeeper cluster. For example, if running a single-node Fusion developer deployment with embedded ZooKeeper, the connect string is localhost:9983. If you have an external 3-node ZooKeeper cluster running on servers "zk1.acme.com", "zk2.acme.com", "zk3.acme.com", all listening on port 2181, then the connect string is zk1.acme.com:2181,zk2.acme.com:2181,zk3.acme.com:2181.

  • -filename <path/to/JSON/dump/file> - The name of the JSON dump file to save to.

  • -path <start znode>

    • To migrate all ZooKeeper data, the path is "/".

    • To migrate only the Fusion services configurations, the path is "/lucid". Migrating just the "lucid" node between the ZooKeeper services used by different Fusion deployments results in deployments which contain the same applications but not the same user databases.

    • To migrate the Fusion users, groups, roles, and realms information, the path is "/lucid-apollo-admin".
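The connect-string format described for -zkhost above is simple enough to pin down with a one-line helper (a hypothetical function, shown for illustration only):

```python
def zk_connect_string(hosts, port=2181):
    """Join hostnames into ZooKeeper's host:port,host:port connect string."""
    return ",".join("{0}:{1}".format(host, port) for host in hosts)

# The 3-node external cluster from the -zkhost description above:
print(zk_connect_string(["zk1.acme.com", "zk2.acme.com", "zk3.acme.com"]))
# -> zk1.acme.com:2181,zk2.acme.com:2181,zk3.acme.com:2181
```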

Example of exporting Fusion configurations for znode "/lucid" from a local single-node ZooKeeper service:

> $FUSION_HOME/scripts/zkImportExport.sh -zkhost localhost:9983 -cmd export -path /lucid -filename znode_lucid_dump.json

Importing ZooKeeper data into FUSION-NEW

The ZooKeeper service for FUSION-NEW must be running.

To import configurations, run the zkImportExport.sh script, this time with arguments:

  • -cmd import - The command mode.

  • -zkhost <FUSION-NEW ZK> - The ZooKeeper connect string for the FUSION-NEW ZooKeeper cluster.

  • -filename <path/to/JSON/dump/file> - The JSON dump file created during the export step.

This command will fail if the "lucid" znode in this Fusion installation contains configuration definitions which are in conflict with the exported data.

Example of importing the data exported in the previous step into the FUSION-NEW ZooKeeper service running on test server 'test.acme.com':

> $FUSION_HOME/scripts/zkImportExport.sh -zkhost test.acme.com:9983 -cmd import -filename znode_lucid_dump.json

Note that the above command will fail if existing znode structures or contents in the ZooKeeper service conflict with the dump file.

Rewrite datasource configurations for Fusion 2.4

Once all Fusion configurations have been uploaded to the FUSION-NEW ZooKeeper service and while that service is running, you can run the Python program upgrade-ds-2.1-to-2.4.py or upgrade-ds-1.2-to-2.4.py, as appropriate, to update these configurations.

Note

These programs require:

  • The environment variable "FUSION_HOME" must be set to the FUSION-NEW directory.

  • The environment variable "FUSION_OLD_HOME" must be set to the FUSION-CURRENT directory.

  • Python version 2.7, preferably version 2.7.10.

  • Package: kazoo - a ZooKeeper client

The Python virtualenv tool can be used to install the correct Python version and required package.

Set the environment variable "FUSION_HOME" to the full path of the FUSION-NEW directory and "FUSION_OLD_HOME" to the full path of the FUSION-CURRENT directory, e.g.:

> export FUSION_HOME=/Users/demo/test_upgrade/fusion_2_4_1
> export FUSION_OLD_HOME=/Users/demo/test_upgrade/fusion_1_2_8

Run the script that matches your current Fusion version, with the argument "--datasources all".

If your current Fusion version is 1.2, run:

> python upgrade-ds-1.2-to-2.4.py --datasources all

If your current Fusion version is 2.1, run:

> python upgrade-ds-2.1-to-2.4.py --datasources all

If a datasource does not have a valid implementation, the script prints a log message to the console and continues with the next datasource.

Rewrite stored password information used by Fusion datasources and pipelines

Once you have migrated all Fusion configurations to the FUSION-NEW ZooKeeper service, you must update the migrated datasource configurations by running the script download_upload_ds_pipelines.py against the FUSION-NEW ZooKeeper service. This rewrites any stored passwords that are specified as part of the configuration for a datasource or pipeline.

Note

The script bin/download_upload_ds_pipelines.py requires:

  • Python version 2.7, preferably version 2.7.10.

  • Package: kazoo - a ZooKeeper client

  • Package: requests - an HTTP request handler

  • Environment variable FUSION_OLD_HOME set to the location of the Fusion 1.2 installation.

The Python virtualenv tool can be used to install the correct Python version and required packages.

The rewrite process consists of a download step which exports the ZooKeeper configuration information and an upload step which rewrites the information and then imports it back into ZooKeeper.

This script uses the following arguments and values:

  • "--zk-connect": the ZooKeeper server:port for FUSION-NEW

  • "--action": either "download" or "upload".

  • "--fusion-url": URL of Fusion API service to upload configurations to

  • "--fusion-username": name of a Fusion user with admin privileges; the script will prompt for that user’s password.

Download configurations from ZooKeeper

No services for FUSION-NEW should be running, except for ZooKeeper. If your Fusion installation uses an external ZooKeeper, then this must be running. If your Fusion installation uses an embedded ZooKeeper, then you must have copied the ZooKeeper data from FUSION-CURRENT to FUSION-NEW.

Start the ZooKeeper service:

> FUSION-NEW/bin/zookeeper start

Run the script to download the configurations.

> python FUSION-UPGRADE-SCRIPTS/bin/download_upload_ds_pipelines.py \
 --zk-connect localhost:9983 --action download

To check your work, verify that the directory "fusion_upgrade_2.1" was created and that it contains definitions for all datasources and pipelines. Do not remove this directory until you have successfully completed the upload step.

If you are running embedded ZooKeeper, shut it down again:

> FUSION-NEW/bin/zookeeper stop

Upload configurations to the Fusion API service

Start FUSION-NEW:

> FUSION-NEW/bin/fusion start

Once it is running, run the script in upload mode to propagate the configurations in directory "fusion_upgrade_2.1".

At this point in the migration process, the FUSION-NEW ZooKeeper information is the same as the FUSION-CURRENT ZooKeeper information; therefore the password for the admin user is the same.

To upload data to the Fusion API services, you must supply the admin username and password as arguments to the script:

  • "--fusion-username": name of Fusion user with admin privileges

  • "--fusion-password": password for Fusion user

> FUSION-NEW/bin/fusion start
> python FUSION-UPGRADE-SCRIPTS/bin/download_upload_ds_pipelines.py \
 --zk-connect localhost:9983 --action upload --fusion-url http://localhost:8764/api \
 --fusion-username <admin>

Copy and convert the crawldb

The Fusion "crawldb" records the results of running datasource jobs. This information must be copied from FUSION-CURRENT to FUSION-NEW and the data format must be converted to the format used in Fusion 2.1 via the conversion utility com.lucidworks.fusion-crawldb-migrator-0.1.0.jar.

Copy the Fusion "crawldb" directory:

> cp -R FUSION-CURRENT/data/connectors/crawldb FUSION-NEW/data/connectors/

The crawldb data format changed in Fusion 2.1; therefore, to upgrade to 2.1 or later, the crawldb data must be converted with the conversion utility com.lucidworks.fusion-crawldb-migrator-0.1.0.jar.

The anda-v1-v2 command allows Fusion 1.2.x connector DBs to be updated to the new v2.x format. It requires:

  • A Fusion pre 2.1 installation (FUSION-CURRENT)

  • A Fusion 2.1 or later installation (FUSION-NEW).

    • All FUSION-CURRENT datasource configurations must have been migrated to FUSION-NEW (see Migrate ZooKeeper data)

    • All FUSION-CURRENT crawldb files must have been copied over to the FUSION-NEW deployment.

If the FUSION-NEW installation is not currently running, start it:

> FUSION-NEW/bin/fusion start

The anda-v1-v2 command takes the following arguments:

  • path-to-FUSION-CURRENT

  • path-to-FUSION-NEW

  • the -z flag specifies the ZooKeeper server:port for FUSION-NEW

The command to run this utility from the INSTALL-DIR is:

> java -jar FUSION-UPGRADE-SCRIPTS/bin/com.lucidworks.fusion-crawldb-migrator-0.1.0.jar anda-v1-v2 fusion fusion-new -z localhost:9983

Once the task successfully completes, the last few lines of logging show the output directory of the new DB files. The output must be copied over to FUSION-NEW. To do this, remove the existing lucid.anda db directories, then copy the new lucid.anda directories generated from this utility into that same location:

> rm -Rf FUSION-NEW/data/connectors/crawldb/lucid.anda/*
> mv ${path-printed-from-command-output} FUSION-NEW/data/connectors/crawldb/lucid.anda/

This completes the upgrade process.

Troubleshooting the upgrade

  • Clear your browser cache after starting the UI in the new Fusion instance

  • The Fusion 2.4 Index Pipeline Simulator can be used to verify that the existing set of datasource configurations work as expected.