Upgrade Fusion 1.2 to Fusion 2.4
- Terminology
- Requirements
- Procedure
- Unpack FUSION-NEW
- Customize FUSION-NEW configuration files and run scripts
- Copy local data stores in the directory FUSION-CURRENT/data
- Migrate ZooKeeper and Solr for single-node Fusion deployment
- Migrate Fusion configurations between ZooKeeper instances
- Copying ZooKeeper data nodes
- Exporting Fusion configurations from FUSION-CURRENT ZooKeeper Service
- Importing ZooKeeper data into FUSION-NEW
- Rewrite datasource configurations for Fusion 2.4
- Rewrite stored password information used by Fusion datasources and pipelines
- Copy and convert the crawldb
- Troubleshooting the upgrade
These instructions are valid for Fusion 1.2.3 releases through Fusion 1.2.8.
Note
|
Several changes have been made to Fusion configurations stored in ZooKeeper: datasource configuration properties have been changed and standardized, and passwords used by datasources and pipelines are now stored encrypted.
To update these configurations, we have provided two Python scripts, which can be downloaded from https://github.com/LucidWorks/fusion-upgrade-scripts. Once you have migrated all Fusion configurations from the current Fusion 1.2.x ZooKeeper service to the new Fusion 2.4 ZooKeeper service, you must run both of these scripts against the new ZooKeeper service. This procedure is covered in detail in the section Migrate ZooKeeper data. |
The upgrade process leaves the current Fusion deployment in place while a new Fusion deployment is installed and configured. All of the upgrade operations copy information from the current Fusion over to the new Fusion. This provides a rollback option should the upgrade procedure encounter problems.
The current Fusion configurations must remain as-is during the upgrade process, and no indexing jobs should be running, so that indexing job history is captured completely. If the new Fusion installation is installed on the same server as the current Fusion installation, you must either run only one version at a time or change the Fusion component server ports so that the current and new versions use unique ports for every component.
Terminology
These instructions use the following names to refer to the directories involved in the upgrade procedure:
- FUSION_HOME: Absolute pathname to the top-level directory of a Fusion distribution.
- FUSION-CURRENT: Name of the FUSION_HOME directory for the current Fusion version, e.g. "/opt/lucidworks/fusion-2.1.2".
- FUSION-NEW: Name of the directory of the upgrade Fusion distribution during the upgrade process, e.g. "/opt/lucidworks/fusion-2.4.1".
- INSTALL-DIR: Directory where the new Fusion version will be installed, e.g. "/opt/lucidworks". All scripts and commands in the upgrade instruction set are carried out from this directory.
- FUSION-UPGRADE-SCRIPTS: Full path to the directory that contains the upgrade scripts from https://github.com/LucidWorks/fusion-upgrade-scripts.
Requirements
- File-system permissions: the user running the upgrade scripts and commands must have read/write/execute (rwx) permissions on the directory INSTALL-DIR.
- Download but do not unpack a copy of the FUSION-NEW distribution. The compressed Fusion distribution requires approximately 1.7 GB of disk space. All supported versions are available from the Lucidworks Fusion Get Started page.
- Disk space requirements: the INSTALL-DIR must be on a disk partition which has enough free space for the complete FUSION-NEW installation, that is, there must be at least as much free space as the size of the FUSION-CURRENT directory. On a Unix system, the following commands can be used:
  - du -sh fusion - reports the total size of FUSION-CURRENT.
  - df -kH - reports the amount of free space on all file systems.
- Download a copy of the Fusion upgrade scripts from the GitHub repository https://github.com/LucidWorks/fusion-upgrade-scripts. These upgrade scripts run under Python 2.7 and have been tested with version 2.7.10. If this version of Python isn’t available, you should use Python’s virtualenv. If you don’t have permissions to install packages system-wide, you can use Python to install virtualenv and then, from your virtualenv Python environment, install your own versions of these packages (see the sketch at the end of this section).
Note
|
These scripts require the environment variable FUSION_OLD_HOME which should
be set to the location of the current Fusion installation, i.e., the existing 1.2 or 2.1 install.
|
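For example, on a Unix system you can set this variable as follows (the path shown is only an illustration; substitute your own FUSION-CURRENT location):
> export FUSION_OLD_HOME=/opt/lucidworks/fusion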
- Upgrades from 2.1 to 2.4 use the script src/upgrade-ds-2.1-to-2.4.py. This script requires the Python package kazoo, which is a ZooKeeper client.
- Upgrades from 1.2 to 2.4 use two scripts: src/upgrade-ds-1.2-to-2.4.py and bin/download_upload_ds_pipelines.py. These scripts require the Python packages kazoo and requests; requests is an HTTP request handler.
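If you need an isolated Python 2.7 environment for these scripts, a minimal setup might look like the following (a sketch only, assuming pip and virtualenv can be installed on your system; the environment name "fusion-upgrade-env" is just an example):
> pip install virtualenv
> virtualenv --python=python2.7 fusion-upgrade-env
> source fusion-upgrade-env/bin/activate
> pip install kazoo requests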
Procedure
Unpack FUSION-NEW
- Current working directory must be INSTALL-DIR
The commands in this section assume that your current working directory is INSTALL-DIR (e.g., "/opt/lucidworks"), therefore cd to this directory before continuing.
- Avoid directory name conflicts between FUSION-CURRENT and FUSION-NEW
By default, the Fusion distribution unpacks into a directory named "fusion". If the INSTALL-DIR is the directory which contains the FUSION-CURRENT directory and the FUSION-CURRENT directory is named "fusion", then you must create a new directory with a different name into which to unpack the Fusion distribution. For example, if your INSTALL-DIR is "/opt/lucidworks" and your FUSION-CURRENT directory is "/opt/lucidworks/fusion", then create a directory named "fusion-new" and unpack the contents of the distribution there:
> mkdir fusion-new
> tar -C fusion-new --strip-components=1 -xf fusion-2.4.1.tar.gz
If you are working on a Windows machine, the zipfile unzips into a folder named "fusion-2.4.1" which contains a folder named "fusion". Rename folder "fusion" to "fusion-new" and move it into folder INSTALL-DIR.
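On Windows, the equivalent steps from a Command Prompt might look like this (a sketch only; "C:\lucidworks" is a placeholder for your INSTALL-DIR and the commands assume the zip file was unpacked in the current folder):
> ren fusion-2.4.1\fusion fusion-new
> move fusion-2.4.1\fusion-new C:\lucidworks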
Customize FUSION-NEW configuration files and run scripts
The Fusion run scripts in the FUSION_HOME/bin directory start and stop Fusion and its component services.
The Fusion configuration files in FUSION_HOME/conf define environment variables used by the Fusion run scripts.
The configuration and run scripts for the FUSION-NEW installation must be edited by hand;
you cannot copy over the existing scripts from the current installation.
The Fusion configuration scripts might need to be updated if you have changed default settings. In particular, they will need to be updated for deployments that:
- Use an external ZooKeeper cluster as Fusion’s ZooKeeper service
- Use an external Solr cluster to manage Fusion’s system collections
- Run on non-standard ports
- Have been configured to run over SSL
To facilitate the task of identifying changes made to the current installation,
the FUSION-UPGRADE-SCRIPTS repository contains a directory "reference-files" which
contains copies of the contents of these directories for all Fusion releases.
To identify changes, use the Unix diff
command with the -r
flag; e.g., if FUSION-CURRENT is 2.1.1,
then these diff commands will report the set of changed files and the changes that were made:
> diff -r FUSION-CURRENT/bin FUSION-UPGRADE-SCRIPTS/reference-files/bin-2.1.1
> diff -r FUSION-CURRENT/conf FUSION-UPGRADE-SCRIPTS/reference-files/conf-2.1.1
A copy of Fusion is installed on every node in a Fusion deployment. Depending on the role that node plays in the deployment, the configuration settings and run scripts are customized accordingly. Therefore, if you are running a multi-node Fusion deployment this configuration step will be carried out for each node in the cluster.
In Fusion 1.2,
the FUSION_HOME/bin
directory contains both the Fusion run scripts and
the helper scripts which define common settings and environment variables.
In Fusion 2.1, the configuration files config.sh
and config.cmd
have been moved to directory FUSION_HOME/conf
.
Checking a 1.2 installation against the reference scripts for that release requires only a single diff command:
> diff -r FUSION-CURRENT/bin FUSION-UPGRADE-SCRIPTS/reference-files/bin-1.2.3
If either the config.sh or config.cmd file has changed, note that the corresponding files for the Fusion 2 release are located in the directory FUSION_HOME/conf.
Copy local data stores in the directory FUSION-CURRENT/data
The directory FUSION_HOME/data
contains the on-disk data stores
managed directly or indirectly by Fusion services.
- FUSION_HOME/data/connectors contains data required by Fusion connectors.
  - FUSION_HOME/data/connectors/lucid.jdbc contains third-party JDBC driver files. If your application uses a JDBC connector, you must copy this information over to every server on which this connector will run.
  - FUSION_HOME/data/connectors/crawldb contains information on the files visited during a crawl. (Preserving crawldb history may not be possible if there are multiple different servers running Fusion connectors services.)
- FUSION_HOME/data/nlp contains data used by Fusion NLP pipeline stages. If you are using Fusion’s NLP components for sentence detection, part-of-speech tagging, and named entity detection, you must copy over the model files stored under this directory.
- FUSION_HOME/data/solr contains the backing store for Fusion’s embedded Solr (developer deployment only).
- FUSION_HOME/data/zookeeper contains the backing store for Fusion’s embedded ZooKeeper (developer deployment only).
If FUSION-CURRENT and FUSION-NEW are installed on the same server, you can copy a subset of these directories using the Unix "cp" command, e.g.:
> cp -R FUSION-CURRENT/data/connectors/lucid.jdbc FUSION-NEW/data/connectors
> cp -R FUSION-CURRENT/data/connectors/crawldb FUSION-NEW/data/connectors
> cp -R FUSION-CURRENT/data/nlp FUSION-NEW/data/
If FUSION-CURRENT and FUSION-NEW are on different servers, use the Unix rsync utility.
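For example, a transfer of the connectors and NLP data to a remote server might look like this (an illustrative sketch; "user" and "new-server" are placeholders, and the destination directories must already exist on the remote machine):
> rsync -avz FUSION-CURRENT/data/connectors/ user@new-server:FUSION-NEW/data/connectors/
> rsync -avz FUSION-CURRENT/data/nlp/ user@new-server:FUSION-NEW/data/nlp/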
Migrate ZooKeeper and Solr for single-node Fusion deployment
If you are running a single-node Fusion deployment and using both the embedded ZooKeeper and the embedded Solr that ships with this distribution, then you must copy over both the configurations and data.
To copy the ZooKeeper configuration:
> mkdir -p FUSION-NEW/data/zookeeper
> cp -R FUSION-CURRENT/solr/zoo_data/* FUSION-NEW/data/zookeeper
To check your work: compare the directories FUSION-CURRENT/solr/zoo_data/
and FUSION-NEW/data/zookeeper
using the diff
command. This command succeeds silently when the contents are the same.
> diff -r FUSION-CURRENT/solr/zoo_data FUSION-NEW/data/zookeeper
To copy the Solr data:
> find FUSION-CURRENT/solr -maxdepth 1 -mindepth 1 | grep -v -E "zoo*" | while read f ; do cp -R $f FUSION-NEW/data/solr/; done
If the Solr collections are very large this may take a while.
You can use the diff
command to check your work.
The copy command excluded ZooKeeper config data, therefore
you should see the following output:
> diff -r FUSION-CURRENT/solr FUSION-NEW/data/solr
Only in FUSION-NEW/data/solr: configsets
Only in FUSION-CURRENT/solr: zoo.cfg
Only in FUSION-CURRENT/solr: zoo_data
Migrate Fusion configurations between ZooKeeper instances
Migration consists of three steps:
- Copy the ZooKeeper data nodes which contain Fusion configuration information from the FUSION-CURRENT ZooKeeper instance to the FUSION-NEW ZooKeeper instance.
Fusion’s utility script zkImportExport.sh is used to copy ZooKeeper data between ZooKeeper clusters. This script is included with all Fusion distributions in the top-level directory named scripts.
- Rewrite Fusion datasource configurations.
Fusion 2.4 changed and standardized the configuration properties used by several datasources. The public GitHub repository https://github.com/LucidWorks/fusion-upgrade-scripts contains a Python script src/upgrade-ds-1.2-to-2.4.py which rewrites these properties.
- Rewrite stored password information used by Fusion datasources and pipelines.
Fusion 2 encrypts all passwords used by datasources and pipelines to access password-protected data repositories. The public GitHub repository https://github.com/LucidWorks/fusion-upgrade-scripts contains a Python script bin/download_upload_ds_pipelines.py used to edit the stored password information.
Copying ZooKeeper data nodes
Note
|
This step is not necessary if you are doing an in-place upgrade of a single-node Fusion deployment; the copy command described in the single-node procedure above is sufficient. |
Fusion configurations are stored in Fusion’s ZooKeeper instance under two top-level znodes:
- Node lucid stores all application-specific configurations, including collections, datasources, pipelines, signals, aggregations, and associated scheduling, jobs, and metrics.
- Node lucid-apollo-admin stores all access control information, including all users, groups, roles, and realms.
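To see these znodes for yourself, you can list the root of the ZooKeeper tree with the standard ZooKeeper command-line client (this assumes a ZooKeeper client such as zkCli.sh is available on your system; the connect string below is the single-node embedded default):
> zkCli.sh -server localhost:9983 ls /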
Fusion’s utility script zkImportExport.sh is used to migrate ZooKeeper data between ZooKeeper clusters. Migrating configuration information from one deployment to another requires running this script twice:
- The first invocation runs the script in "export" mode, in order to get the set of configurations to be migrated as a JSON dump file.
- The second invocation runs the script in "import" or "update" mode, in order to send this configuration set to the other Fusion deployment.
When running this script against a Fusion deployment, it is advisable to stop all Fusion services except for Fusion’s ZooKeeper service.
Exporting Fusion configurations from FUSION-CURRENT ZooKeeper Service
The ZooKeeper service for FUSION-CURRENT must be running. Either stop all other Fusion services
or otherwise ensure that no changes to Fusion configurations take place during this procedure.
If you are upgrading from a Fusion 1.2 installation which uses Fusion’s embedded Solr service and
the ZooKeeper service included with that Solr installation, then starting just the Solr service
will start the ZooKeeper service as well.
If you are upgrading from a Fusion 2 installation, you can start just the ZooKeeper service
via the script "zookeeper" in the $FUSION_HOME/bin
directory.
The zkImportExport.sh script arguments are:
- -cmd export - The command parameter, which specifies the mode in which to run this program.
- -zkhost <FUSION-CURRENT ZK> - The ZooKeeper connect string, i.e. the list of server:port pairs for the FUSION-CURRENT ZooKeeper cluster. For example, if running a single-node Fusion developer deployment with embedded ZooKeeper, the connect string is localhost:9983. If you have an external 3-node ZooKeeper cluster running on servers "zk1.acme.com", "zk2.acme.com", "zk3.acme.com", all listening on port 2181, then the connect string is zk1.acme.com:2181,zk2.acme.com:2181,zk3.acme.com:2181.
- -filename <path/to/JSON/dump/file> - The name of the JSON dump file to save to.
- -path <start znode> - The znode at which to start the export:
  - To migrate all ZooKeeper data, the path is "/".
  - To migrate only the Fusion services configurations, the path is "/lucid". Migrating just the "lucid" node between the ZooKeeper services used by different Fusion deployments results in deployments which contain the same applications but not the same user databases.
  - To migrate the Fusion users, groups, roles, and realms information, the path is "/lucid-apollo-admin".
Example of exporting Fusion configurations for znode "/lucid" from a local single-node ZooKeeper service:
> $FUSION_HOME/scripts/zkImportExport.sh -zkhost localhost:9983 -cmd export -path /lucid -filename znode_lucid_dump.json
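If you also want to carry over users, groups, roles, and realms, you can take a second export of the "/lucid-apollo-admin" znode in the same way (the dump file name here is only an example):
> $FUSION_HOME/scripts/zkImportExport.sh -zkhost localhost:9983 -cmd export -path /lucid-apollo-admin -filename znode_lucid_apollo_admin_dump.json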
Importing ZooKeeper data into FUSION-NEW
The ZooKeeper service for FUSION-NEW must be running.
To import configurations, run the zkImportExport.sh script again, this time with these arguments:
- -cmd import - the command; must be import
- -zkhost <FUSION-NEW ZK> - the ZooKeeper connect string for the FUSION-NEW ZooKeeper cluster
- -filename <path/to/JSON/dump/file> - the location of the JSON dump file created in the export step
This command will fail if the "lucid" znode in this Fusion installation contains configuration definitions which are in conflict with the exported data.
Example of importing the exported data from the previous step into the FUSION-NEW ZooKeeper running on test server 'test.acme.com':
> $FUSION_HOME/scripts/zkImportExport.sh -zkhost test.acme.com:9983 -cmd import -filename znode_lucid_dump.json
Note that the above command will fail if existing znode structures or contents in the ZooKeeper service conflict with those in the dump file.
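If the import fails because of such a conflict, the script can instead be run in "update" mode (mentioned above), which is intended to apply the dump file on top of existing znodes; this is a sketch of such an invocation, and you should check the script’s usage output for the exact semantics in your release:
> $FUSION_HOME/scripts/zkImportExport.sh -zkhost test.acme.com:9983 -cmd update -filename znode_lucid_dump.json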
Rewrite datasource configurations for Fusion 2.4
Once all Fusion configurations have been uploaded to the FUSION-NEW ZooKeeper service and while that service is running, you can run the Python programs upgrade-ds-2.1-to-2.4.py or upgrade-ds-1.2-to-2.4.py to update these configurations.
Note
|
These programs require Python 2.7 and the kazoo package; the 1.2-to-2.4 script also requires the requests package (see Requirements above).
The Python virtualenv tool can be used to install the correct Python version and required packages. |
Set environment variable "FUSION_HOME" to the full path of the FUSION-NEW directory, e.g.:
> export FUSION_HOME=/Users/demo/test_upgrade/fusion_2_4_1
Run the appropriate program with the argument "--datasources all".
If your current Fusion version is 1.2, run:
> python upgrade-ds-1.2-to-2.4.py --datasources all
If your current Fusion is version 2, run:
> python upgrade-ds-2.1-to-2.4.py --datasources all
If a datasource does not have a valid implementation, the script prints a log message to the console and continues with the next datasource.
Rewrite stored password information used by Fusion datasources and pipelines
Once you have migrated all Fusion configurations to the FUSION-NEW ZooKeeper service, you must update the migrated datasource configurations by running the script download_upload_ds_pipelines.py against the FUSION-NEW ZooKeeper in order to rewrite any stored passwords that are specified as part of the configuration for a datasource or pipeline.
Note
|
The script download_upload_ds_pipelines.py requires Python 2.7 and the kazoo and requests packages (see Requirements above). The Python virtualenv tool can be used to install the correct Python version and required packages. |
The rewrite process consists of a download step which exports the ZooKeeper configuration information and an upload step which rewrites the information and then imports it back into ZooKeeper.
This script uses the following arguments and values:
- "--zk-connect": the ZooKeeper server:port for FUSION-NEW
- "--action": either "download" or "upload"
- "--fusion-url": URL of the Fusion API service to upload configurations to
- "--fusion-username": name of a Fusion user with admin privileges; the script will prompt for this user’s password
Download configurations from ZooKeeper
No services for FUSION-NEW should be running, except for ZooKeeper. If your Fusion installation uses an external ZooKeeper, then this must be running. If your Fusion installation uses an embedded ZooKeeper, then you must have copied the ZooKeeper data from FUSION-CURRENT to FUSION-NEW.
Start the ZooKeeper service:
> FUSION-NEW/bin/zookeeper start
Run the script to download the configurations.
> python FUSION-UPGRADE-SCRIPTS/bin/download_upload_ds_pipelines.py \
    --zk-connect localhost:9983 --action download
To check your work, verify that the directory "fusion_upgrade_2.1" was created and that it contains definitions for all datasources and pipelines. Do not remove this directory until you have successfully completed the upload step.
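For example, a quick listing shows what was downloaded (the directory is created where you ran the script; the exact file layout may differ by release):
> ls fusion_upgrade_2.1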
If you are running embedded ZooKeeper, shut it down again:
> FUSION-NEW/bin/zookeeper stop
Upload configurations to the Fusion API service
Start FUSION-NEW:
> FUSION-NEW/bin/fusion start
Once it is running, run the script in upload mode to propagate the configurations in directory "fusion_upgrade_2.1".
At this point in the migration process, the FUSION-NEW ZooKeeper information is the same as the FUSION-CURRENT ZooKeeper information; therefore the password for the admin user is the same.
To upload data to the Fusion API services, you must supply the admin username and password as arguments to the script:
- "--fusion-username": name of a Fusion user with admin privileges
- "--fusion-password": password for that Fusion user
> FUSION-NEW/bin/fusion start
> python FUSION-UPGRADE-SCRIPTS/bin/download_upload_ds_pipelines.py \
    --zk-connect localhost:9983 --action upload --fusion-url http://localhost:8764/api \
    --fusion-username <admin>
Copy and convert the crawldb
The Fusion "crawldb" records the results of running datasource jobs. This information must be copied from FUSION-CURRENT to FUSION-NEW and the data format must be converted to the format used in Fusion 2.1 via the conversion utility com.lucidworks.fusion-crawldb-migrator-0.1.1.jar.
Copy the Fusion "crawldb" directory:
> cp -R FUSION-CURRENT/data/connectors/crawldb FUSION-NEW/data/connectors/
The crawldb data format changed in Fusion 2.1; therefore, to upgrade, the crawldb data must be converted with the conversion utility com.lucidworks.fusion-crawldb-migrator-0.1.1.jar.
The anda-v1-to-v2
command allows Fusion 1.2.x connector DBs to be updated to the new v2.x format.
It requires:
- A Fusion pre-2.1 installation (FUSION-CURRENT)
- A Fusion 2.1 or later installation (FUSION-NEW)
- All FUSION-CURRENT datasource configurations must have been migrated to FUSION-NEW (see Migrate ZooKeeper data)
- All FUSION-CURRENT crawldb files must have been copied over to the FUSION-NEW deployment
- If the FUSION-NEW installation is not currently running, start it:
> FUSION-NEW/bin/fusion start
The anda-v1-to-v2 command takes the following arguments:
- path-to-FUSION-CURRENT
- path-to-FUSION-NEW
- the -z flag, which specifies the ZooKeeper server:port for FUSION-NEW
The command to run this utility from the INSTALL-DIR is:
> java -jar FUSION-UPGRADE-SCRIPTS/bin/com.lucidworks.fusion-crawldb-migrator-0.1.1.jar anda-v1-v2 fusion fusion-new -z localhost:9983
Once the task successfully completes, the last few lines of logging show the output directory of the new DB files.
The output must be copied over to FUSION-NEW.
To do this, remove the existing lucid.anda
db directories, then
copy the new lucid.anda
directories generated from this utility into that same location:
> rm -Rf FUSION-NEW/data/connectors/crawldb/lucid.anda/*
> mv ${path-printed-from-command-output} FUSION-NEW/data/connectors/crawldb/lucid.anda/
This completes the upgrade process.
Troubleshooting the upgrade
- Clear your browser cache after starting the UI in the new Fusion instance.
- The Fusion 2.4 Index Pipeline Simulator can be used to verify that the existing set of datasource configurations works as expected.