
Configuration files

Fusion configuration files are stored in fusion/4.2.x/conf/ (on Unix or macOS) or fusion\4.2.x\conf\ (on Windows). The contents of this directory are as follows:
fusion.properties: Fusion's main configuration file, which defines the common environment variables used by the Fusion run scripts.

Many of the values in this file can be set using environment variables, enabling you to set them using systemd, Docker, and so on. Default values are also provided. For example, in api.port = ${API_PORT:-8765}, the value is 8765 unless API_PORT is defined.

NOTE: ZOOKEEPER_PORT cannot be used as an environment variable, and the value of zookeeper.port in fusion.properties must be the same as the value of clientPort in conf/zookeeper/zoo.cfg.
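For example, a minimal sketch of overriding a port from the shell before starting Fusion (the restart commands assume the default bin/fusion run script; check fusion.properties for the exact ${VAR:-default} key of any other setting you want to override):

export API_PORT=9765   # api.port = ${API_PORT:-8765} now resolves to 9765
bin/fusion stop
bin/fusion start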

zookeeper/commons-logging.properties, zookeeper/zoo.cfg: ZooKeeper configuration files.
agent-log4j2.xml, api-log4j2.xml, connectors-log4j2.xml, solr-log4j2.xml, spark-driver-log4j2.xml, spark-master-agent-log4j2.xml, spark-master-log4j2.xml, spark-worker-agent-log4j2.xml, spark-worker-log4j2.xml, sql-agent-log4j2.xml, sql-log4j2.xml, ui-log4j2.xml, zk-log4j2.xml, zookeeper/log4j2.xml, zookeeper/log4j.properties: Logging configuration files.

Fusion uses the Apache Log4j 2 logging framework with Jetty. Log levels, frequencies, and log rotation policy can be configured by changing these configuration files. See the Log4j2 Configuration guide.
hive-site.xml: Configuration for Fusion to Import Data with Hive.
Fusion ships with a Serializer/Deserializer (SerDe) for Hive, included in the distribution as lucidworks-hive-serde-v2.2.6.jar in $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs.
For Fusion 4.1.x and 4.2.x, the preferred method of importing data with Hive is to use the Parallel Bulk Loader. The import procedure does not apply to Fusion 5.x.

Prerequisites

This section covers prerequisites and background knowledge needed to help you understand the structure of this document and how the Fusion installation process works with Kubernetes.

Release Name and Namespace

Before installing Fusion, you need to choose a Kubernetes namespace (https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/) to install Fusion into. Think of a K8s namespace as a virtual cluster within a physical cluster. You can install multiple instances of Fusion in the same cluster in separate namespaces. However, do not install more than one Fusion release in the same namespace.
All Fusion services must run in the same namespace; do not try to split a Fusion cluster across multiple namespaces.
Use a short name for the namespace, containing only letters, digits, or dashes (no dots or underscores). The setup scripts in this repo use the namespace for the Helm release name by default.

Install Helm

Helm is a package manager for Kubernetes that helps you install and manage applications on your Kubernetes cluster. Regardless of which Kubernetes platform you're using, you must install Helm, as it is required to install Fusion on any K8s platform. On macOS, you can run:
brew install kubernetes-helm
If you already have helm installed, make sure you’re using the latest version:
brew upgrade kubernetes-helm
For other operating systems, refer to the Helm installation docs: https://helm.sh/docs/using_helm/. The Fusion Helm chart requires Helm greater than version 3.0.0; check your Helm version by running helm version --short.
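On Linux, one common route (described in the Helm installation docs above) is the upstream project's installer script; the URL below belongs to the Helm project and may change, so treat it as illustrative:

curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version --short   # should report a v3.x client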

Helm User Permissions

If you require that Fusion be installed by a user with minimal permissions, instead of an admin user, the role and cluster role that must be assigned to that user within the namespace where you wish to install Fusion are documented in the install-roles directory.
When working with Kubernetes on the command-line, it’s useful to create a shell alias for kubectl, e.g.:
alias k=kubectl
To use these roles in a cluster, first, as an admin user, create the namespace that you wish to install Fusion into:
k create namespace fusion-namespace
Apply the role.yaml and cluster-role.yaml files to that namespace:
k apply -f cluster-role.yaml
k config set-context --current --namespace=$NAMESPACE
k apply -f role.yaml
Then bind the role and cluster role to the install user by creating a RoleBinding and a ClusterRoleBinding:
k create --namespace fusion-namespace rolebinding fusion-install-rolebinding --role fusion-installer --user <install_user>
k create clusterrolebinding fusion-install-rolebinding --clusterrole fusion-installer --user <install_user>
You will then be able to run the helm install command as the <install_user>.
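As a rough sketch of that step (the chart repository URL, release name, and values file below are assumptions; substitute the values produced by your platform-specific setup script):

helm repo add lucidworks https://charts.lucidworks.com
helm repo update
helm install fusion-namespace lucidworks/fusion --namespace fusion-namespace --values values.yaml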

Clone fusion-cloud-native from GitHub

You should clone this repo from GitHub, as you'll need to run the scripts on your local workstation:
git clone https://github.com/lucidworks/fusion-cloud-native.git
You should get into the habit of pulling this repo for the latest changes before performing any maintenance operations on your Fusion cluster to ensure you have the latest updates to the scripts.
cd fusion-cloud-native
git pull
Cloning the GitHub repo is preferred so that you can pull in updates to the scripts, but if you are not a git user, you can download the project instead: https://github.com/lucidworks/fusion-cloud-native/archive/master.zip. Once downloaded, extract the zip and cd into the fusion-cloud-native-master directory.
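For example, assuming curl and unzip are available:

curl -L -o fusion-cloud-native.zip https://github.com/lucidworks/fusion-cloud-native/archive/master.zip
unzip fusion-cloud-native.zip
cd fusion-cloud-native-master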


Hive SerDe

This project includes tools to build a Hive SerDe to index Hive tables to Solr.

Features

  • Index Hive table data to Solr.
  • Read Solr index data to a Hive table.
  • Kerberos support for securing communication between Hive and Solr.
  • As of v2.2.4 of the SerDe, integration with Lucidworks Fusion (http://lucidworks.com/products) is supported:
    • Fusion's index pipelines can be used to index data to Fusion.
    • Fusion's query pipelines can be used to query Fusion's Solr instance for data to insert into a Hive table.
This version of hive-solr supports Hive 3.0.0. For support for Hive 1.x, see the hive_1x branch.
hive-solr should only be used with Solr 5.0 and higher.

Build the SerDe Jar

This project has a dependency on the solr-hadoop-common submodule (contained in a separate GitHub repository, https://github.com/lucidworks/solr-hadoop-common). This submodule must be initialized before building the SerDe .jar.
You must use Java 8 or higher to build the .jar.
To initialize the submodule, pull this repo, then:
git submodule init
Once the submodule has been initialized, the command git submodule update will fetch all the data from that project and check out the appropriate commit listed in the superproject. You must initialize and update the submodule before attempting to build the SerDe jar.
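For a fresh clone, the two steps can also be combined (standard git behavior):

git submodule update --init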
  • If a build is happening from a branch, make sure that solr-hadoop-common is pointing to the correct SHA. (See https://github.com/blog/2104-working-with-submodules for more details.)
    hive-solr $ git checkout <branch-name>
    hive-solr $ cd solr-hadoop-common
    hive-solr/solr-hadoop-common $ git checkout <SHA>
    hive-solr/solr-hadoop-common $ cd ..
    
The build uses Gradle. However, you do not need to have Gradle installed before attempting to build. To build the jar files, run this command:
./gradlew clean shadowJar --info
This will build a single .jar file, solr-hive-serde/build/libs/{packageUser}-hive-serde-{connectorVersion}.jar, which can be used with Hive v3.0. Other Hive versions (such as v2.x) may work with this jar, but have not been tested.

Troubleshooting Clone Issues

If GitHub and SSH are not configured, the following error will be thrown:
    Cloning into 'solr-hadoop-common'...
    Permission denied (publickey).
    fatal: Could not read from remote repository.  
    Please make sure you have the correct access rights
    and the repository exists.
    fatal: clone of 'git@github.com:LucidWorks/solr-hadoop-common.git' into submodule path 'solr-hadoop-common' failed
To fix this error, follow the tutorial at https://help.github.com/articles/generating-an-ssh-key/.
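If you prefer not to set up SSH keys, a possible workaround (an assumption, not part of the project docs) is to point the submodule at the HTTPS URL before updating:

git config submodule.solr-hadoop-common.url https://github.com/lucidworks/solr-hadoop-common.git
git submodule update --init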

Add the SerDe Jar to Hive Classpath

In order for the Hive SerDe to work with Solr, the SerDe jar must be added to Hive's classpath using the hive.aux.jars.path capability. There are several options for this, described below. It's considered a best practice to use a single directory for all auxiliary jars you may want to add to Hive, so you only need to define a single path. However, you must then copy any jars you want to use to that path.
The following options all assume you have created such a directory at /usr/hive/auxlib; if you use another path, update the path in the examples accordingly.
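For example, a minimal sketch (the jar name pattern matches the build output described earlier; adjust paths for your environment):

sudo mkdir -p /usr/hive/auxlib
sudo cp solr-hive-serde/build/libs/*-hive-serde-*.jar /usr/hive/auxlib/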
If you use Hive with Ambari (as with the Hortonworks HDP distribution), go to Hive > Configs > Advanced, and scroll down to Advanced hive-env > hive-env template. Find the section where HIVE_AUX_JARS_PATH is defined, and add the path to each line that starts with export. The result will look like this:
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then
  if [ -f "${HIVE_AUX_JARS_PATH}" ]; then
    export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH},/usr/hive/auxlib
  elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
    export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar,/usr/hive/auxlib
  fi
elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
  export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar,/usr/hive/auxlib
fi
If you are not using Ambari or a similar cluster management tool, you can add the jar location to hive/conf/hive-site.xml:
<property>
  <name>hive.aux.jars.path</name>
  <value>/usr/hive/auxlib</value>
</property>
Another option is to launch Hive with the path defined with the auxpath variable: hive --auxpath /usr/hive/auxlib. There are also other approaches that could be used. Keep in mind, though, that the jar must be loaded into the classpath; adding it with the ADD JAR function is not sufficient.

Indexing Data with a Hive External Table

Indexing data to Solr or Fusion requires creating a Hive external table. An external table allows the data in the table to be used (read or write) by another system or application outside of Hive.

Indexing Data to Solr

For integration with Solr, the external table allows you to have Solr read from and write to Hive. To create an external table for Solr, you can use a command similar to the one below. The available properties are described after the example.
hive> CREATE EXTERNAL TABLE solr (id string, field1_s string, field2_i int) <1>
      STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' <2>
      LOCATION '/tmp/solr' <3>
      TBLPROPERTIES('solr.server.url' = 'http://localhost:8888/solr', <4>
                    'solr.collection' = 'collection1',
                    'solr.query' = '*:*');
<1> In this example, we have created an external table named "solr", and defined a set of fields and types for the data we will store in the table. See the section <<Defining Fields for Solr>> below for best practices when naming fields.
<2> This defines a custom storage handler (STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'), which is one of the classes included with the Hive SerDe jar.
<3> The LOCATION indicates the location in HDFS where the table data will be stored. In this example, we have chosen to use /tmp/solr.
<4> In the TBLPROPERTIES section, we define several parameters for Solr so the data can be indexed to the right Solr installation and collection. See the section <<Table Properties>> below for details about these parameters.

If the table needs to be dropped at a later time, you can use the DROP TABLE command in Hive. This will remove the metadata stored in the table in Hive, but will not modify the underlying data (in this case, the Solr index).

Defining Fields for Solr

When defining the field names for the Hive table, keep in mind that the field types used to define the table in Hive are not sent to Solr when indexing data from a Hive table. The field names are sent, but not the field types. The field types must match or be compatible, however, for queries to complete properly.

The reason why this might be a problem is a Solr feature called schema guessing. This is Solr's default mode, and when it is enabled Solr looks at incoming data and makes a best guess at the field type. It can happen that Solr's guess at the correct type is different from the type defined in Hive, and if this happens you will get a ClassCastException in response to queries.

To avoid this problem, you can use a Solr feature called dynamic fields. Dynamic fields direct Solr to use specific field types based on a prefix or suffix found on an incoming field name, which overrides Solr's guessing at the type. Solr includes by default dynamic field rules for nearly all types it supports, so you only need to use the same suffix on your field names in your Hive tables for the correct type to be defined.

To illustrate this, note the field names in the table example above:

CREATE EXTERNAL TABLE solr (id string, field1_s string, field2_i int)

In this example, we have defined the id field as a string, field1_s as a string, and field2_i as an integer. In Solr's default schema, there is a dynamic field rule that any field with an _s suffix should be a string. Similarly, there is another rule that any field with _i as a suffix should be an integer. This allows us to make sure the field types match.

An alternative is to disable Solr's field guessing altogether, but this would require you to create all of your fields in Solr before indexing any content from Hive.

For more information about these features and options, see the relevant sections of the Apache Solr Reference Guide.

Table Properties

The following parameters can be set when defining the table properties:

solr.zkhost: The location of the ZooKeeper quorum if using LucidWorks in SolrCloud mode. If this property is set along with the solr.server.url property, the solr.server.url property will take precedence.

solr.server.url: The location of the Solr instance if not using LucidWorks in SolrCloud mode. If this property is set along with the solr.zkhost property, this property will take precedence.

solr.collection: The Solr collection for this table. If not defined, an exception will be thrown.

solr.query: The specific Solr query to execute to read this table. If not defined, a default of *:* will be used. This property is not needed when loading data to a table, but is needed when defining the table so Hive can later read the table.

lww.commit.on.close: If true, inserts will be automatically committed when the connection is closed. True is the default.

lww.jaas.file: Used only when indexing to or reading from a Solr cluster secured with Kerberos. This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Solr and Hive. The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:
Client { --<1>
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/data/solr-indexer.keytab" --<2>
  storeKey=true
  useTicketCache=true
  debug=true
  principal="solr-indexer@SOLRSERVER.COM"; --<3>
};
<1> The name of this section of the JAAS file. This name will be used with the lww.jaas.appname parameter.
<2> The location of the keytab file.
<3> The service principal name. This should be a different principal than the one used for Solr, but must have access to both Solr and Hive.

lww.jaas.appname: Used only when indexing to or reading from a Solr cluster secured with Kerberos. This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
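As an illustration of the copy requirement (hostnames and the /data/jaas-client.conf path are placeholders):

for host in nodemanager1 nodemanager2; do
  scp /data/jaas-client.conf "$host":/data/jaas-client.conf
done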

Indexing Data to Fusion

If you use Lucidworks Fusion, you can index data from Hive to Solr via Fusion’s index pipelines. These pipelines allow you several options for further transforming your data.
If you are using Fusion v3.0.x, you already have the Hive SerDe in Fusion's ./apps/connectors/resources/lucid.hadoop/jobs directory. The SerDe jar that supports Fusion is v2.2.4 or higher; this was released with Fusion 3.0. If you are using Fusion 3.1.x and higher, you will need to download the Hive SerDe from http://lucidworks.com/connectors/. Choose the proper Hadoop distribution and the resulting .zip file will include the Hive SerDe. A 2.2.4 or higher jar built from this repository will also work with Fusion 2.4.x releases.
This is an example Hive command to create an external table to index documents in Fusion and to query the table later.
hive> CREATE EXTERNAL TABLE fusion (id string, field1_s string, field2_i int)
      STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'
      LOCATION '/tmp/fusion'
      TBLPROPERTIES('fusion.endpoints' = 'http://localhost:8764/api/apollo/index-pipelines/<pipeline>/collections/<collection>/index',
                    'fusion.fail.on.error' = 'false',
                    'fusion.buffer.timeoutms' = '1000',
                    'fusion.batchSize' = '500',
                    'fusion.realm' = 'KERBEROS',
                    'fusion.user' = 'fusion-indexer@FUSIONSERVER.COM',
                    'java.security.auth.login.config' = '/path/to/JAAS/file',
                    'fusion.jaas.appname' = 'FusionClient',
                    'fusion.query.endpoints' = 'http://localhost:8764/api/apollo/query-pipelines/pipeline-id/collections/collection-id',
                    'fusion.query' = '*:*');
In this example, we have created an external table named "fusion", and defined a custom storage handler (STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'), which is a class included with the Hive SerDe jar and designed for use with Fusion.

Note that all of the same caveats about field types discussed in the section <<Defining Fields for Solr>> apply to Fusion as well. In Fusion, however, you have the option of using an index pipeline to perform specific field mapping instead of using dynamic fields.

The LOCATION indicates the location in HDFS where the table data will be stored. In this example, we have chosen to use /tmp/fusion.

In the TBLPROPERTIES section, we define several properties for Fusion so the data can be indexed to the right Fusion installation and collection:

fusion.endpoints: The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection data will be indexed to.

fusion.fail.on.error: If true, when an error is encountered, such as if a row could not be parsed, indexing will stop. This is false by default.

fusion.buffer.timeoutms: The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents will be sent to Fusion when either this value or fusion.batchSize is met.

fusion.batchSize: The number of documents to batch before sending the batch to Fusion. The default is 500. Documents will be sent to Fusion when either this value or fusion.buffer.timeoutms is met.

fusion.realm: This is used with fusion.user and fusion.password to authenticate to Fusion for indexing data. Two options are supported, KERBEROS or NATIVE. Kerberos authentication is supported with the additional definition of a JAAS file. The properties java.security.auth.login.config and fusion.jaas.appname are used to define the location of the JAAS file and the section of the file to use. Native authentication uses a Fusion-defined username and password. This user must exist in Fusion, and have the proper permissions to index documents.

fusion.user: The Fusion username or Kerberos principal to use for authentication to Fusion. If a Fusion username is used ('fusion.realm' = 'NATIVE'), the fusion.password must also be supplied.

fusion.password: This property is not shown in the example above. The password for the fusion.user when the fusion.realm is NATIVE.

java.security.auth.login.config: This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Fusion and Hive. The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:
Client { --<1>
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/data/fusion-indexer.keytab" --<2>
  storeKey=true
  useTicketCache=true
  debug=true
  principal="fusion-indexer@FUSIONSERVER.COM"; --<3>
};
<1> The name of this section of the JAAS file. This name will be used with the fusion.jaas.appname parameter.
<2> The location of the keytab file.
<3> The service principal name. This should be a different principal than the one used for Fusion, but must have access to both Fusion and Hive. This name is used with the fusion.user parameter described above.

fusion.jaas.appname: Used only when indexing to or reading from Fusion when it is secured with Kerberos. This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.

fusion.query.endpoints: The full URL to a query pipeline in Fusion. The URL should include the pipeline name and the collection data will be read from. You should also specify the request handler to be used. If you do not intend to query your Fusion data from Hive, you can skip this parameter.

fusion.query: The query to run in Fusion to select records to be read into Hive. This is *:* by default, which selects all records in the index. If you do not intend to query your Fusion data from Hive, you can skip this parameter.
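Before creating the table, it can help to confirm the Fusion endpoints respond. This is an optional check, not part of the SerDe; the credentials and localhost URL below are placeholders:

curl -u admin:password 'http://localhost:8764/api/apollo/index-pipelines'
curl -u admin:password 'http://localhost:8764/api/apollo/query-pipelines'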

Query and Insert Data to Hive

Once the table is configured, any syntactically correct Hive query will be able to query the index. For example, to select three fields named "id", "field1_s", and "field2_i" from the "solr" table, you would use a query such as:
hive> SELECT id, field1_s, field2_i FROM solr;
Replace the table name as appropriate to use this example with your data. To join data from tables, you can make a request such as:
hive> SELECT l.id, l.field1_s, l.field2_i FROM solr l
      JOIN sometable r
      ON l.id = r.id;
And finally, to insert data to a table, simply use the Solr table as the target for the Hive INSERT statement, such as:
hive> INSERT INTO solr
      SELECT id, field1_s, field2_i FROM sometable;

Example Indexing Hive to Solr

Solr includes a small number of sample documents for use when getting started. One of these is a CSV file containing book metadata. This file is found in your Solr installation, at $SOLR_HOME/example/exampledocs/books.csv. Using the sample books.csv file, we can see a detailed example of creating a table, loading data to it, and indexing that data to Solr.
CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; --<1>

LOAD DATA LOCAL INPATH '/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books; --<2>

CREATE EXTERNAL TABLE solr (id STRING, cat_s STRING, title_s STRING, price_f FLOAT, in_stock_b BOOLEAN, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) --<3>
     STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' --<4>
     LOCATION '/tmp/solr' --<5>
     TBLPROPERTIES('solr.zkhost' = 'zknode1:2181,zknode2:2181,zknode3:2181/solr',
                   'solr.collection' = 'gettingstarted',
                   'solr.query' = '*:*', --<6>
                   'lww.jaas.file' = '/data/jaas-client.conf'); --<7>

INSERT OVERWRITE TABLE solr SELECT b.* FROM books b;
<1> Define the table books, and provide the field names and field types that will make up the table.
<2> Load the data from the books.csv file.
<3> Create an external table named solr, and provide the field names and field types that will make up the table. These will be the same field names as in your local Hive table, so we can index all of the same data to Solr.
<4> Define the custom storage handler provided by the {packageUser}-hive-serde-{connectorVersion}.jar.
<5> Define the storage location in HDFS.
<6> The query to run in Solr to read records from Solr for use in Hive.
<7> Define the location of Solr (or ZooKeeper if using SolrCloud), the collection in Solr to index the data to, and the query to use when reading the table. This example also refers to a JAAS configuration file that will be used to authenticate to the Kerberized Solr cluster.

Port configuration

Fusion services run in their own JVM and listen for requests on a number of ports. Environment variables, set in a common configuration file, are used to specify the port a service uses. To change the port(s) a service uses, you must change the settings in the configuration file.

Default ports

This table lists the default port numbers used by Fusion processes. Port settings are defined in the fusion.properties file in fusion/4.2.x/conf/ (on Unix or macOS) or fusion\4.2.x\conf\ (on Windows).
Port         Service
8091         Fusion agent
8763         Fusion UI service (use port 8764 to access the Fusion UI)
8764         Fusion proxy. This service includes the Fusion Authorization Proxy.
8765         Fusion API Services
8766         Spark Master
8769         Spark Worker
8771         Connectors RPC Service. This service can distribute connector jobs to as many Fusion nodes as you want. It uses HTTP/2 and has an SDK that you can use to build your own connectors.
8780         Web Apps. This service delivers the UIs of Fusion apps.
8781         Log shipper. Monitoring port that the agent uses to check the health of the log shipper process. This port does not need to be accessible from other nodes.
8983         Solr. This is the embedded Solr instance included in the Fusion distribution.
8984         Connectors Classic Service. This service runs nondistributed connector jobs. It uses HTTP/1.1 and has no SDK.
9983         ZooKeeper. The embedded ZooKeeper used by Fusion services.
47100-48099  Apache Ignite TCP communication port range (used by the API, Connectors Classic, Connectors RPC, and Proxy services)
48100-48199  Apache Ignite shared memory port range (used by the API, Connectors Classic, Connectors RPC, and Proxy services)
49200-49299  Apache Ignite discovery port range (used by the API, Connectors Classic, Connectors RPC, and Proxy services)

Important: The ZooKeeper port is also defined in the configuration file for the embedded ZooKeeper, fusion/4.2.x/conf/zookeeper/zoo.cfg (on Unix or macOS) or fusion\4.2.x\conf\zookeeper\zoo.cfg (on Windows). Look for clientPort. If you run Fusion with the embedded ZooKeeper, remember to change the port number in both places.

Jetty ports

Jetty is used to run the Admin UI, API, Connectors Classic, Proxy, Solr, and Web Apps services. For each of these services, Jetty runs the service on the assigned port and listens on a second port for shutdown requests. Therefore, fusion.properties defines pairs of ports for components running on Jetty, such as:
api.port = ${API_PORT:-8765}
api.stopPort = 7765
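A quick way to list every port/stopPort pair currently defined (run this from the directory that contains conf/fusion.properties):

grep -E '\.(port|stopPort)' conf/fusion.properties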

Spark ports

This table lists the default port numbers used by Spark processes in Fusion.
Port number                     Process
4040                            SparkContext web UI
7337                            Shuffle port for Apache Spark worker
8767                            Spark master web UI
8770                            Spark worker web UI
8766                            Spark master listening port
8769                            Spark worker listening port
8772 (spark.driver.port)        Spark driver listening port
8788 (spark.blockManager.port)  Spark BlockManager port
If a port is not available, Spark uses the next available port by adding 1 to the assigned port number. For example, if 4040 is not available, Spark uses 4041 (if available, or 4042, and so forth). Ensure that the ports in the above table are accessible, as well as a range of up to 16 subsequent ports. For example, open ports 8772 through 8787, and 8788 through 8804, because a single node can have more than one Spark driver and Spark BlockManager.
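For instance, on a host protected by ufw the ranges could be opened like this (a sketch only; adapt to whatever firewall or cloud security rules you actually use):

sudo ufw allow 8766/tcp        # Spark master
sudo ufw allow 8769/tcp        # Spark worker
sudo ufw allow 8772:8787/tcp   # spark.driver.port plus subsequent ports
sudo ufw allow 8788:8804/tcp   # spark.blockManager.port plus subsequent ports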

Directories

The directory where the Fusion files go for a specific version of Fusion is the Fusion home directory. The Fusion home directory is a version-numbered directory (for example, 4.2.2) below the directory fusion. This installation strategy lets you install multiple versions of Fusion and switch between them. The directories found in the Fusion home directory fusion/4.2.x/ (on Unix or macOS) or fusion\4.2.x\ (on Windows) are:
Name      Description
apps      Fusion components and 3rd-party distributions used by Fusion, including jar files and plugins
bin       Master script to run Fusion, and per-component run scripts
conf      Configuration files for Fusion and ZooKeeper that contain parameter settings tuned for common use cases
data      Default location of data stores used by Fusion apps
docs      License information
examples  Fusion signals example
init      systemd and upstart scripts and configurations for Linux
scripts   Developer utilities, including diagnostic scripts, for Linux and Windows. See scripts/diag/linux/README and scripts/diag/win64/README.txt for details.
var       Log files and system files created by Fusion components, as well as .pid files for each running process
To simplify access to the latest version of Fusion and to files in the bin, conf, and var directories, Fusion creates a symbolic link latest to the latest version and symbolic links bin, conf, and var to latest/bin, latest/conf, and latest/var respectively. For example, if latest is 4.2.1, then instead of entering this command to change to the bin directory:
cd /path/to/fusion/4.2.1/bin
You could just type:
cd /path/to/fusion/latest/bin
To avoid possible confusion in the documentation, we spell out the path below the fusion directory. From the fusion directory, you can view the symbolic links by typing:
find . -maxdepth 1 -type l -ls
To change the version of Fusion to which the symbolic links refer, unlink and relink latest. For example:
cd /path/to/fusion
unlink latest
ln -s 4.2.2 latest

Log files

Log files are found in directories under fusion/4.2.x/var/log/ (on Unix or macOS) or fusion\4.2.x\var\log\ (on Windows). Because the Fusion components run in separate JVMs, each component has its own set of log files and files that monitor all garbage-collection events for that process.
Name               Description
admin-ui, webapps  Fusion UI messages. Messages are logged to jetty-<date>.stderrout.log.
agent              Fusion agent logging and error messages
api                Fusion REST API services logging and error messages. This log shows the result of service requests submitted to the REST API directly via HTTP and indirectly via the Fusion UI.
connectors         Fusion connector services logging and error messages. Fusion index pipeline logging stages write to this file.
log-shipper        See Configure Fusion logging
proxy              Messages from the Fusion proxy, responsible for authentication and HTTP load balancing
solr               Messages from Solr
spark-master       Spark-master logs
spark-worker       Spark-worker logs
sql                SQL logs
zookeeper          ZooKeeper messages
Every component logs all messages to a log file named <component>.log. For example, the full path to the log file for the connectors services is fusion/4.2.x/var/log/connectors/connectors.log (on Unix or macOS) or fusion\4.2.x\var\log\connectors\connectors.log (on Windows).

In addition to component log files, every component maintains a set of garbage-collection log files that are used for resource tuning. The garbage-collection log files are named gc_<YYYYMMDD>_<PID>.log.<CT>. The current garbage-collection log file has the suffix .current.

The Fusion API, Fusion UI, Connectors Classic, Proxy, Web Apps, and Solr services all run inside a Jetty server. The Jetty server logs are also written to each component's log file directory. The Jetty server logs are named:
  • jetty-YYYY_MM_DD.request.log
  • jetty-YYYY_MM_DD.stderrout.log
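For example, to follow the API service's log and locate its current garbage-collection log (paths follow the naming conventions above; run from the Fusion home directory):

tail -f var/log/api/api.log
ls var/log/api/gc_*.log.current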