Fusion Architecture

The Fusion platform is designed to support enterprise search applications at any scale. Fusion can be deployed across multiple servers to store large amounts of data, to achieve high processing throughput, or both. The set of Fusion components running on each server can be adjusted to meet these processing requirements.

Fusion Platform Component Architecture

The Fusion platform consists of a series of Java programs, each of which runs in its own JVM. Apache ZooKeeper provides the shared, synchronized data store for all user and application configuration information. The following diagram shows the full set of Fusion processes that run on a single server and the default ports used by each, with arrows representing the flow of HTTP requests between components for document search and indexing:

Search queries, which originate from the search application, are sent to the Fusion UI for authentication. The Fusion UI sends the request to the Fusion API services component, which invokes a query pipeline to build out the raw query and send the resulting query to Solr. Document indexing is carried out by Fusion datasources, which send raw data to Fusion’s connector services. The connector invokes an index pipeline to extract, transform, and otherwise enrich the raw data and sends the resulting document to Solr for indexing. Apache Spark is used for signal processing and aggregation. All Fusion processes across all servers in a Fusion deployment communicate with the ZooKeeper ensemble at the socket layer using ZooKeeper’s Java API.
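The query and indexing paths described above can both be pictured as a chain of stages, each of which transforms its input before the result reaches Solr. The following Python sketch is a toy model of that idea only. The stage functions, the field names, and the `run_pipeline` helper are hypothetical and do not correspond to Fusion's actual pipeline stages or API.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

# A stage takes a dictionary (a query or a document) and returns a new one.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(stages: List[Stage], item: Dict[str, Any]) -> Dict[str, Any]:
    """Pass the item through each stage in order and return the final result."""
    return reduce(lambda acc, stage: stage(acc), stages, item)


# Hypothetical query stages: build out the raw query before it goes to Solr.
def set_defaults(query: Dict[str, Any]) -> Dict[str, Any]:
    # Supply a default row count unless the caller already set one.
    return {"rows": 10, **query}


def lowercase_terms(query: Dict[str, Any]) -> Dict[str, Any]:
    return {**query, "q": query["q"].lower()}


# Hypothetical index stages: extract and enrich raw data from a datasource.
def extract_text(doc: Dict[str, Any]) -> Dict[str, Any]:
    return {**doc, "body": doc["raw"].strip()}


def add_length(doc: Dict[str, Any]) -> Dict[str, Any]:
    return {**doc, "length": len(doc["body"])}


query_pipeline = [set_defaults, lowercase_terms]
index_pipeline = [extract_text, add_length]

# The results stand in for what would be sent to Solr.
solr_query = run_pipeline(query_pipeline, {"q": "Enterprise Search"})
solr_doc = run_pipeline(index_pipeline, {"id": "doc-1", "raw": "  hello fusion  "})
```

In this model the order of the stages matters: `add_length` relies on the `body` field that `extract_text` produces, much as later enrichment steps in an index pipeline depend on earlier extraction steps.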

See these topics for details about each component:

Fusion Platform Deployment Architecture

For enterprise applications that consist of very large collections or that require high throughput or high availability or both, the Fusion deployment will consist of multiple servers. Each server is a Fusion node. All nodes in a Fusion deployment communicate with a common ZooKeeper cluster.

Every Fusion node in a deployment runs the Fusion API Services process. Beyond that, the set of processes running on a particular node depends on the processing and throughput needs of the search application.

  • Running Solr on all Fusion nodes scales out document storage and provides data replication. (Alternatively, external SolrCloud clusters can be used to store Fusion collections; see Integrating with existing Solr instances.)

  • Running Fusion Connectors on multiple nodes provides high throughput for indexing and updates, e.g., for applications that run analytics over live data streams such as logfile indexing or data from mobile tracking devices.

  • Running the Fusion UI on two or more nodes provides failover for Fusion’s authentication proxy.

  • Running Apache Spark on multiple nodes provides processing power for applications that aggregate clicks and other signals, or that use Fusion machine learning components.
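The placement rules above can be expressed as simple checks over a map from node to processes. The Python sketch below is illustrative only and assumes a three-node deployment. The node names, the process labels (`api`, `ui`, `solr`, `connectors`, `spark`), and the helper functions are hypothetical and are not part of any Fusion configuration format.

```python
from typing import Dict, List, Set

# Hypothetical layout: which Fusion processes run on which node.
deployment: Dict[str, Set[str]] = {
    "node1": {"api", "ui", "solr", "connectors"},
    "node2": {"api", "ui", "solr", "spark"},
    "node3": {"api", "solr", "spark"},
}


def nodes_running(layout: Dict[str, Set[str]], process: str) -> List[str]:
    """Return the sorted names of the nodes that run the given process."""
    return sorted(node for node, procs in layout.items() if process in procs)


def nodes_missing_api(layout: Dict[str, Set[str]]) -> List[str]:
    """Every Fusion node runs the API services process; list the violators."""
    return sorted(node for node, procs in layout.items() if "api" not in procs)


def has_ui_failover(layout: Dict[str, Set[str]]) -> bool:
    """Failover for the authentication proxy needs the UI on two or more nodes."""
    return len(nodes_running(layout, "ui")) >= 2


if __name__ == "__main__":
    print("Nodes missing API services:", nodes_missing_api(deployment))
    print("UI failover available:", has_ui_failover(deployment))
    print("Spark nodes:", nodes_running(deployment, "spark"))
```

A check of this kind could be run against a planned layout before provisioning servers, for example to confirm that removing a node does not leave the Fusion UI running on only one machine.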