Fusion Components and Deployments
The Fusion platform is designed to support enterprise search applications at any scale. Fusion can be deployed across multiple servers to store large amounts of data, to achieve high processing throughput, or both, and the set of Fusion components running on each server can be adjusted to meet these processing requirements.
The Fusion platform consists of a series of Java programs, each of which runs in its own JVM. Apache ZooKeeper provides the shared, synchronized data store for all user and application configuration information. The following diagram shows the full set of Fusion processes that run on a single server and the default ports used by each, with arrows representing the flow of HTTP requests between components for document search and indexing:
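As a rough operational sketch of this layout, the following checks whether each single-server component is listening on its port. The port numbers here are illustrative assumptions, not authoritative values; confirm them against the diagram and your deployment's configuration.

```python
import socket

# Illustrative default ports for single-server Fusion components.
# These values are assumptions for this sketch; confirm the actual
# ports against your Fusion version's documentation and config.
COMPONENT_PORTS = {
    "ui": 8764,
    "api-services": 8765,
    "solr": 8983,
    "zookeeper": 9983,
}

def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_components(host: str) -> dict:
    """Map each component name to whether its port accepts connections."""
    return {name: is_listening(host, port)
            for name, port in COMPONENT_PORTS.items()}
```

Running `check_components("localhost")` on a Fusion server reports which of the processes in the diagram are reachable.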
The inputs shown in this diagram represent:
Users working directly in the Fusion UI, whether to develop and refine search applications, view analytics dashboards, or perform system administration tasks. They interact directly with Fusion’s UI component, which relays all requests to the API Services.
Search queries originating from the search application are sent to the Fusion UI for authentication. The Fusion UI sends the request to the Fusion API services, which invoke a query pipeline to build out the raw query and send the resulting query to Solr.
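This query flow can be sketched as a single REST call. The helper below builds the request URL; the host, port, pipeline, and collection names are hypothetical placeholders, and the endpoint path is an assumption modeled on Fusion's query-pipeline API, so confirm the exact path for your Fusion version.

```python
from urllib.parse import urlencode

def pipeline_query_url(host: str, pipeline: str, collection: str, q: str,
                       port: int = 8764) -> str:
    """Build a URL for sending a search query through a query pipeline.

    The request targets the Fusion UI port (8764 is an assumed default),
    which authenticates it and relays it to the API services; the named
    query pipeline then transforms the query before it reaches Solr.
    """
    params = urlencode({"q": q, "wt": "json"})
    return (f"http://{host}:{port}/api/apollo/query-pipelines/"
            f"{pipeline}/collections/{collection}/select?{params}")
```

For example, `pipeline_query_url("localhost", "default", "products", "laptop")` yields a URL that a search application could GET with its Fusion credentials.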
Fusion datasources ingest data that will be indexed into a Solr collection. The datasource sends this raw data to Fusion’s connector services. The connector invokes an index pipeline to extract, transform, and otherwise enrich the raw data and sends the resulting document to Solr for indexing.
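The indexing flow can be sketched the same way: raw documents are POSTed to an index pipeline, which enriches them and forwards the results to Solr. The endpoint path and JSON shape below are assumptions modeled on Fusion's index-pipeline API; confirm them for your version.

```python
import json

def index_pipeline_request(host: str, pipeline: str, collection: str,
                           docs: list, port: int = 8764) -> tuple:
    """Build the (url, body) pair for pushing documents through an index pipeline.

    Each document is a dict of field names to values; the pipeline's stages
    extract, transform, and enrich the documents before they reach Solr.
    """
    url = (f"http://{host}:{port}/api/apollo/index-pipelines/"
           f"{pipeline}/collections/{collection}/index")
    body = json.dumps(docs)
    return url, body
```

A client would POST `body` to `url` with a `Content-Type: application/json` header and its Fusion credentials.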
Signal processing and aggregations are carried out by Apache Spark. The Apache Spark master distributes tasks across one or more worker processes.
Apache ZooKeeper is included in this diagram, since all Fusion processes across all servers in a Fusion deployment communicate with the ZooKeeper ensemble at the socket layer via ZooKeeper’s Java API.
The Fusion Agent is the process that starts, stops, and monitors all other Fusion components running on its server.
See these topics for details about each component:
Fusion can be deployed across multiple servers, where each server is a node in the Fusion deployment and a single ZooKeeper ensemble is used as the centralized, synchronized store for both application configurations and user access information. For applications over very large collections, or which require high throughput or high availability or both, the Fusion deployment consists of multiple servers. Every node in the deployment runs the Fusion API Services process. Beyond that, the set of processes running on a particular node depends on the processing and throughput needs of the search application.
Running Solr on all Fusion nodes scales out document storage and provides data replication. (Alternatively, external SolrCloud clusters can be used to store Fusion collections; see Integrating with existing Solr instances.)
Running Fusion Connectors on multiple nodes provides high throughput for indexing and updates, for example, in applications that run analytics over live data streams such as logfiles or mobile tracking devices.
Running the Fusion UI on two or more nodes provides failover for Fusion’s authentication proxy.
Running Apache Spark on multiple nodes provides processing power for applications which aggregate clicks and other signals or use Fusion machine learning components.
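In a multi-server deployment, each Fusion node is pointed at the same ZooKeeper ensemble through its connection string. A minimal sketch, assuming a `conf/fusion.properties` file with a `default.zk.connect` property (the file and property names vary by Fusion version; confirm against your installation):

```properties
# Hypothetical sketch: point every Fusion node at the same three-node
# ZooKeeper ensemble so application configurations and user access
# information stay centralized and synchronized across the deployment.
default.zk.connect = zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```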