This article describes features or functionality that are only compatible with Fusion 4.x.
How Hadoop connectors work
The Hadoop crawlers take full advantage of the scaling abilities of the MapReduce architecture and use all of the nodes available in the cluster, just like any other MapReduce job. This has significant ramifications for performance, because the crawlers are designed to move a lot of content, in parallel and as fast as the system's capabilities allow, from its raw state to the Fusion index.
The Hadoop crawlers work in stages:
- Create one or more SequenceFiles from the raw content. This can be done in one of two ways:
  - If the source files are available in a shared Hadoop filesystem, prepare a list of source files and their locations as a SequenceFile. The raw contents of each file are not processed until the next stage (text extraction).
  - If the source files are not available in a shared Hadoop filesystem, prepare a list of source files and the raw content, stored as a Behemoth document. This process currently runs sequentially and can take a significant amount of time if there are many documents or if the documents are very large.
- Run a MapReduce job to extract text and metadata from the raw content using Apache Tika. This is similar to the Fusion approach of extracting content from crawled documents, except that it is done with MapReduce.
- Run a MapReduce job to send the extracted content from HDFS to the index pipeline for further processing.
Fusion login configuration file
The Fusion login config file required by the datasource configuration parameter login_config is a Java Authentication and Authorization Service (JAAS) configuration file. It must be present on every mapper/reducer node that will inject data into Fusion.
Here is a sample file that describes the structure expected by Fusion:
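A minimal sketch of such a file, assuming Kerberos keytab authentication through the JDK's Krb5LoginModule. The keytab path and principal are placeholders chosen for illustration, not values supplied by Fusion, and the exact set of options your cluster needs may differ:

```
FusionClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/path/to/fusion.keytab"
  storeKey=true
  useTicketCache=false
  principal="fusion-user@EXAMPLE.COM";
};
```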
FusionClient is the application name and can be set to anything. If you change it, be sure to set the login_app_name parameter to the same value. The other parameters can be configured as required, and the keyTab value should point to the location on the node where the keytab file can be found.
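Because the file has to be present and consistent on every mapper/reducer node, it can help to check it before launching a job. The following is a rough sketch rather than anything shipped with Fusion: it uses the standard JDK JAAS API to parse a login config file and confirm that it contains an entry for the application name you plan to pass as login_app_name. The class name LoginConfigCheck and the method optionsFor are hypothetical names used only for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.NoSuchAlgorithmException;
import java.security.URIParameter;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Hypothetical helper, not part of Fusion: sanity-checks a JAAS login config file.
public class LoginConfigCheck {

    /**
     * Parses the JAAS file and returns the options of the first login module
     * listed under appName. Fails with an unchecked exception if the file
     * cannot be parsed or has no entry with that name.
     */
    static Map<String, ?> optionsFor(Path jaasFile, String appName) {
        Configuration config;
        try {
            // "JavaLoginConfig" is the JDK's standard file-based JAAS configuration type.
            config = Configuration.getInstance(
                    "JavaLoginConfig", new URIParameter(jaasFile.toUri()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Cannot load " + jaasFile, e);
        }
        AppConfigurationEntry[] entries = config.getAppConfigurationEntry(appName);
        if (entries == null || entries.length == 0) {
            throw new IllegalStateException(
                    "No entry named " + appName + " in " + jaasFile);
        }
        return entries[0].getOptions();
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: LoginConfigCheck <jaas-file> <login_app_name>");
            return;
        }
        Map<String, ?> options = optionsFor(Paths.get(args[0]), args[1]);
        // Print the keytab location so it can be compared with the file on this node.
        System.out.println("keyTab = " + options.get("keyTab"));
    }
}
```

Running it on a node with the path to the login config file and the application name (for example FusionClient) prints the configured keyTab value. The check only parses the file; it does not attempt a Kerberos login or verify that the keytab itself exists.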