Compatible with Fusion version: 4.0.0 through 4.2.6
Deprecation and removal notice

This connector is deprecated as of Fusion 4.2 and is removed, or expected to be removed, as of Fusion 5.0. For more information about deprecations and removals, including possible alternatives, see Deprecations and Removals.
How Hadoop connectors work
The Hadoop crawlers take full advantage of the scaling abilities of the MapReduce architecture and use all of the nodes available in the cluster, just like any other MapReduce job. This has significant ramifications for performance: the crawlers are designed to move a lot of content in parallel, as fast as the system's capabilities allow, from its raw state to the Fusion index. The Hadoop crawlers work in stages:

1. Create one or more SequenceFiles from the raw content. This can be done in one of two ways:
   - If the source files are available in a shared Hadoop filesystem, prepare a list of source files and their locations as a SequenceFile. The raw contents of each file are not processed until step 2.
   - If the source files are not available, prepare a list of source files and the raw content, stored as a Behemoth document. This process is currently done sequentially and can take a significant amount of time if there are many documents or the documents are very large.
2. Run a MapReduce job to extract text and metadata from the raw content using Apache Tika. This is similar to the Fusion approach of extracting content from crawled documents, except that it is done with MapReduce.
3. Run a MapReduce job to send the extracted content from HDFS to the index pipeline for further processing.
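The staged flow above can be sketched conceptually. The following Python sketch is purely illustrative — the real connector implements these stages as Java MapReduce jobs, and the function names (`build_manifest`, `extract`, `send_to_pipeline`) are invented for the sketch:

```python
# Conceptual model of the three crawl stages (illustrative only;
# the actual connector runs these as MapReduce jobs on the cluster).

def build_manifest(files, shared_fs=True):
    """Stage 1: build the SequenceFile-like manifest.

    With a shared Hadoop filesystem, only paths are recorded and the raw
    content is deferred to stage 2. Otherwise the raw bytes must be
    embedded up front (the slower, sequential Behemoth-document path).
    """
    if shared_fs:
        return [{"path": p} for p in files]                 # locations only
    return [{"path": p, "raw": b"<raw bytes>"} for p in files]

def extract(record):
    """Stage 2: Tika-style extraction of text and metadata (stubbed)."""
    return {"path": record["path"], "text": "extracted text", "meta": {}}

def send_to_pipeline(docs):
    """Stage 3: ship extracted documents to the index pipeline (stubbed)."""
    return len(docs)

manifest = build_manifest(["hdfs:///data/a.pdf", "hdfs:///data/b.doc"])
docs = [extract(r) for r in manifest]
indexed = send_to_pipeline(docs)
```

The key design point the sketch captures is that stage 1 is cheap when a shared filesystem lets the crawler defer reading raw content until the parallel extraction stage.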
Fusion login configuration file
The Fusion login config file required by the datasource configuration parameter login_config is a Java Authentication and Authorization Service (JAAS) configuration file, which must be present on every mapper/reducer node that will inject data into Fusion.
Here is a sample file that describes the structure expected by Fusion:
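The sample is not reproduced in this copy of the page, but a JAAS file of the expected shape looks like the following. The principal, keytab path, and module options here are placeholder assumptions for a typical Kerberos keytab login; the exact sample shipped with Fusion may differ:

```
// Hypothetical JAAS login configuration; adjust values for your environment.
FusionClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="/opt/fusion/conf/fusion-indexer.keytab"
  principal="fusion-indexer@EXAMPLE.COM";
};
```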
FusionClient is the application name and can be set to anything; if you change it, be sure to set the login_app_name parameter to the same value. The other parameters can be configured as required, and the keyTab value should point to the location of the keytab file on the node.
Configure the Hadoop Client
The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with Apache Hadoop v2.x.

The connector services must be able to access the Hadoop client at $HADOOP_HOME/bin/hadoop, so the client must either be installed on one of the nodes of the Hadoop cluster (such as the NameNode), or a client supported by your specific distribution must be installed on the same server as the connectors. The Hadoop client must be configured to access the Hadoop cluster so the crawler can reach it for content processing. Instructions for setting up any of the supported Hadoop distributions are beyond the scope of this document; we recommend one of the many tutorials available online or one of the books on Hadoop.

This connector writes to hadoop.tmp.dir and the /tmp directory in HDFS, so Fusion should be started by a user who has read/write permissions for both.

Permission issues
With any flavor of Hadoop, you need to be aware of the way Hadoop and Hadoop-based systems (such as CDH, MapR, etc.) handle permissions for services that communicate with other nodes.

Hadoop services execute under specific user credentials: a quadruplet consisting of user name, group name, numeric user ID, and numeric group ID. Installations that follow the manual usually use user 'mapr' and group 'mapr' (or similar), with numeric IDs assigned by the operating system (for example, uid=1000, gid=20). When the system is configured to enforce user permissions (the default in some systems), any client that connects to Hadoop services has to use a quadruplet that exists on the server. This means that ALL values in the quadruplet must be equal between the client and the server; that is, an account with the same user, group, uid, and gid must exist on both the client and server machines.

When a client attempts to access a resource on a Hadoop filesystem (or the JobTracker, which uses the same authentication method), it sends its credentials, which are looked up on the server. If an exactly matching record is found, that account's local permissions determine read/write access. If no such account is found, the user is treated as "other" in the sense of the permission model.

This means that the crawlers for the HDFS data source should be able to crawl Hadoop or MapR filesystems without any authentication, as long as read (and, for directories, execute) access is granted to "other" users on the target resources. Authenticated users will be able to access resources owned by their equivalent account.

However, the Hadoop crawling described on this page requires write access to a /tmp directory to use as a working directory. In many cases, this directory does not exist, or, if it does, it does not grant write access to "other" (unauthenticated) users.
Therefore, users of these data sources should make sure there is a /tmp directory on the target filesystem that is writable using their local user credentials, whether as a recognized user, a group member, or "other". If a local user is recognized by the server, it is enough to create a /tmp directory owned by that user. If there is no such user, the /tmp directory must be modified to grant write permissions to "other" users. Alternatively, the working directory can be changed to another directory with the correct permissions that can be used for temporary working storage.
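The all-four-fields-must-match rule described above can be modeled in a few lines. This is a simplified illustration of the described behavior, not Hadoop's actual implementation; the `Cred` type and `permission_class` helper are hypothetical names for the sketch:

```python
from typing import NamedTuple

class Cred(NamedTuple):
    """The credential quadruplet: user name, group name, uid, gid."""
    user: str
    group: str
    uid: int
    gid: int

def permission_class(client: Cred, server_accounts: set) -> str:
    """All four fields must match an account known to the server;
    any mismatch demotes the client to the 'other' permission class."""
    return "matched" if client in server_accounts else "other"

server = {Cred("mapr", "mapr", 1000, 20)}

assert permission_class(Cred("mapr", "mapr", 1000, 20), server) == "matched"
# Same names but a different numeric uid -> treated as "other":
assert permission_class(Cred("mapr", "mapr", 1001, 20), server) == "other"
```

The second assertion shows why matching user and group names alone is not enough: a client whose numeric IDs differ from the server's falls back to "other" permissions, which is exactly why the /tmp working directory may need to be writable by "other".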