Compatible with Fusion version: 4.0.0 through 4.2.6
Deprecation and removal notice
This connector is deprecated as of Fusion 4.2 and is removed or expected to be removed as of Fusion 5.0. For more information about deprecations and removals, including possible alternatives, see Deprecations and Removals.
The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with Apache Hadoop v2.x.

Configure the Hadoop Client

The connector services must be able to access the Hadoop client in the file $HADOOP_HOME/bin/hadoop, so the client must either be installed on one of the nodes of the Hadoop cluster (such as the NameNode), or a client supported by your specific distribution must be installed on the same server as the connectors. The Hadoop client must be configured to access the Hadoop cluster so the crawler can reach the cluster for content processing.

Instructions for setting up any of the supported Hadoop distributions are beyond the scope of this document. We recommend reading one of the many tutorials found online or one of the books on Hadoop.
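One quick way to confirm that the client on the connector host is configured correctly is to list an HDFS directory through the standard Hadoop FileSystem API. The sketch below is only an illustration, not part of the connector: the fs.defaultFS address is a placeholder, and in most installations that value is picked up from the client's core-site.xml rather than set in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal connectivity check: lists the HDFS root directory using the
 * Hadoop client configuration found on this machine (core-site.xml and
 * hdfs-site.xml on the classpath).
 */
public class HdfsClientCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally this comes from core-site.xml,
        // in which case the set() call can be omitted.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath() + "  owner=" + status.getOwner()
                        + "  permissions=" + status.getPermission());
            }
        }
    }
}

The equivalent check from the command line is hadoop fs -ls /, run as the same user that will start Fusion.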
This connector writes to the hadoop.tmp.dir and the /tmp directory in HDFS, so Fusion should be started by a user who has read/write permissions for both.

Permission issues
With any flavor of Hadoop, you need to be aware of how Hadoop and systems based on it (such as CDH, MapR, and so on) handle permissions for services that communicate with other nodes.

Hadoop services execute under specific user credentials: a quadruplet consisting of user name, group name, numeric user ID, and numeric group ID. Installations that follow the manual usually use user ‘mapr’ and group ‘mapr’ (or similar), with numeric IDs assigned by the operating system (for example, uid=1000, gid=20). When the system is configured to enforce user permissions (the default in some systems), any client that connects to Hadoop services has to use a quadruplet that exists on the server. This means that ALL values in the quadruplet must be equal between the client and the server, that is, an account with the same user, group, uid, and gid must exist on both the client and server machines.

When a client attempts to access a resource on Hadoop filesystems (or the JobTracker, which also uses this authentication method), it sends its credentials, which are looked up on the server. If an exactly matching record is found, those local permissions are used to determine read/write access. If no such account is found, the user is treated as “other” in the sense of the permission model.

This means that the crawlers for the HDFS data source should be able to crawl Hadoop or MapR filesystems without any authentication, as long as read (and, for directories, execute) access is granted to “other” users on the target resources. Authenticated users will be able to access resources owned by their equivalent account.

However, the Hadoop crawling described on this page requires write access to a /tmp directory to use as a working directory. In many cases, this directory does not exist, or if it does, it does not grant write access to “other” (unauthenticated) users. Users of these data sources should therefore make sure that there is a /tmp directory on the target filesystem that is writable with their local user credentials, be it as a recognized user, group, or “other”. If a local user is recognized by the server, it is enough to create a /tmp directory owned by that user. If there is no such user, the /tmp directory must be modified to have write permissions for “other” users. The working directory can also be changed to another directory with the correct permissions that can be used for temporary working storage.
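As an illustration of the last point, the sketch below uses the standard Hadoop FileSystem API to create /tmp on the target filesystem with world-writable, sticky-bit permissions (mode 1777), the usual mode for an HDFS /tmp. This is a sketch under the assumption that it is run as a user allowed to create directories at the filesystem root; adjust the path if you point the working directory somewhere else.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

/**
 * Ensures a world-writable /tmp exists on the target filesystem so the
 * crawler has a usable working directory, including for "other"
 * (unauthenticated) users.
 */
public class EnsureTmpDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path tmp = new Path("/tmp");
        // Mode 1777: read/write/execute for owner, group, and other, plus the sticky bit.
        FsPermission mode = new FsPermission((short) 01777);

        try (FileSystem fs = FileSystem.get(conf)) {
            if (!fs.exists(tmp)) {
                fs.mkdirs(tmp, mode);
            }
            // mkdirs applies the configured umask, so set the permission explicitly.
            fs.setPermission(tmp, mode);
            System.out.println("/tmp is now " + fs.getFileStatus(tmp).getPermission());
        }
    }
}

The same result can be achieved from the command line with hadoop fs -mkdir /tmp followed by hadoop fs -chmod 1777 /tmp.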