The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with Apache Hadoop v2.x.
The connector services must be able to access the Hadoop client in file
$HADOOP_HOME/bin/hadoop, so it must either be installed on one of the nodes of the Hadoop cluster (such as the
nameNode), or a client supported by your specific distribution must be installed on the same server as the connectors. The Hadoop client must be configured to access the Hadoop cluster so the crawler is able to access the Hadoop cluster for content processing.
|Instructions for setting up any of the supported Hadoop distributions is beyond the scope of this document. We recommend reading one of the many tutorials found online or one of the books on Hadoop.|
This connector writes to the
hadoop.tmp.dir and the
/tmp directory in HDFS, so Fusion should be started by a user who has read/write permissions for both.
Using any flavor of Hadoop, you will need to be aware of the way Hadoop and systems based on Hadoop (such as CDH, MapR, etc.) handle permissions for services that communicate with other nodes.
Hadoop services execute under specific user credentials: a quadruplet consisting of user name, group name, numeric user id, numeric group id. Installations that follow the manual usually use user 'mapr' and group 'mapr' (or similar), with numeric ids assigned by the operating system (e.g., uid=1000, gid=20). When the system is configured to enforce user permissions (which is the default in some systems), any client that connects to Hadoop services has to use a quadruplet that exists on the server. This means that ALL values in this quadruplet must be equal between the client and the server, i.e., an account with the same user, group, uid, and gid must exist on both client and server machines.
When a client attempts to access a resource on Hadoop filesystems (or the
JobTracker, which also uses this authentication method) it sends its credentials, which are looked up on the server, and if an exactly matching record is found then those local permissions will be used to determine read/write access. If no such account is found then the user is treated as "other" in the sense of the permission model.
This means that the crawlers for the HDFS data source should be able to crawl Hadoop or MapR filesystems without any authentication, as long as there is a read (and execute for directories) access for "other" users granted on the target resources. Authenticated users will be able to access resources owned by their equivalent account.
However, the Hadoop crawling described on this page require write access to a
/tmp directory to use as a working directory. In many cases, this directory does not exist, or if it does, it doesn’t have write access to "other" (not authenticated) users. Therefore users of these data sources should make sure that there is a
/tmp directory on the target filesystem that is writable using their local user credentials, be it a recognized user, group, or "other". If a local user is recognized by the server then it’s enough to create a
/tmp directory that is owned by that user. If there is no such user, then the
/tmp directory must be modified to have write permissions for "other" users. The working directory can be modified to be another directory that can be used for temporary working storage that has the correct permissions.
Configuration for a Kerberos Hadoop cluster
Kerberos is a system that provides authenticated access for users and services on a network. Instead of sending passwords in plaintext over the network, encrypted passwords are used to generate time-sensitive tickets which are used for authentication. Kerberos uses symmetric-key cryptography and a trusted third party called a Key Distribution Center (KDC) to authenticate users to a suite of network services. When a user authenticates to the KDC, the KDC sends a set of credentials (a ticket) specific to that session back to the user’s machine.
To work with a Kerberized Hadoop cluster you must have a set of credentials. These are generated by running the "kinit" program. The datasource can be configured to run this program, in which case, the following information must be specified: the full path to the program, the Kerberos principal name, the location of a keytab file and the name of the file in which to store the ticket.