> ## Documentation Index
> Fetch the complete documentation index at: https://doc.lucidworks.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Apache Hadoop 2 V1

> The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with Apache Hadoop v2.x.

export const schema = {
  "category": "Other",
  "categoryPriority": 1,
  "description": "Connector for using a Hadoop cluster to process documents and forward them to Solr for indexing. This uses a Hadoop job jar to pass arguments to Hadoop for processing with MapReduce.",
  "properties": {
    "category": {
      "default": "Hadoop cluster",
      "hints": ["hidden", "readonly"],
      "title": "Category",
      "type": "string"
    },
    "connector": {
      "description": "Connector Type.",
      "hints": ["hidden"],
      "minLength": 1,
      "title": "Connector Type",
      "type": "string"
    },
    "description": {
      "description": "Optional description for this datasource.",
      "title": "Description",
      "type": "string"
    },
    "id": {
      "description": "Unique name for this datasource.",
      "minLength": 1,
      "pattern": "^[a-zA-Z0-9_-]+$",
      "title": "Datasource ID",
      "type": "string"
    },
    "pipeline": {
      "description": "Name of an existing index pipeline for processing documents.",
      "minLength": 1,
      "title": "Pipeline ID",
      "type": "string"
    },
    "properties": {
      "description": "Datasource configuration properties",
      "properties": {
        "collection": {
          "description": "Collection documents will be indexed to.",
          "hints": ["hidden"],
          "pattern": "^[a-zA-Z0-9_-]+$",
          "title": "Collection",
          "type": "string"
        },
        "db": {
          "description": "Type and properties for a ConnectorDB implementation to use with this datasource.",
          "hints": ["hidden"],
          "properties": {
            "aliases": {
              "default": false,
              "description": "Keep track of original URI-s that resolved to the current URI. This negatively impacts performance and size of DB.",
              "title": "Process Aliases?",
              "type": "boolean"
            },
            "inlinks": {
              "default": false,
              "description": "Keep track of incoming links. This negatively impacts performance and size of DB.",
              "title": "Process Inlinks?",
              "type": "boolean"
            },
            "inv_aliases": {
              "default": false,
              "description": "Keep track of target URI-s that the current URI resolves to. This negatively impacts performance and size of DB.",
              "title": "Process Inverted Aliases?",
              "type": "boolean"
            },
            "type": {
              "default": "com.lucidworks.connectors.db.impl.MapDbConnectorDb",
              "description": "Fully qualified class name of ConnectorDb implementation.",
              "minLength": 1,
              "title": "Implementation Class Name",
              "type": "string"
            }
          },
          "required": ["type"],
          "title": "Connector DB",
          "type": "object"
        },
        "fusion_batchsize": {
          "default": 500,
          "description": "Fusion Client Batch Size",
          "exclusiveMinimum": true,
          "hints": ["advanced"],
          "minimum": 1,
          "title": "Batch Size",
          "type": "integer"
        },
        "fusion_buffer_timeoutms": {
          "default": 1000,
          "description": "Fusion Client Timeout (ms).",
          "exclusiveMinimum": true,
          "hints": ["advanced"],
          "minimum": 1,
          "title": "Timeout (ms)",
          "type": "integer"
        },
        "fusion_endpoints": {
          "default": ["http://localhost:8764"],
          "items": {
            "default": "http://localhost:8764",
            "format": "url",
            "type": "string"
          },
          "minItems": 1,
          "title": "List of Fusion Endpoints",
          "type": "array"
        },
        "fusion_fail_on_error": {
          "default": false,
          "description": "Fusion Client Fail on Error",
          "hints": ["advanced"],
          "title": "Fail on Error",
          "type": "boolean"
        },
        "fusion_login_app_name": {
          "default": "FusionClient",
          "description": "Login configuration application name. Defaults to FusionClient.",
          "title": "Config App Name",
          "type": "string"
        },
        "fusion_login_config": {
          "description": "Path to the JAAS login configuration file used when Fusion is Kerberized. The file must be present on every mapper/reducer node.",
          "title": "Login Config",
          "type": "string"
        },
        "fusion_password": {
          "description": "Fusion client user's password. Leave empty if Kerberos is used.",
          "hints": ["secret"],
          "title": "Password",
          "type": "string"
        },
        "fusion_realm": {
          "default": "NATIVE",
          "description": "Fusion's realm. If 'NATIVE' is selected, the password is mandatory. If 'KERBEROS' is selected, the login configuration is mandatory.",
          "enum": ["NATIVE", "KERBEROS"],
          "title": "Fusion client's Authentication",
          "type": "string"
        },
        "fusion_user": {
          "description": "Fusion client user, or the principal if Kerberos is chosen.",
          "title": "User/Principal",
          "type": "string"
        },
        "hadoop_home": {
          "description": "Path to the Hadoop home directory where $HADOOP_HOME/bin/hadoop can be found. The connector requires access to either a full Hadoop installation, or a Hadoop client provided by your Hadoop distribution that has been configured to access the Hadoop installation.",
          "minLength": 1,
          "title": "Hadoop home",
          "type": "string"
        },
        "hadoop_input": {
          "description": "Hadoop input source file/directory",
          "minLength": 1,
          "title": "Input source",
          "type": "string"
        },
        "hadoop_mapper": {
          "default": "CSV",
          "description": "Hadoop Ingest Mapper",
          "enum": ["CSV", "DIRECTORY", "GROK", "REGEX", "SEQUENCE_FILE", "SOLR_XML", "WARC", "ZIP"],
          "title": "Mapper",
          "type": "string"
        },
        "initial_mapping": {
          "category": "Field Transformation",
          "categoryPriority": 6,
          "description": "Provides mapping of fields before documents are sent to an index pipeline.",
          "hints": ["advanced"],
          "properties": {
            "condition": {
              "description": "Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.",
              "hints": ["code", "code/javascript", "advanced"],
              "title": "Condition",
              "type": "string"
            },
            "label": {
              "description": "A unique label for this stage.",
              "hints": ["advanced"],
              "maxLength": 255,
              "title": "Label",
              "type": "string"
            },
            "mappings": {
              "description": "List of mapping rules",
              "hints": ["advanced"],
              "items": {
                "properties": {
                  "operation": {
                    "default": "copy",
                    "description": "The type of mapping to perform: move, copy, delete, add, set, or keep.",
                    "enum": ["copy", "move", "delete", "set", "add", "keep"],
                    "hints": ["advanced"],
                    "title": "Operation",
                    "type": "string"
                  },
                  "source": {
                    "description": "The name of the field to be mapped.",
                    "hints": ["advanced"],
                    "title": "Source Field",
                    "type": "string"
                  },
                  "target": {
                    "description": "The name of the field to be mapped to.",
                    "hints": ["advanced"],
                    "title": "Target Field",
                    "type": "string"
                  }
                },
                "required": ["source"],
                "type": "object"
              },
              "title": "Field Mappings",
              "type": "array"
            },
            "reservedFieldsMappingAllowed": {
              "default": false,
              "hints": ["advanced"],
              "title": "Allow System Fields Mapping?",
              "type": "boolean"
            },
            "skip": {
              "default": false,
              "description": "Set to true to skip this stage.",
              "hints": ["advanced"],
              "title": "Skip This Stage",
              "type": "boolean"
            },
            "unmapped": {
              "description": "If fields do not match any of the field mapping rules, these rules will apply.",
              "hints": ["advanced"],
              "properties": {
                "operation": {
                  "default": "copy",
                  "description": "The type of mapping to perform: move, copy, delete, add, set, or keep.",
                  "enum": ["copy", "move", "delete", "set", "add", "keep"],
                  "hints": ["advanced"],
                  "title": "Operation",
                  "type": "string"
                },
                "source": {
                  "description": "The name of the field to be mapped.",
                  "hints": ["advanced"],
                  "title": "Source Field",
                  "type": "string"
                },
                "target": {
                  "description": "The name of the field to be mapped to.",
                  "hints": ["advanced"],
                  "title": "Target Field",
                  "type": "string"
                }
              },
              "required": ["source"],
              "title": "Unmapped Fields",
              "type": "object"
            }
          },
          "title": "Initial field mapping",
          "type": "object",
          "unsafe": false
        },
        "job_jar": {
          "default": "lucidworks-hadoop-job-2.2.7.jar",
          "description": "Path and name of the Hadoop job jar. Unless you are using a custom job jar, the default provided by Fusion is preferred.",
          "minLength": 1,
          "title": "Job Jar",
          "type": "string"
        },
        "job_jar_path": {
          "description": "The hadoop job path added by the connector.",
          "hints": ["hidden", "readonly"],
          "title": "job_jar_path",
          "type": "string"
        },
        "kinit_cache": {
          "description": "Full path of 'kerberos' cache. If this path does not exist, it will be created.",
          "title": "'kerberos' cache",
          "type": "string"
        },
        "kinit_cmd": {
          "default": "kinit",
          "description": "Full path to the 'kinit' binary.",
          "title": "'kinit' command",
          "type": "string"
        },
        "kinit_keytab": {
          "description": "Full path to the Kerberos keytab file.",
          "title": "'kerberos' keytab",
          "type": "string"
        },
        "kinit_principal": {
          "description": "Kerberos principal name, e.g., username@YOUR-REALM.COM",
          "title": "'kerberos' principal",
          "type": "string"
        },
        "mapper_args": {
          "description": "Parameters for the Hadoop job.",
          "items": {
            "properties": {
              "arg_name": {
                "description": "Parameter Name",
                "enum": ["csvFieldMapping", "csvDelimiter", "csvFirstLineComment", "csvStrategy", "idField", "add.subdirectories", "grok.uri", "grok.config.path", "grok.additional.patterns", "com.lucidworks.hadoop.ingest.RegexIngestMapper.regex", "com.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields", "com.lucidworks.hadoop.ingest.RegexIngestMapper.match"],
                "title": "name",
                "type": "string"
              },
              "arg_value": {
                "description": "Parameter Value",
                "hints": ["lengthy"],
                "title": "value",
                "type": "string"
              }
            },
            "type": "object"
          },
          "title": "Job Jar arguments",
          "type": "array"
        },
        "reducers": {
          "default": 0,
          "description": "(Expert) Depending on the OutputFormat and your system resources, you may wish to have Hadoop do a reduce step first so as to not open too many connections to the output resource",
          "exclusiveMinimum": false,
          "hints": ["advanced"],
          "minimum": 0,
          "title": "Number of Reducers",
          "type": "integer"
        },
        "run_kinit": {
          "default": false,
          "description": "If your Hadoop installation requires job requests to authenticate with Kerberos, this option will allow Fusion to run 'kinit' to get a valid ticket.",
          "title": "Run 'kinit'",
          "type": "boolean"
        }
      },
      "propertyGroups": [{
        "label": "Fusion Client",
        "properties": ["fusion_endpoints", "fusion_realm", "fusion_user", "fusion_password", "fusion_login_config", "fusion_login_app_name", "fusion_batchsize", "fusion_buffer_timeoutms", "fusion_fail_on_error"]
      }, {
        "label": "Kerberos Authentication",
        "properties": ["run_kinit", "kinit_cmd", "kinit_principal", "kinit_keytab", "kinit_cache"]
      }, {
        "label": "Field Mapping",
        "properties": ["initial_mapping"]
      }],
      "required": ["hadoop_home", "job_jar", "hadoop_input", "hadoop_mapper", "fusion_endpoints", "fusion_user"],
      "title": "Properties",
      "type": "object"
    },
    "type": {
      "description": "Datasource type supported by the selected connector type.",
      "hints": ["hidden"],
      "minLength": 1,
      "title": "Datasource Type",
      "type": "string"
    },
    "type_description": {
      "default": "Connector for using a Hadoop cluster to process documents and forward them to Solr for indexing. This uses a Hadoop job jar to pass arguments to Hadoop for processing with MapReduce.",
      "hints": ["hidden", "readonly"],
      "title": "Type Description",
      "type": "string"
    }
  },
  "required": ["id", "connector", "type", "pipeline", "properties"],
  "title": "Apache Hadoop 2",
  "type": "object",
  "unsafe": false
};

export const SchemaParamFields = ({schema}) => {
  const sanitize = str => {
    if (typeof str !== "string") return str;
    return str.replace(/^"(.*)"$/s, "$1").replace(/\\/g, "").replace(/"/g, "'");
  };
  const formatDescription = str => {
    const s = sanitize(str);
    return (/[.!?]\)*$/).test(s) ? s : `${s}.`;
  };
  const {description, properties = {}, required: requiredProps = []} = schema;
  const visibleProps = useMemo(() => Object.entries(properties).filter(([, prop]) => !prop.hints?.includes("hidden")), [properties]);
  return <div>
      {description && <p>{formatDescription(description)}</p>}

      {visibleProps.map(([name, prop]) => {
    const isRequired = requiredProps.includes(name);
    const hasDefault = prop.default !== undefined;
    const rawDefault = prop.default;
    const isComplexDefault = hasDefault && (typeof rawDefault === "object" || typeof rawDefault === "string" && (rawDefault.length > 20 || rawDefault.includes('"')));
    const fieldProps = {
      key: name,
      body: prop.title || name,
      type: prop.type,
      ...prop.title && ({
        post: [<><span className="text-stone-400 dark:text-stone-500">API property: </span>{name}</>]
      }),
      ...isRequired && ({
        required: true
      }),
      ...!isComplexDefault && hasDefault ? {
        default: sanitize(String(rawDefault))
      } : {}
    };
    const isObject = prop.type === "object" && prop.properties;
    const isArrayOfObjects = prop.type === "array" && prop.items?.type === "object" && prop.items.properties;
    return <ParamField {...fieldProps}>
            {prop.description && <p>{formatDescription(prop.description)}</p>}

            {isComplexDefault && <div className="flex">
                <p>
                  <strong>Default:</strong>
                </p>
                <pre className="!my-0">
                  <code>
                    {JSON.stringify(rawDefault, null, 2)}
                  </code>
                </pre>
              </div>}

            {isArrayOfObjects && <div className="flex">
              <p>
                <strong>Object attributes:</strong>
              </p>
              <pre className="!my-0">
                <code>
                  {'{\n'}
                  {Object.entries(prop.items.properties).map(([iname, iprop]) => <>
                      {`  ${iname}`}
                      {prop.items?.required?.includes(iname) && <span style={{
      color: 'red'
    }}> required</span>}
                      {`: {\n    display name: ${sanitize(iprop.title || '')}\n    type: ${iprop.type}\n  }\n`}
                    </>)}
                  {'}'}
                </code>
              </pre>
              </div>}

            {isObject && <Expandable title="properties">
                <SchemaParamFields schema={{
      properties: prop.properties,
      required: prop.required
    }} />
              </Expandable>}
          </ParamField>;
  })}
    </div>;
};

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};


<Callout icon="plug" color="#A4C6F7" iconType="solid">
  **Compatible with Fusion version:** 4.0.0 through 4.2.6
</Callout>

<Note>
  Deprecation and removal notice

  This connector is deprecated as of Fusion 4.2 and is removed or expected to be removed as of Fusion 5.0.

  For more information about deprecations and removals, including possible alternatives, see [Deprecations and Removals](/docs/fusion-connectors/deprecations-and-removals).
</Note>

There is also a non-MapReduce-enabled connector for the HDFS filesystem; see the [HDFS Connector Configuration Reference](/docs/fusion-connectors/connectors/v1/hdfs) for details.

## How Hadoop connectors work

The Hadoop crawlers take full advantage of the scaling abilities of the MapReduce architecture and use all of the nodes available in the cluster, just like any other MapReduce job. This matters for performance: the crawlers are designed to move a lot of content in parallel, as fast as the system’s capabilities allow, from its raw state to the Fusion index. The Hadoop crawlers work in stages:

1. Create one or more [SequenceFiles](https://cwiki.apache.org/confluence/display/HADOOP2/SequenceFile) from the raw content. This can be done in one of two ways:
   * If the source files are available in a shared Hadoop filesystem, prepare a list of source files and their locations as a SequenceFile. The raw contents of each file are not processed until step 2.
   * If the source files are not available in a shared Hadoop filesystem, prepare a list of source files and the raw content, stored as a [Behemoth](https://github.com/DigitalPebble/behemoth) document. This process is currently sequential and can take a significant amount of time if there are many documents or if they are very large.
2. Run a MapReduce job to extract text and metadata from the raw content using Apache Tika. This is similar to the Fusion approach of extracting content from crawled documents, except it is done with MapReduce.
3. Run a MapReduce job to send the extracted content from HDFS to the index pipeline for further processing.

The first step of the crawl process converts the input content into a SequenceFile. To do this, the entire contents of each input file must be read into memory so that it can be written out to the SequenceFile. Take care that the system never loads a file that is larger than the Java heap size of the process. In certain cases, Behemoth can convert existing files, such as SequenceFiles, to Behemoth SequenceFiles. Contact Lucidworks for possible alternative approaches.
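
As a rough illustration of the memory constraint above, the sketch below flags input files whose size exceeds a given Java heap size before a crawl is started. The `findOversizedFiles` helper, the file list, and the 1 GiB heap figure are assumptions made for this example; none of them are part of the connector.

```javascript
// Hypothetical pre-flight check: not part of the connector.
// Flags files that are larger than the Java heap available to the process,
// since each file is read fully into memory when the SequenceFile is built.
function findOversizedFiles(files, heapBytes) {
  return files
    .filter((file) => file.sizeBytes > heapBytes)
    .map((file) => file.path);
}

const GIB = 1024 ** 3;

// Example input: paths and sizes are made up for illustration.
const inputFiles = [
  { path: "/data/logs/small.csv", sizeBytes: 200 * 1024 ** 2 },
  { path: "/data/archive/huge.warc", sizeBytes: 3 * GIB },
];

const oversized = findOversizedFiles(inputFiles, 1 * GIB);
console.log(oversized); // [ '/data/archive/huge.warc' ]
```

In practice a file would need to be comfortably smaller than the heap, so a real check would likely use a lower threshold than the full heap size.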

The processing approach is currently "all or nothing" when it comes to ingesting the raw content and all steps must be completed each time, regardless of whether the raw content has changed. Future versions may allow the crawler to restart from the SequenceFile conversion process. In the meantime, incremental crawling is not supported for this connector.

## Fusion login configuration file

The Fusion login configuration file required by the `fusion_login_config` datasource parameter is a Java Authentication and Authorization Service (JAAS) configuration file. It must be present on every mapper/reducer node that sends data to Fusion.

Here is a sample file that describes the structure expected by Fusion:

```js wrap  theme={"dark"}
FusionClient {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 useTicketCache=false
 storeKey=false
 keyTab="/home/keytabs/hadoop.keytab";
};
```

`FusionClient` is the application name and can be set to anything. If you change it, be sure to set the `fusion_login_app_name` parameter to the same value. The other parameters can be configured as required, and the `keyTab` value should point to the location on the node where the keytab file can be found.
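
The schema's `fusion_realm` description states that the NATIVE realm needs a password and the KERBEROS realm needs a login configuration. The sketch below checks those two rules on a set of datasource properties. The `checkFusionAuth` helper, the principal, and the file path are hypothetical values chosen for illustration.

```javascript
// Sketch of the realm rule stated in the fusion_realm description:
// NATIVE requires fusion_password, KERBEROS requires fusion_login_config.
function checkFusionAuth(props) {
  const errors = [];
  const realm = props.fusion_realm || "NATIVE"; // NATIVE is the schema default
  if (!props.fusion_user) {
    errors.push("fusion_user is required");
  }
  if (realm === "NATIVE" && !props.fusion_password) {
    errors.push("fusion_password is required for the NATIVE realm");
  }
  if (realm === "KERBEROS" && !props.fusion_login_config) {
    errors.push("fusion_login_config is required for the KERBEROS realm");
  }
  return errors;
}

// Example values are placeholders; the principal and file path are made up.
const kerberosProps = {
  fusion_realm: "KERBEROS",
  fusion_user: "fusion-indexer@EXAMPLE.COM",
  fusion_login_config: "/etc/fusion/jaas-client.conf",
  fusion_login_app_name: "FusionClient", // should match the JAAS entry name
};

const kerberosErrors = checkFusionAuth(kerberosProps);
const incompleteErrors = checkFusionAuth({ fusion_realm: "KERBEROS", fusion_user: "u" });

console.log(kerberosErrors); // []
console.log(incompleteErrors);
```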

<LwTemplate />

## Learn more

<Accordion title="Configure the Hadoop Client">
  The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with [Apache Hadoop](http://hadoop.apache.org/) v2.x.

  The connector services must be able to access the Hadoop client at `$HADOOP_HOME/bin/hadoop`. Either install the connectors on one of the nodes of the Hadoop cluster (such as the `nameNode`), or install a client supported by your specific distribution on the same server as the connectors. The Hadoop client must be configured to access the Hadoop cluster so that the crawler can reach it for content processing.

  <Note>Instructions for setting up any of the supported Hadoop distributions are beyond the scope of this document. We recommend reading one of the many tutorials found online or one of the books on Hadoop.</Note>

  This connector writes to the `hadoop.tmp.dir` and the `/tmp` directory in HDFS, so Fusion should be started by a user who has read/write permissions for both.

  ## Permission issues

  With any flavor of Hadoop, be aware of how Hadoop and systems based on Hadoop (such as CDH or MapR) handle permissions for services that communicate with other nodes.

  Hadoop services execute under specific user credentials: a quadruplet consisting of user name, group name, numeric user id, numeric group id. Installations that follow the manual usually use user 'mapr' and group 'mapr' (or similar), with numeric ids assigned by the operating system (e.g., uid=1000, gid=20). When the system is configured to enforce user permissions (which is the default in some systems), any client that connects to Hadoop services has to use a quadruplet that exists on the server. This means that ALL values in this quadruplet must be equal between the client and the server, i.e., an account with the same user, group, uid, and gid must exist on both client and server machines.

  When a client attempts to access a resource on Hadoop filesystems (or the `JobTracker`, which also uses this authentication method) it sends its credentials, which are looked up on the server, and if an exactly matching record is found then those local permissions will be used to determine read/write access. If no such account is found then the user is treated as "other" in the sense of the permission model.

  This means that the crawlers for the HDFS data source should be able to crawl Hadoop or MapR filesystems without any authentication, as long as there is a read (and execute for directories) access for "other" users granted on the target resources. Authenticated users will be able to access resources owned by their equivalent account.

  However, the Hadoop crawling described on this page requires write access to a `/tmp` directory to use as a working directory. In many cases, this directory does not exist, or if it does, it does not grant write access to "other" (unauthenticated) users. Users of these data sources should therefore make sure that there is a `/tmp` directory on the target filesystem that is writable using their local user credentials, be it as a recognized user, group, or "other". If a local user is recognized by the server, it is enough to create a `/tmp` directory that is owned by that user. If there is no such user, the `/tmp` directory must be modified to have write permissions for "other" users. The working directory can also be changed to another directory that has the correct permissions and can be used for temporary working storage.
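
  The account lookup described in this section can be sketched as follows. This is only an illustration of the matching rule as stated here, not Hadoop's actual implementation; the `resolveAccess` helper and the sample accounts are hypothetical.

  ```javascript
  // Sketch of the lookup described above, not Hadoop's actual implementation.
  // A client matches a server account only if all four values are equal.
  function resolveAccess(client, serverAccounts) {
    const match = serverAccounts.find(
      (account) =>
        account.user === client.user &&
        account.group === client.group &&
        account.uid === client.uid &&
        account.gid === client.gid
    );
    // A matching account gets its own permissions; anything else is "other".
    return match ? "account" : "other";
  }

  // Example accounts, using the illustrative ids from the text above.
  const serverAccounts = [{ user: "mapr", group: "mapr", uid: 1000, gid: 20 }];

  const sameIds = resolveAccess({ user: "mapr", group: "mapr", uid: 1000, gid: 20 }, serverAccounts);
  const differentUid = resolveAccess({ user: "mapr", group: "mapr", uid: 501, gid: 20 }, serverAccounts);

  console.log(sameIds); // account
  console.log(differentUid); // other
  ```

  The second call shows why matching names alone are not enough: a differing numeric id causes the client to be treated as "other".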

  ## Configuration for a Kerberos Hadoop cluster

  Kerberos is a system that provides authenticated access for users and services on a network. Instead of sending passwords in plaintext over the network, encrypted passwords are used to generate time-sensitive tickets which are used for authentication. Kerberos uses symmetric-key cryptography and a trusted third party called a Key Distribution Center (KDC) to authenticate users to a suite of network services. When a user authenticates to the KDC, the KDC sends a set of credentials (a ticket) specific to that session back to the user’s machine.

  To work with a Kerberized Hadoop cluster you must have a set of credentials. These are generated by running the `kinit` program. The datasource can be configured to run this program, in which case the following information must be specified: the full path to the program, the Kerberos principal name, the location of a keytab file, and the name of the file in which to store the ticket.
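
  In the datasource schema, these values correspond to the `run_kinit`, `kinit_cmd`, `kinit_principal`, `kinit_keytab`, and `kinit_cache` properties. The sketch below reports which of them are still missing when `run_kinit` is enabled. The `missingKinitSettings` helper and the sample paths are assumptions for illustration only.

  ```javascript
  // Sketch: when run_kinit is enabled, the text above lists four values that
  // must be supplied. This helper reports which of them are missing.
  const KINIT_PROPERTIES = ["kinit_cmd", "kinit_principal", "kinit_keytab", "kinit_cache"];

  function missingKinitSettings(props) {
    if (!props.run_kinit) return []; // kinit is not run, nothing to check
    // kinit_cmd has a schema default of "kinit", so apply it before checking.
    const effective = { kinit_cmd: "kinit", ...props };
    return KINIT_PROPERTIES.filter((name) => !effective[name]);
  }

  // Paths and principal below are placeholders, not defaults.
  const kinitProps = {
    run_kinit: true,
    kinit_cmd: "/usr/bin/kinit",
    kinit_principal: "username@YOUR-REALM.COM",
    kinit_keytab: "/home/keytabs/hadoop.keytab",
  };

  const missing = missingKinitSettings(kinitProps);
  console.log(missing); // [ 'kinit_cache' ]
  ```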
</Accordion>

## Configuration

<Tip>
  When entering configuration values in the UI, use *unescaped* characters, such as `\t` for the tab character. When entering configuration values in the API, use *escaped* characters, such as `\\t` for the tab character.
</Tip>
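
To make the escaping rule concrete, the sketch below serializes a `mapper_args` entry with `JSON.stringify`. It assumes the connector should receive the two-character sequence `\t` (a backslash followed by `t`), which is what you would type in the UI. In the JSON body sent to the API, that backslash appears escaped as `\\t`. The `csvDelimiter` argument name comes from the schema; the rest is illustrative.

```javascript
// The value as typed in the UI: a backslash followed by the letter t.
// In JavaScript source the backslash itself must be escaped to write it.
const uiValue = "\\t";

const mapperArg = { arg_name: "csvDelimiter", arg_value: uiValue };

// JSON serialization escapes the backslash, giving the form used in API requests.
const apiBody = JSON.stringify(mapperArg);

console.log(uiValue.length); // 2
console.log(apiBody); // {"arg_name":"csvDelimiter","arg_value":"\\t"}
```

Parsing `apiBody` again with `JSON.parse` returns the original two-character value, which is why the escaped and unescaped forms describe the same setting.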

<SchemaParamFields schema={schema} />
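
As a minimal sketch of how the required fields in the schema fit together, the example below checks a datasource object against the two `required` lists and the `id` pattern. The `validateDatasource` helper and all example values are hypothetical; in particular, the `connector` and `type` values are stand-ins for the hidden values that Fusion supplies.

```javascript
// Required fields copied from the schema on this page.
const REQUIRED_TOP_LEVEL = ["id", "connector", "type", "pipeline", "properties"];
const REQUIRED_PROPERTIES = [
  "hadoop_home", "job_jar", "hadoop_input",
  "hadoop_mapper", "fusion_endpoints", "fusion_user",
];
const ID_PATTERN = /^[a-zA-Z0-9_-]+$/; // pattern for the datasource id

function validateDatasource(ds) {
  const errors = REQUIRED_TOP_LEVEL
    .filter((name) => ds[name] === undefined)
    .map((name) => `missing ${name}`);
  const props = ds.properties || {};
  for (const name of REQUIRED_PROPERTIES) {
    if (props[name] === undefined) errors.push(`missing properties.${name}`);
  }
  if (typeof ds.id === "string" && !ID_PATTERN.test(ds.id)) {
    errors.push("id may only contain letters, digits, '_' and '-'");
  }
  return errors;
}

// All values are placeholders; connector and type in particular are
// hypothetical stand-ins for the hidden values Fusion fills in.
const datasource = {
  id: "hadoop-logs",
  connector: "example-connector",
  type: "example-type",
  pipeline: "default",
  properties: {
    hadoop_home: "/opt/hadoop",
    job_jar: "lucidworks-hadoop-job-2.2.7.jar",
    hadoop_input: "/data/logs",
    hadoop_mapper: "CSV",
    fusion_endpoints: ["http://localhost:8764"],
    fusion_user: "admin",
  },
};

const datasourceErrors = validateDatasource(datasource);
console.log(datasourceErrors); // []
```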
