Fusion 5.12
    Apache Hadoop 2 V1 Connector Configuration Reference

    The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with Apache Hadoop v2.x.

    Deprecation and removal notice

    This connector is deprecated as of Fusion 4.2 and is removed or expected to be removed as of Fusion 5.0.

    For more information about deprecations and removals, including possible alternatives, see Deprecations and Removals.

    There is also a connector for the HDFS filesystem that does not use MapReduce; see HDFS Connector Configuration Reference for details.

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
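    The difference between the two forms follows from JSON string escaping. The sketch below assumes the API accepts a JSON request body; the `delimiter` key is hypothetical and is used only for illustration.

```python
import json

# The value as typed in the UI: a backslash followed by the letter t.
ui_value = "\\t"

# Serializing to JSON escapes the backslash, which gives the form used in the API.
# "delimiter" is a hypothetical key, not a documented property of this connector.
api_body = json.dumps({"delimiter": ui_value})

print(api_body)  # prints {"delimiter": "\\t"}
```

    Building the request body with a JSON library, rather than by hand, avoids having to count backslashes.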

    Connector for using a Hadoop cluster to process documents and forward them to Solr for indexing. This uses a Hadoop job jar to pass arguments to Hadoop for processing with MapReduce.

    description - string

    Optional description for this datasource.

    id - string, required

    Unique name for this datasource.

    >= 1 characters

    Match pattern: ^[a-zA-Z0-9_-]+$
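    The length and pattern constraints on `id` can be checked on the client before a datasource is submitted. This is a minimal sketch; the function name `is_valid_datasource_id` is hypothetical.

```python
import re

# Pattern copied from the reference above. Using fullmatch also rejects a
# trailing newline, which "$" alone would tolerate in Python.
ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def is_valid_datasource_id(value: str) -> bool:
    """Return True if value is non-empty and uses only letters, digits, '_' or '-'."""
    return ID_PATTERN.fullmatch(value) is not None


print(is_valid_datasource_id("hadoop-logs_01"))  # True
print(is_valid_datasource_id("my source"))       # False: contains a space
```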

    pipeline - string, required

    Name of an existing index pipeline for processing documents.

    >= 1 characters

    properties - Properties

    Datasource configuration properties

    db - Connector DB

    Type and properties for a ConnectorDB implementation to use with this datasource.

    aliases - boolean

    Keep track of original URIs that resolved to the current URI. This negatively impacts performance and the size of the DB.

    Default: false

    inlinks - boolean

    Keep track of incoming links. This negatively impacts performance and the size of the DB.

    Default: false

    inv_aliases - boolean

    Keep track of target URIs that the current URI resolves to. This negatively impacts performance and the size of the DB.

    Default: false

    type - string

    Fully qualified class name of ConnectorDb implementation.

    >= 1 characters

    Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb

    fusion_batchsize - integer

    Fusion Client Batch Size

    >= 1

    exclusiveMinimum: true

    Default: 500

    fusion_buffer_timeoutms - integer

    Fusion Client Timeout (ms).

    >= 1

    exclusiveMinimum: true

    Default: 1000

    fusion_endpoints - array[string]

    Default: "http://localhost:8764"

    fusion_fail_on_error - boolean

    Fusion Client Fail on Error

    Default: false

    fusion_login_app_name - string

    Login configuration app name. FusionClient by default.

    Default: FusionClient

    fusion_login_config - string

    File path of the login configuration for a Kerberized Fusion. The file must be placed on every mapper and reducer node.

    fusion_password - string

    Password of the Fusion client user. Leave empty if Kerberos is used.

    fusion_realm - string

    Fusion's realm. If 'native' is selected, the password is mandatory. If 'kerberos' is selected, the login configuration is mandatory.

    Default: NATIVE

    Allowed values: NATIVE, KERBEROS
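    The dependency between `fusion_realm`, `fusion_password`, and `fusion_login_config` can be expressed as a small pre-flight check. This is a sketch only: the function name `check_fusion_auth` is hypothetical, and it assumes the properties are available as a flat dictionary.

```python
def check_fusion_auth(properties: dict) -> list:
    """Return a list of problems found in the fusion_* authentication settings."""
    problems = []
    # NATIVE is the documented default when fusion_realm is not set.
    realm = properties.get("fusion_realm", "NATIVE")
    if realm == "NATIVE":
        if not properties.get("fusion_password"):
            problems.append("fusion_password is mandatory for the NATIVE realm")
    elif realm == "KERBEROS":
        if not properties.get("fusion_login_config"):
            problems.append("fusion_login_config is mandatory for the KERBEROS realm")
    else:
        problems.append("fusion_realm must be NATIVE or KERBEROS")
    return problems


# Hypothetical values, for illustration only.
print(check_fusion_auth({"fusion_realm": "KERBEROS", "fusion_user": "svc@EXAMPLE.COM"}))
```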

    fusion_user - string

    Fusion client user, or the principal if Kerberos is chosen.

    hadoop_home - string

    Path to the Hadoop home directory where $HADOOP_HOME/bin/hadoop can be found. The connector requires access to either a full Hadoop installation, or a Hadoop client provided by your Hadoop distribution that has been configured to access the Hadoop installation.

    >= 1 characters

    hadoop_input - string

    Hadoop input source file/directory

    >= 1 characters

    hadoop_mapper - string

    Hadoop Ingest Mapper

    Default: CSV

    Allowed values: CSV, DIRECTORY, GROK, REGEX, SEQUENCE_FILE, SOLR_XML, WARC, ZIP

    initial_mapping - Initial field mapping

    Provides mapping of fields before documents are sent to an index pipeline.

    condition - string

    Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

    label - string

    A unique label for this stage.

    <= 255 characters

    mappings - array[object]

    List of mapping rules

    object attributes: {
     operation : {
      display name: Operation
      type: string
     }
     source (required) : {
      display name: Source Field
      type: string
     }
     target : {
      display name: Target Field
      type: string
     }
    }

    reservedFieldsMappingAllowed - boolean

    Default: false

    skip - boolean

    Set to true to skip this stage.

    Default: false

    unmapped - Unmapped Fields

    If fields do not match any of the field mapping rules, these rules will apply.

    operation - string

    The type of mapping to perform: move, copy, delete, add, set, or keep.

    Default: copy

    Allowed values: copy, move, delete, set, add, keep

    source - string

    The name of the field to be mapped.

    target - string

    The name of the field to be mapped to.

    job_jar - string

    Path and name of the Hadoop job jar. Unless you are using a custom job jar, the default provided by Fusion is preferred.

    >= 1 characters

    Default: lucidworks-hadoop-job-2.2.7.jar

    kinit_cache - string

    Full path to the Kerberos ticket cache. If this path does not exist, it will be created.

    kinit_cmd - string

    Full path to the 'kinit' binary.

    Default: kinit

    kinit_keytab - string

    Full path to the Kerberos keytab file.

    kinit_principal - string

    Kerberos principal name, for example, username@YOUR-REALM.COM
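    To see how the `kinit_*` properties relate to one another, the sketch below assembles a keytab-based `kinit` command line from them. This reference does not document how Fusion actually invokes `kinit`, so this is an assumption: it uses the standard MIT Kerberos options `-c` (cache name), `-k` (use a keytab), and `-t` (keytab file). The function name and the example paths are hypothetical.

```python
def build_kinit_command(properties: dict) -> list:
    """Assemble a kinit argument list from kinit_* properties (assumed invocation)."""
    # "kinit" is the documented default for kinit_cmd.
    cmd = [properties.get("kinit_cmd", "kinit")]
    cache = properties.get("kinit_cache")
    if cache:
        cmd += ["-c", cache]  # write the ticket to this cache
    # Authenticate from the keytab instead of prompting for a password.
    cmd += ["-k", "-t", properties["kinit_keytab"], properties["kinit_principal"]]
    return cmd


# Hypothetical values, for illustration only.
print(build_kinit_command({
    "kinit_keytab": "/etc/security/fusion.keytab",
    "kinit_principal": "username@YOUR-REALM.COM",
}))
```

    The command is returned as a list rather than a single string so that it can be passed to a process launcher without shell quoting.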

    mapper_args - array[object]

    Parameters for the Hadoop job.

    object attributes: {
     arg_name : {
      display name: name
      type: string
     }
     arg_value : {
      display name: value
      type: string
     }
    }
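    Because `mapper_args` is an array of objects with `arg_name` and `arg_value` string attributes, a plain dictionary of parameters has to be reshaped before it is used. A minimal sketch follows; the helper name `to_mapper_args` and the parameter names in the usage line are hypothetical, not real job options.

```python
def to_mapper_args(params: dict) -> list:
    """Convert {name: value} pairs into the array[object] shape of mapper_args."""
    # Both attributes are typed as strings, so values are converted with str().
    return [
        {"arg_name": name, "arg_value": str(value)}
        for name, value in params.items()
    ]


# Hypothetical parameter names, for illustration only.
print(to_mapper_args({"example.param": "abc", "example.limit": 10}))
```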

    reducers - integer

    (Expert) Depending on the OutputFormat and your system resources, you may want Hadoop to run a reduce step first, so that it does not open too many connections to the output resource.

    exclusiveMinimum: false

    Default: 0

    run_kinit - boolean

    If your Hadoop installation requires job requests to authenticate with Kerberos, this option will allow Fusion to run 'kinit' to get a valid ticket.

    Default: false
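    Taken together, the required top-level fields and a few of the properties above could be assembled as in the sketch below. The nesting of connector settings under `properties` is inferred from the layout of this reference and is an assumption, as are the function name and the example values. Only keys that appear on this page are used.

```python
import json

# Allowed values for hadoop_mapper, as listed in this reference.
ALLOWED_MAPPERS = {
    "CSV", "DIRECTORY", "GROK", "REGEX",
    "SEQUENCE_FILE", "SOLR_XML", "WARC", "ZIP",
}


def build_datasource(ds_id, pipeline, hadoop_home, hadoop_input, mapper="CSV"):
    """Build a datasource configuration as a dict (assumed nesting)."""
    if mapper not in ALLOWED_MAPPERS:
        raise ValueError(f"unsupported hadoop_mapper: {mapper}")
    return {
        "id": ds_id,
        "pipeline": pipeline,
        "properties": {
            "hadoop_home": hadoop_home,
            "hadoop_input": hadoop_input,
            "hadoop_mapper": mapper,
            # The values below are the documented defaults, written out explicitly.
            "fusion_endpoints": ["http://localhost:8764"],
            "reducers": 0,
            "run_kinit": False,
        },
    }


# Hypothetical values, for illustration only.
config = build_datasource("hadoop-csv-logs", "default", "/opt/hadoop", "/data/in")
print(json.dumps(config, indent=2))
```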