HDFS V1Connector Configuration Reference

This connector crawls content stored in the Hadoop Distributed File System (HDFS). It traverses the Hadoop file system as it would a regular Unix filesystem.

V1 deprecation and removal notice

Starting in Fusion 5.12.0, all V1 connectors are deprecated. This means they are no longer being actively developed and will be removed in Fusion 5.13.0.

The replacement for this connector is in active development at this time and will be released at a future date.

If you are using this connector, you must migrate to the replacement connector or a supported alternative before upgrading to Fusion 5.13.0. We recommend migrating to the replacement connector as soon as possible to avoid any disruption to your workflows.

See also the Hadoop connectors, which use MapReduce to distribute the crawl processes. For Fusion 4.x, see the following connectors reference pages:

When there is a lot of content to process, these MapReduce-enabled connectors will be significantly faster.

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
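
As a rough illustration of the API case, the Python sketch below serializes a request-body fragment whose value is the two-character sequence backslash plus t. The key name `delimiter` is hypothetical and is not one of this connector's properties; it only shows how the escaped form appears in the JSON text.

```python
import json

# Hypothetical key, used only for illustration; it is not a property of this connector.
# The Python literal "\\t" is two characters: a backslash followed by "t".
fragment = {"delimiter": "\\t"}

# json.dumps escapes the backslash, so the serialized JSON text contains \\t.
body = json.dumps(fragment)
print(body)
```

In the UI, the same value would be typed as \t, without the extra backslash.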

Connector for HDFS or similar Hadoop filesystems.

id - string, required

Unique name for this datasource.

>= 1 characters

Match pattern: ^[a-zA-Z0-9_-]+$
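
A candidate id can be checked against this pattern before it is submitted. The following is a minimal sketch; the function name `is_valid_datasource_id` is an illustrative choice and not part of any Fusion API.

```python
import re

# Pattern copied from the "Match pattern" line of the id property.
ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def is_valid_datasource_id(candidate: str) -> bool:
    """Return True if candidate is non-empty and uses only letters, digits, '_' and '-'."""
    return ID_PATTERN.fullmatch(candidate) is not None


print(is_valid_datasource_id("hdfs-docs_01"))  # letters, digits, '-' and '_' only
print(is_valid_datasource_id("hdfs docs"))     # contains a space
```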

pipeline - string, required

Name of an existing index pipeline for processing documents.

>= 1 characters

description - string

Optional description for this datasource.

parserId - string

Parser used when parsing raw content. For some connectors, an advanced setting is available to retry parsing if an error occurs.

properties - Properties

Datasource configuration properties

db - Connector DB

Type and properties for a ConnectorDB implementation to use with this datasource.

type - string

Fully qualified class name of ConnectorDb implementation.

>= 1 characters

Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb

inlinks - boolean

Keep track of incoming links. This negatively impacts performance and size of DB.

Default: false

aliases - boolean

Keep track of original URIs that resolved to the current URI. This negatively impacts performance and size of DB.

Default: false

inv_aliases - boolean

Keep track of target URIs that the current URI resolves to. This negatively impacts performance and size of DB.

Default: false

url - string

The start link: the URI of the HDFS location where the crawl begins. The NameNode port is 8020 by default, e.g., hdfs://hmbrnasnfs2.highmark.com:8020/datalake/hm/phi/refined/UnstructuredData/NEPA

>= 1 characters

Match pattern: .*:.*
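
To see how a start link breaks down into host, port, and path, the sketch below parses one with Python's standard library and falls back to port 8020 when none is given. The hostname `namenode.example.com` is a placeholder, and the helper `describe_start_link` is hypothetical; this is a client-side check only and says nothing about how the connector itself resolves the URI.

```python
from urllib.parse import urlsplit

DEFAULT_NAMENODE_PORT = 8020  # default stated in the url property description


def describe_start_link(url: str) -> dict:
    """Split an hdfs:// start link into its parts, applying the default port if absent."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_NAMENODE_PORT,
        "path": parts.path,
    }


# Placeholder host; no port is given, so the default is applied.
info = describe_start_link("hdfs://namenode.example.com/data/docs")
print(info)
```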

max_docs - integer

Maximum number of documents to fetch. The default (-1) means no limit.

>= -1

exclusiveMinimum: false

Default: -1

max_bytes - integer

Maximum size (in bytes) of documents to fetch or -1 for unlimited file size.

>= -1

exclusiveMinimum: false

Default: 10485760
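
The default of 10485760 bytes is 10 MiB (10 × 1024 × 1024). The sketch below shows one plausible reading of the limit; treating a file of exactly `max_bytes` as allowed is an assumption, and `within_size_limit` is a hypothetical helper rather than connector code.

```python
DEFAULT_MAX_BYTES = 10485760  # default from the max_bytes property
UNLIMITED = -1                # documented value for "no size limit"


def within_size_limit(size_in_bytes: int, max_bytes: int = DEFAULT_MAX_BYTES) -> bool:
    """Assumed reading: -1 disables the check; otherwise size must not exceed max_bytes."""
    return max_bytes == UNLIMITED or size_in_bytes <= max_bytes


print(DEFAULT_MAX_BYTES == 10 * 1024 * 1024)
print(within_size_limit(25_000_000))             # larger than the default limit
print(within_size_limit(25_000_000, UNLIMITED))  # limit disabled
```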

index_directories - boolean

Set to true to add directories to the index as documents. If set to false, directories will not be added to the index, but they will still be traversed for documents.

Default: false

max_threads - integer

The maximum number of threads to use for fetching data. Note: Each thread creates a new connection to the repository, which may improve overall throughput but also requires more system resources, including CPU and memory.

Default: 1

add_failed_docs - boolean

Set to true to add documents even if they partially fail processing. Failed documents will be added with as much metadata as available, but may not include all expected fields.

Default: false

crawl_item_timeout - integer

Maximum time, in milliseconds, allowed to fetch any individual document.

exclusiveMinimum: true

Default: 600000

maximum_connections - integer

Maximum number of concurrent connections to the filesystem. A large number of documents could cause a large number of simultaneous connections to the repository and lead to errors or degraded performance. In some cases, reducing this number may help resolve performance issues.

Default: 1000

initial_mapping - Initial field mapping

Provides mapping of fields before documents are sent to an index pipeline.

skip - boolean

Set to true to skip this stage.

Default: false

label - string

A unique label for this stage.

<= 255 characters

condition - string

Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

reservedFieldsMappingAllowed - boolean

Default: false

retentionMappings - array[object]

Fields that should be kept or deleted

Default:

object attributes: {
  field (required): {
    display name: Field
    type: string
  }
  operation: {
    display name: Operation
    type: string
  }
}

updateMappings - array[object]

Values that should be added to or set on a field. When a value is added, any values previously on the field will be retained. When a value is set, any values previously on the field will be overwritten.

Default:

object attributes: {
  field (required): {
    display name: Field
    type: string
  }
  value (required): {
    display name: Value
    type: string
  }
  operation: {
    display name: Operation
    type: string
  }
}
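
The difference between adding and setting a value can be sketched on a plain dictionary that stands in for a document. The operation names "add" and "set" are assumptions drawn from the description of updateMappings, and `apply_update` is a hypothetical helper, not the field mapping stage's actual code.

```python
def apply_update(doc: dict, field: str, value: str, operation: str) -> dict:
    """Return a copy of doc with one update mapping applied.

    The operation names "add" and "set" are assumed from the property description.
    """
    result = {name: list(values) for name, values in doc.items()}
    if operation == "add":
        # Existing values on the field are retained.
        result.setdefault(field, []).append(value)
    elif operation == "set":
        # Existing values on the field are overwritten.
        result[field] = [value]
    else:
        raise ValueError(f"unsupported operation: {operation}")
    return result


doc = {"category": ["reports"]}
added = apply_update(doc, "category", "hdfs", "add")
replaced = apply_update(doc, "category", "hdfs", "set")
print(added)
print(replaced)
```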

translationMappings - array[object]

Fields that should be moved or copied to another field. When a field is moved, the values from the source field are moved over to the target field and the source field is removed. When a field is copied, the values from the source field are copied over to the target field and the source field is retained.

Default: [{"source":"fetch_time","target":"fetch_time_dt","operation":"move"}, {"source":"ds:description","target":"description","operation":"move"}]

object attributes: {
  source (required): {
    display name: Source Field
    type: string
  }
  target (required): {
    display name: Target Field
    type: string
  }
  operation: {
    display name: Operation
    type: string
  }
}
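
The move and copy behavior described for translationMappings can be sketched as follows, using the default fetch_time mapping as input. The operation name "move" appears in the defaults; "copy" is inferred from the description. Appending to any values already on the target field is an assumption, and `apply_translation` is a hypothetical helper.

```python
def apply_translation(doc: dict, source: str, target: str, operation: str) -> dict:
    """Return a copy of doc with one translation mapping applied."""
    result = {name: list(values) for name, values in doc.items()}
    if source not in result:
        return result  # nothing to move or copy
    if operation == "move":
        values = result.pop(source)    # source field is removed
    elif operation == "copy":
        values = list(result[source])  # source field is retained
    else:
        raise ValueError(f"unsupported operation: {operation}")
    # Assumption: values are appended to whatever the target already holds.
    result.setdefault(target, []).extend(values)
    return result


doc = {"fetch_time": ["2024-01-01T00:00:00Z"]}
moved = apply_translation(doc, "fetch_time", "fetch_time_dt", "move")
copied = apply_translation(doc, "fetch_time", "fetch_time_dt", "copy")
print(moved)
print(copied)
```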

unmappedRule - Unmapped Fields

Fields not mapped by the above rules. By default, any remaining fields will be kept on the document.

keep - boolean

Keep all unmapped fields

Default: true

delete - boolean

Delete all unmapped fields

Default: false

fieldToMoveValuesTo - string

Move all unmapped field values to this field

fieldToCopyValuesTo - string

Copy all unmapped field values to this field

valueToAddToUnmappedFields - string

Add this value to all unmapped fields

valueToSetOnUnmappedFields - string

Set this value on all unmapped fields

converter - string

Fully-qualified classname for a custom converter to produce valid SolrInputDocuments extracted from Hadoop Sequence or MapReduce files.

with_kerberos - boolean

Set to true to use Kerberos to authenticate to HDFS for access to the content.

Default: false

kerberos_user - string

Kerberos principal name, e.g., 'username@YOUR-REALM.COM'.

kerberos_keytab - string

Full path to the Kerberos keytab file.
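
The three Kerberos properties are typically set together. The sketch below assembles them and applies two sanity checks; the checks, the helper name `kerberos_properties`, and the keytab path are illustrative assumptions (including the assumption of a Unix-style absolute path), not validation that Fusion is known to perform.

```python
def kerberos_properties(principal: str, keytab_path: str) -> dict:
    """Build the Kerberos-related properties, with basic client-side checks."""
    if "@" not in principal:
        raise ValueError("principal should look like 'username@YOUR-REALM.COM'")
    if not keytab_path.startswith("/"):
        # Assumes a Unix-style filesystem on the host that reads the keytab.
        raise ValueError("kerberos_keytab should be a full path")
    return {
        "with_kerberos": True,
        "kerberos_user": principal,
        "kerberos_keytab": keytab_path,
    }


# Hypothetical keytab location, for illustration only.
props = kerberos_properties("username@YOUR-REALM.COM", "/etc/security/keytabs/fusion.keytab")
print(props)
```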

crawl_depth - integer

Number of levels in a directory or site tree to descend for documents.

>= -1

exclusiveMinimum: false

Default: -1

bounds - string

Limits the crawl to a specific directory sub-tree, hostname or domain.

Default: tree

Allowed values: tree, host, domain, none

include_paths - array[string]

Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.

exclude_paths - array[string]

Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.
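
How include_paths and exclude_paths combine can be sketched as a filter over URIs. Two assumptions are made here: each pattern must match the whole URI, and a URI needs to match at least one include pattern when any are configured. The connector uses its own regular expression engine, so Python's `re` is only a stand-in, and the host and patterns are placeholders.

```python
import re


def uri_allowed(uri: str, include_paths: list, exclude_paths: list) -> bool:
    """Assumed semantics: match any include pattern (if set) and no exclude pattern."""
    if include_paths and not any(re.fullmatch(p, uri) for p in include_paths):
        return False
    return not any(re.fullmatch(p, uri) for p in exclude_paths)


include = [r".*/reports/.*"]
exclude = [r".*\.tmp"]
base = "hdfs://namenode.example.com:8020/data"  # placeholder host

print(uri_allowed(base + "/reports/q1.pdf", include, exclude))
print(uri_allowed(base + "/reports/q1.tmp", include, exclude))
print(uri_allowed(base + "/other/q1.pdf", include, exclude))
```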

include_extensions - array[string]

List the file extensions to be fetched. Note: Files with possible matching MIME types but non-matching file extensions will be skipped. Extensions should be listed without periods, using whitespace to separate items (e.g., 'pdf zip').
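
The extension format ('pdf zip', no periods, whitespace-separated) can be illustrated with a small filter. Case-insensitive comparison and fetching everything when no extensions are listed are both assumptions, and the helper names are hypothetical.

```python
import posixpath


def parse_extensions(raw: str) -> set:
    """Split a whitespace-separated list such as 'pdf zip' into a set of extensions."""
    return {ext.lower() for ext in raw.split()}


def has_included_extension(path: str, extensions: set) -> bool:
    """Assumed semantics: an empty set means no filtering; comparison ignores case."""
    if not extensions:
        return True
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return ext in extensions


allowed = parse_extensions("pdf zip")
print(has_included_extension("/data/docs/report.PDF", allowed))
print(has_included_extension("/data/docs/notes.docx", allowed))
```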

commit_on_finish - boolean

Set to true to send a commit request to Solr after the last batch has been fetched, so that the documents are committed to the index.

Default: true

verify_access - boolean

Set to true to require successful connection to the filesystem before saving this datasource.

Default: true
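
Putting several of the properties above together, the sketch below builds a datasource configuration as a Python dictionary and serializes it to JSON. Only field names from this reference are used, with their documented defaults. The id, pipeline name, and host are placeholders, and nesting url and the crawl settings under properties is an assumption, since the listing above does not show indentation. A real request may also need fields that this page does not list.

```python
import json

ALLOWED_BOUNDS = {"tree", "host", "domain", "none"}

datasource = {
    "id": "hdfs-docs",                 # placeholder; must match ^[a-zA-Z0-9_-]+$
    "pipeline": "hdfs-docs-pipeline",  # placeholder; must name an existing index pipeline
    "description": "Example HDFS crawl",
    "properties": {                    # assumed nesting
        "url": "hdfs://namenode.example.com:8020/data/docs",  # placeholder host
        "max_docs": -1,
        "max_bytes": 10485760,
        "index_directories": False,
        "max_threads": 1,
        "crawl_depth": -1,
        "bounds": "tree",
        "commit_on_finish": True,
        "verify_access": True,
    },
}

# Guard against a typo in the one enumerated value used above.
assert datasource["properties"]["bounds"] in ALLOWED_BOUNDS

payload = json.dumps(datasource, indent=2)
print(payload)
```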