Apache Tika Parser Index Stage

Table of Contents

Compatibility Issues
Configuration

The Apache Tika Parser index stage type includes rules for parsing documents with Apache Tika. Fusion uses Tika v1.13. (Note that components of the Solr distribution included with Fusion contain their own Tika jar files; these are not used by Fusion.)

Compatibility Issues

Raw streams create new documents, which Tika then tries to parse again. For this reason, avoid using the SDK connector or any other client that streams to the index pipeline.

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
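The API rule above matches ordinary JSON string escaping, where a backslash inside a string must itself be escaped. The following is a minimal sketch of that reading, assuming the API request body is JSON; the property name `exampleValue` is hypothetical and is not a property of this stage.

```python
import json

# The value as typed in the UI: two characters, a backslash followed by "t".
ui_value = r"\t"

# "exampleValue" is a hypothetical property name used only for illustration.
body = json.dumps({"exampleValue": ui_value})

# JSON escapes the backslash, so the serialized body carries \\t.
print(body)

# Decoding the body recovers the same two-character value.
assert json.loads(body)["exampleValue"] == ui_value
```

Building the request body with a JSON serializer, rather than by hand, applies the escaping automatically.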

This stage uses Apache Tika to parse rich document formats such as PDF and Word.

skip - boolean

Set to true to skip this stage.

Default: false

label - string

A unique label for this stage.

<= 255 characters

condition - string

Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

includeImages - boolean

Default: false

flattenCompound - boolean

Default: false

addFailedDocs - boolean

Default: false

addOriginalContent - boolean

Default: false

contentField - string

Default: _raw_content_

contentEncoding - string

Default: binary

Allowed values: binary, base64

returnXml - boolean

Default: false

keepOriginalStructure - boolean

Default: false

extractHtmlLinks - boolean

Collect links explicitly declared in the document structure (for example, using HTML tags or bookmarks).

Default: true

extractOtherLinks - boolean

Use a regex-based heuristic extractor to collect likely links from plain text content in all fields.

Default: false

includeContentTypes - array[string]

List of content types to parse

excludeContentTypes - array[string]

List of content types to exclude from parsing

zipBombCompressionRatio - integer

Maximum number of output bytes Fusion will generate per input byte. If you are indexing highly compressed files, you may increase this value to avoid triggering 'Zip Bomb' detection.

Default: 200

zipBombMaxDepth - integer

Sets the maximum XML element nesting level. If you are indexing highly nested files, you may increase this value to avoid triggering 'Zip Bomb' detection.

Default: 200

zipBombMaxPackageEntryDepth - integer

Sets the maximum package entry nesting level. If you are indexing highly nested files, you may increase this value to avoid triggering 'Zip Bomb' detection.

Default: 20
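
The properties above can be collected into a single configuration object. The following is a minimal sketch, not Fusion's own client code: the helper `build_tika_stage` and the label `parse-archives` are hypothetical, and a real stage definition also needs fields that this page does not list (such as the stage type identifier), which are left out here rather than guessed. The sketch only fills in the documented defaults and checks the two documented constraints (label length and the allowed contentEncoding values).

```python
import json

ALLOWED_CONTENT_ENCODINGS = {"binary", "base64"}
MAX_LABEL_LENGTH = 255

# Defaults exactly as listed in the property reference above.
DOCUMENTED_DEFAULTS = {
    "skip": False,
    "includeImages": False,
    "flattenCompound": False,
    "addFailedDocs": False,
    "addOriginalContent": False,
    "contentField": "_raw_content_",
    "contentEncoding": "binary",
    "returnXml": False,
    "keepOriginalStructure": False,
    "extractHtmlLinks": True,
    "extractOtherLinks": False,
    "zipBombCompressionRatio": 200,
    "zipBombMaxDepth": 200,
    "zipBombMaxPackageEntryDepth": 20,
}

# Documented properties that have no default.
OPTIONAL_PROPERTIES = {"condition", "includeContentTypes", "excludeContentTypes"}


def build_tika_stage(label, **overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS) - OPTIONAL_PROPERTIES
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    if len(label) > MAX_LABEL_LENGTH:
        raise ValueError("label must be at most 255 characters")
    config = {**DOCUMENTED_DEFAULTS, **overrides, "label": label}
    if config["contentEncoding"] not in ALLOWED_CONTENT_ENCODINGS:
        raise ValueError("contentEncoding must be 'binary' or 'base64'")
    return config


# Example: raise the compression ratio for highly compressed archives.
stage = build_tika_stage(
    "parse-archives",
    zipBombCompressionRatio=400,
    excludeContentTypes=["application/x-msdownload"],
)
print(json.dumps(stage, indent=2))
```

Reading the zipBombCompressionRatio description literally, the default of 200 would allow up to 200,000,000 output bytes for a 1,000,000-byte input, and the override of 400 in this example doubles that allowance.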