Apache Tika Parser Index Stage
The Apache Tika Parser index stage type includes rules for parsing documents with Apache Tika.
Fusion uses Tika v1.13.
(Note that components of the Solr distribution included with Fusion contain their own Tika jar files; these are not used by Fusion.)
Raw streams create new documents, which Tika then tries to parse a second time. For this reason, avoid using the SDK connector or any other client that streams to the index pipeline.
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
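The escaping rule above can be illustrated with a short Python sketch. It assumes the API accepts a JSON request body, which is why the backslash is doubled there; the `delimiter` key is a hypothetical placeholder and not a property of this stage.

```python
import json

# The value as it would be typed in the UI: a backslash followed by
# the letter "t" (two characters, not an actual tab).
ui_value = r"\t"

# Serializing the same value into a JSON body escapes the backslash,
# producing the \\t form used for API requests.
# "delimiter" is a hypothetical key, used only for illustration.
payload = json.dumps({"delimiter": ui_value})

print(payload)  # {"delimiter": "\\t"}
```

Decoding the JSON body returns the original two-character value, so the UI form and the API form describe the same setting.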
A stage that uses Apache Tika for parsing rich document formats like PDF, Word, etc.
skip - boolean
Set to true to skip this stage.
Default: false
label - string
A unique label for this stage.
<= 255 characters
condition - string
Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.
includeImages - boolean
Default: false
flattenCompound - boolean
Default: false
addFailedDocs - boolean
Default: false
addOriginalContent - boolean
Default: false
contentField - string
Default: _raw_content_
contentEncoding - string
Default: binary
Allowed values: binary, base64
returnXml - boolean
Default: false
keepOriginalStructure - boolean
Default: false
extractHtmlLinks - boolean
Collect links explicitly declared in the document structure (e.g. using HTML tags, bookmarks, etc.).
Default: true
extractOtherLinks - boolean
Use regex-based heuristic extractor to collect likely links from plain text content in all fields.
Default: false
includeContentTypes - array[string]
List of content types to parse
excludeContentTypes - array[string]
List of content types to exclude from parsing
zipBombCompressionRatio - integer
Maximum number of output bytes Fusion will generate per input byte. If you are indexing highly compressed files, you can increase this value to avoid triggering 'Zip Bomb' detection.
Default: 200
zipBombMaxDepth - integer
Sets the maximum XML element nesting level. If you are indexing highly nested files, you can increase this value to avoid triggering 'Zip Bomb' detection.
Default: 200
zipBombMaxPackageEntryDepth - integer
Sets the maximum package entry nesting level. If you are indexing highly nested files, you can increase this value to avoid triggering 'Zip Bomb' detection.
Default: 20
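As a rough illustration of how the properties above fit together, the following Python sketch assembles a stage definition from the documented defaults and applies a few overrides. The `build_tika_stage` helper is hypothetical and not part of Fusion, and the `"tika-parser"` type identifier is an assumption; check both against your Fusion version before relying on them.

```python
import json

# Defaults copied from the property list above.
TIKA_STAGE_DEFAULTS = {
    "skip": False,
    "includeImages": False,
    "flattenCompound": False,
    "addFailedDocs": False,
    "addOriginalContent": False,
    "contentField": "_raw_content_",
    "contentEncoding": "binary",
    "returnXml": False,
    "keepOriginalStructure": False,
    "extractHtmlLinks": True,
    "extractOtherLinks": False,
    "zipBombCompressionRatio": 200,
    "zipBombMaxDepth": 200,
    "zipBombMaxPackageEntryDepth": 20,
}

# Properties that are documented above without a default value.
OPTIONAL_KEYS = {"condition", "includeContentTypes", "excludeContentTypes"}
ALLOWED_ENCODINGS = {"binary", "base64"}


def build_tika_stage(label, **overrides):
    """Return a stage definition dict (hypothetical helper, not a Fusion API)."""
    unknown = set(overrides) - set(TIKA_STAGE_DEFAULTS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"Unknown properties: {sorted(unknown)}")
    if len(label) > 255:
        raise ValueError("label must be at most 255 characters")

    stage = dict(TIKA_STAGE_DEFAULTS, label=label, **overrides)
    if stage["contentEncoding"] not in ALLOWED_ENCODINGS:
        raise ValueError("contentEncoding must be 'binary' or 'base64'")

    stage["type"] = "tika-parser"  # assumption: stage type identifier
    return stage


# Raise the compression ratio for highly compressed input and skip ZIP archives.
stage = build_tika_stage(
    "Parse rich documents",
    zipBombCompressionRatio=500,
    excludeContentTypes=["application/zip"],
)
print(json.dumps(stage, indent=2))
```

Starting from the documented defaults keeps the overrides explicit, so it is easy to see which settings differ from a stock stage.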