Fusion 5.9

    Apache Tika Parser Index Stage

    The Apache Tika Parser index stage type includes rules for parsing documents with Apache Tika. Fusion uses Tika v1.13. (Note that components of the Solr distribution included with Fusion contain their own Tika jar files; these are not used by Fusion.)

    Compatibility Issues

    Raw streams create new documents, which Tika then tries to parse again. For this reason, avoid using the SDK connector or any other client that streams to the index pipeline.

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
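
    As a rough illustration of the escaping rule above, the sketch below serializes a value for an API request body with Python's standard json module. The property name exampleField is hypothetical and is not a documented property of this stage; it stands in for any string value that contains a backslash sequence.

```python
import json

# What you would type in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# "exampleField" is a hypothetical property name used only for illustration.
body = json.dumps({"exampleField": ui_value})

# In the JSON text sent to the API, the backslash itself is escaped,
# so the two-character value appears as \\t.
print(body)
```

    Letting a JSON library do the serialization avoids escaping the value by hand.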

    A stage that uses Apache Tika for parsing rich document formats like PDF, Word, etc.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

    includeImages - boolean

    Default: false

    flattenCompound - boolean

    Default: false

    addFailedDocs - boolean

    Default: false

    addOriginalContent - boolean

    Default: false

    contentField - string

    Default: _raw_content_

    contentEncoding - string

    Default: binary

    Allowed values: binary, base64

    returnXml - boolean

    Default: false

    keepOriginalStructure - boolean

    Default: false

    extractHtmlLinks - boolean

    Collect links explicitly declared in the document structure (e.g., using HTML tags, bookmarks, etc.).

    Default: true

    extractOtherLinks - boolean

    Use a regex-based heuristic extractor to collect likely links from plain text content in all fields.

    Default: false

    includeContentTypes - array[string]

    List of content types to parse

    excludeContentTypes - array[string]

    List of content types to exclude from parsing

    zipBombCompressionRatio - integer

    Maximum number of output bytes Fusion will generate per input byte. If you are indexing highly compressed files, you may increase this value to avoid triggering 'Zip Bomb' detection.

    Default: 200
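
    The ratio described for zipBombCompressionRatio can be read as simple arithmetic: output bytes compared against input bytes times the limit. The function below is only a sketch of that comparison under this reading; it is not Fusion's or Tika's actual detection code, and the name exceeds_compression_ratio is made up for the example.

```python
def exceeds_compression_ratio(input_bytes: int, output_bytes: int, max_ratio: int = 200) -> bool:
    """Return True when more than max_ratio output bytes are produced per input byte."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return output_bytes > input_bytes * max_ratio


# A 1 KiB input that expands to 150 KiB stays under the default limit of 200.
print(exceeds_compression_ratio(1024, 150 * 1024))                 # False
# The same input expanding to 300 KiB exceeds the default limit.
print(exceeds_compression_ratio(1024, 300 * 1024))                 # True
# Raising the limit to 400 lets the highly compressed file through.
print(exceeds_compression_ratio(1024, 300 * 1024, max_ratio=400))  # False
```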

    zipBombMaxDepth - integer

    Sets the maximum XML element nesting level. If you are indexing highly nested files, you may increase this value to avoid triggering 'Zip Bomb' detection.

    Default: 200

    zipBombMaxPackageEntryDepth - integer

    Sets the maximum package entry nesting level. If you are indexing highly nested files, you may increase this value to avoid triggering 'Zip Bomb' detection.

    Default: 20
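
    To show how the properties above fit together, here is a minimal sketch that assembles a stage definition as a Python dictionary and prints it as JSON. The property names and default values come from this reference. The "type" key, its "tika-parser" value, the helper name build_tika_stage, and the sample label are assumptions made for illustration, so check them against your Fusion deployment before use.

```python
import json


def build_tika_stage(label, include_types=None, exclude_types=None):
    """Build a stage definition from the properties and defaults listed above."""
    if len(label) > 255:
        raise ValueError("label must be at most 255 characters")
    return {
        # ASSUMPTION: the "type" key and its value are illustrative only.
        "type": "tika-parser",
        "label": label,
        "skip": False,
        "includeImages": False,
        "flattenCompound": False,
        "addFailedDocs": False,
        "addOriginalContent": False,
        "contentField": "_raw_content_",
        "contentEncoding": "binary",  # allowed values: "binary", "base64"
        "returnXml": False,
        "keepOriginalStructure": False,
        "extractHtmlLinks": True,
        "extractOtherLinks": False,
        "includeContentTypes": list(include_types or []),
        "excludeContentTypes": list(exclude_types or []),
        "zipBombCompressionRatio": 200,
        "zipBombMaxDepth": 200,
        "zipBombMaxPackageEntryDepth": 20,
    }


# Hypothetical label; only PDF content is included for parsing here.
stage = build_tika_stage("tika-rich-docs", include_types=["application/pdf"])
print(json.dumps(stage, indent=2))
```

    Spelling out every default keeps the sketch self-describing; in practice you would likely set only the properties you want to change.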