Apache Tika Parser Stage

Apache Tika is a versatile parser that supports many unstructured document formats, including HTML, PDF, Microsoft Office documents, OpenOffice, RTF, audio, video, and images. A complete list of supported formats is available at http://tika.apache.org/.


When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
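The difference between the two forms can be illustrated with a short sketch. It assumes the API request body is JSON, and the property name `delimiter` is hypothetical, chosen only for illustration. The value typed in the UI is a backslash followed by `t`; serializing it to JSON doubles the backslash.

```python
import json

# What a user would type in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# "delimiter" is a hypothetical property name used only for this example.
payload = json.dumps({"delimiter": ui_value})

# The JSON text now contains the escaped form expected by the API.
print(payload)  # {"delimiter": "\\t"}

# Parsing the JSON recovers the same two-character value the UI would hold.
round_trip = json.loads(payload)["delimiter"]
```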

Global configuration

These configuration options apply to the parser as a whole.

Property Description


A document field to use as the document ID.


Automatically detect the Content-Type of each document. When this is disabled, documents are treated as application/octet-stream.


Maximum number of times a stage may recurse over any document before proceeding to the next stage.

Apache Tika parser stage configuration

Property Description


One of the following:

ignore - Ignore errors, drop the current record, and continue parsing the next record or document.

log - Log errors, drop the current record, and continue parsing the next record or document.

fail - Generate an exception and stop parsing.

mark (default) - Create a marker document that is emitted instead of the bad record. The marker document contains the common metadata gathered so far, plus the error message and error class. The parser may also add more details about the error condition.
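The four options above can be sketched as a small dispatch function. This is only an illustration of the described behavior, not the parser's actual implementation; the function name `handle_parse_error` and the field names `error_message` and `error_class` are assumptions made for the example.

```python
from typing import Optional


def handle_parse_error(policy: str, error: Exception,
                       metadata: dict, log: list) -> Optional[dict]:
    """Return a marker document, or None when the bad record is dropped."""
    if policy == "ignore":
        # Drop the record silently and let parsing continue.
        return None
    if policy == "log":
        # Record the error, then drop the record.
        log.append(f"{type(error).__name__}: {error}")
        return None
    if policy == "fail":
        # Stop parsing by propagating the exception.
        raise error
    if policy == "mark":
        # Emit a marker document in place of the bad record.
        # Field names here are hypothetical.
        return {
            **metadata,
            "error_message": str(error),
            "error_class": type(error).__name__,
        }
    raise ValueError(f"unknown error-handling policy: {policy}")
```

For example, calling it with `"mark"`, a `ValueError("bad record")`, and metadata `{"id": "doc1"}` returns a dictionary that carries the original `id` plus the two error fields.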


Include images; default false.


"True" to flatten compound documents; default false.


Add failed documents; default false.


Add original document content (raw bytes); default true.


Content transfer encoding (per RFC 1341), one of binary or base64; default binary.
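As a rough illustration of what the base64 option implies for a consumer of the raw content, the sketch below encodes a byte string and decodes it again. The sample bytes are a stand-in, and how the encoded content is actually stored or delivered is not specified here.

```python
import base64

# Stand-in for the original document content (raw bytes).
raw = b"%PDF-1.7 sample bytes"

# base64: the bytes are represented as ASCII text, safe for text-only transports.
encoded = base64.b64encode(raw).decode("ascii")

# A consumer must decode the text to recover the original bytes.
decoded = base64.b64decode(encoded)
```

With the binary option no such decoding step would be needed, since the bytes are passed through unchanged.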


"True" to return parsed content as XML (instead of HTML); default false.


"True" to return original XML and HTML instead of Tika XML output.


Collect links explicitly declared in document structure (for example, using HTML tags, bookmarks, and so on); default true.


Use regex-based heuristic extractor to collect likely links from plain text content in all fields; default false.


An array of content types to exclude from parsing.


An array of types for this parser, which must match the pattern: ^\\/[^\\/]$


Specify a file name or pattern that must be matched for this parser to run. Forward slashes (/) are used to join names of files inside archives with the archive name.

syntax - One of "glob" or "regex".

pattern - The filename or pattern to match.

Glob examples: z.txt or *.md or /a/*/b/f.txt

Regex examples: z.txt$ or .*\.txt$ or ^/a/[^\/]*/b/f.txt$
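The regex examples above can be tried against sample paths with a short sketch. It assumes the pattern may match anywhere in the path (search semantics), which is an assumption and not something the text confirms; the sample paths are made up. Glob patterns are left out because the exact glob dialect is not stated.

```python
import re

# Regex examples taken from the text above.
PATTERNS = [r"z.txt$", r".*\.txt$", r"^/a/[^\/]*/b/f.txt$"]


def matching_patterns(path: str) -> list:
    """Return the example patterns that match the path (search semantics assumed)."""
    return [p for p in PATTERNS if re.search(p, path)]


# A file inside an archive: the archive name and file name are joined with "/".
in_archive = matching_patterns("archive.zip/z.txt")

# One directory level between /a/ and /b/ satisfies the third pattern.
one_level = matching_patterns("/a/x/b/f.txt")

# Two levels do not, because [^\/]* cannot cross a slash.
two_levels = matching_patterns("/a/x/y/b/f.txt")
```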


"True" to inherit acceptable types from the parser.