Deprecation and removal notice: This parser is deprecated as of Managed Fusion 5.8.0 and is expected to be removed in a later version. Use the asynchronous Tika parsing method instead. For more information, see Asynchronous Tika Parsing.
Apache Tika is a versatile parser that supports many unstructured document formats, such as HTML, PDF, Microsoft Office, OpenOffice, RTF, audio, video, images, and more. A complete list of supported formats is available at Apache Tika. This stage is not compatible with asynchronous Tika parsing.
Important: To perform image text extraction when Include images is enabled, contact your Lucidworks representative for details.
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
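The escaping rule above can be illustrated with a minimal Python sketch. It assumes the API request body is JSON and uses the standard `json` module to show how a value typed in the UI as `\t` looks once it is serialized. The field name `delimiter` is hypothetical and is not taken from this parser's configuration schema.

```python
import json

# What a user would type into a UI field: a backslash followed by "t"
# (two characters, not an actual tab character).
ui_value = r"\t"

# Hypothetical field name, used for illustration only.
payload = json.dumps({"delimiter": ui_value})

# The serialized JSON text carries the doubled backslash that the API expects.
print(payload)

# Decoding the JSON recovers the original two-character value.
decoded = json.loads(payload)["delimiter"]
```

Building the request body with a JSON library, rather than by hand, applies the extra backslash automatically.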
Parse Office documents (ppt/docx/pdf), HTML files, images (jpeg/tiff), and more. See “Supported Formats” at https://tika.apache.org/ for a full list. This stage is deprecated. Use ‘Apache Tika Container Parser’ instead. This stage doesn’t work with async-parsing.