
      Apache Tika Parser Stage

      This stage is deprecated. Please use Forked Apache Tika Parser Stage instead.

      Apache Tika is a versatile parser that supports many unstructured document formats, including HTML, PDF, Microsoft Office documents, OpenOffice, RTF, audio, video, and images. A complete list of supported formats is available at http://tika.apache.org/.

      To extract text from images when Include images is enabled, Tesseract must be installed on the server hosting Fusion.
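
      As a quick sanity check before enabling Include images, the sketch below looks for a Tesseract executable on the PATH. It assumes the binary is named tesseract and that the script runs on the Fusion host as the same user that runs Fusion; the helper name tesseract_available is hypothetical and not part of Fusion or Tika.

```python
import shutil


def tesseract_available(binary_name: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH.

    Assumption: the OCR binary is called "tesseract" and is discoverable
    through PATH; adjust binary_name if your installation differs.
    """
    return shutil.which(binary_name) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found on PATH; image text extraction may not work")
```

      Finding the binary on PATH does not guarantee that Fusion can use it, so treat a positive result as a first check only.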

      When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
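
      The difference between the two forms follows from JSON string escaping, which a short sketch can illustrate. The field name delimiter is hypothetical and used only for illustration; it is not taken from this stage's configuration schema.

```python
import json

# The value as it would be typed into the UI: two characters,
# a backslash followed by the letter t.
ui_value = r"\t"

# Serializing to JSON escapes the backslash, producing the form
# expected in an API request body.
# "delimiter" is a hypothetical field name, not a documented property.
payload = json.dumps({"delimiter": ui_value})

print(payload)  # prints {"delimiter": "\\t"}

# Decoding the JSON gives back the original two-character value.
assert json.loads(payload)["delimiter"] == ui_value
```

      Building the request body with a JSON library rather than by hand avoids having to add the extra backslash manually.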
