Apache Tika Parser Stage
This stage is deprecated. Use the Forked Apache Tika Parser Stage instead.
Apache Tika is a versatile parser that supports many unstructured document formats, including HTML, PDF, Microsoft Office documents, OpenOffice, RTF, audio, video, and images. A complete list of supported formats is available at http://tika.apache.org/.
To extract text from images when Include images is enabled, Tesseract should be installed on the server hosting Fusion.
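One way to confirm the Tesseract requirement is to check for the binary on the server before enabling Include images. The following is a minimal sketch, not part of Fusion: it assumes the executable is named `tesseract` and is resolved through the `PATH` of the account that runs Fusion, which may not match your installation. The helper names `find_tesseract` and `tesseract_version` are hypothetical.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH.

    Assumption: the binary is called "tesseract" and is found via PATH.
    """
    return shutil.which(executable)


def tesseract_version(path):
    """Run `<path> --version` and return the first line of its output."""
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract releases print the version banner to stderr, so check both.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; image text extraction may not work")
    else:
        print(f"found {path}: {tesseract_version(path)}")
```

Run the script as the same user that runs Fusion, since a different account may have a different `PATH`.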
When entering configuration values in the UI, use unescaped characters, such as