Apache Tika is a versatile parser that supports many unstructured document formats, such as HTML, PDF, Microsoft Office documents, OpenOffice, RTF, audio, video, images, and more. A complete list of supported formats is available at http://tika.apache.org/.
To extract text from images when Include images is enabled, Tesseract must be installed on the server hosting Fusion.
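One way to confirm that Tesseract is visible on the Fusion host is to look for its executable on the PATH. The following is a minimal sketch, assuming the executable is named tesseract and that the check runs under the same account and environment as Fusion. The helper name tesseract_available is illustrative and not part of Fusion or Tika.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if an executable with the given name is found on the PATH."""
    # shutil.which returns the full path of the executable, or None if absent.
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found on PATH")
```

A result of "not found" only means the binary is not on the PATH of the current process; Tesseract may still be installed in a location that this check does not see.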
Tip: When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
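The extra backslash in API values follows from JSON string escaping, where a literal backslash must itself be escaped. The sketch below illustrates this with Python's standard json module. The property name exampleValue is hypothetical and used only for illustration; it is not a Tika stage property.

```python
import json

# The two-character sequence backslash + "t", exactly as typed in the UI.
ui_value = r"\t"

# Serializing to JSON, as an API request body would be, escapes the backslash.
# "exampleValue" is a hypothetical property name used only for illustration.
api_body = json.dumps({"exampleValue": ui_value})

print(api_body)  # {"exampleValue": "\\t"}

# Decoding the JSON body recovers the same two characters typed in the UI.
assert json.loads(api_body)["exampleValue"] == ui_value
```

Building request bodies with a JSON library rather than by hand avoids having to add the escapes manually.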
Apache Tika stage-specific properties
| Property | Label | Description and allowed values |
|---|---|---|
| addFailedDocs | Add failed documents | |
| addOriginalContent | Add original document content (raw bytes) | |
| contentEncoding | Content transport encoding of the content (per RFC1341) | Allowed values: binary, base64. |
| enabled | Enable this Parser Stage | |
| errorHandling | Error Handling | Allowed values: ignore, log, fail, mark. |
| excludeContentTypes | Content types to exclude | List of content types to exclude from parsing. |
| extractHtmlLinks | Extract XHTML links | Collect links explicitly declared in the document structure (for example, using HTML tags or bookmarks). |
| extractOtherLinks | Extract other links | Use a regex-based heuristic extractor to collect likely links from plain-text content in all fields. |
| flattenCompound | Flatten compound documents | |
| id | Parser ID | |
| includeImages | Include images | |
| inheritMediaTypes | Match default media types in this Parser Stage | Each parser stage has a built-in list of media types it handles by default. If this setting is true, that list is used along with any additional types provided in the mediaTypes list. If this setting is false, this stage is selected only for media types in the mediaTypes list, and mediaTypes becomes a mandatory property that must contain at least one valid media type. |
| keepOriginalStructure | Return original XML and HTML instead of Tika XML output (only applies if "Return parsed content as XML" is true) | |
| mediaTypes | Media Types to match | Documents with a media type on this list are matched by this parser stage. See inheritMediaTypes (Match default media types in this Parser Stage) for details. |
| outputFieldPrefix | Prefix parsed fields with | Fields extracted by this parser are prefixed with this string. The remainder of the field name is as detected in the stream. Maximum length: 20. Must be empty or match the pattern `^[A-Za-z_][A-Za-z0-9_\-\.]+$`. |
| pathPatterns | File names to parse | Specify a file name or pattern that must be matched for this parser stage to run. Forward slashes ("/") join the names of files inside archives with the archive name. Type: object. |
| returnXml | Return parsed content as XML | |
| type | | Required. Allowed values: tika. |
| zipBombCompressionRatio | Maximum input-to-output byte ratio | Maximum number of output bytes Fusion will generate per input byte. If you are indexing highly compressed files, you can increase this value to avoid triggering "Zip Bomb" detection. |
| zipBombMaxDepth | Maximum nesting depth | The maximum XML element nesting level. If you are indexing highly nested files, you can increase this value to avoid triggering "Zip Bomb" detection. |
| zipBombMaxPackageEntryDepth | Maximum package entry depth | The maximum package entry nesting level. If you are indexing highly nested files, you can increase this value to avoid triggering "Zip Bomb" detection. |
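Several of the constraints in the property table can be checked before a stage definition is submitted. The following is a minimal client-side sketch, not Fusion's own validation. It assumes the stage is represented as a dictionary keyed by the property names above, that inheritMediaTypes is a boolean, and that mediaTypes is a list of strings; the table does not state these value types. The function name check_tika_stage is illustrative.

```python
import re

# Allowed values and limits as published in the property table.
ERROR_HANDLING = {"ignore", "log", "fail", "mark"}
CONTENT_ENCODING = {"binary", "base64"}
PREFIX_PATTERN = re.compile(r"^$|^[A-Za-z_][A-Za-z0-9_\-\.]+$")
PREFIX_MAX_LENGTH = 20


def check_tika_stage(stage):
    """Return a list of problems found in a Tika parser stage definition."""
    problems = []

    if stage.get("type") != "tika":
        problems.append("type is required and must be 'tika'")

    if "errorHandling" in stage and stage["errorHandling"] not in ERROR_HANDLING:
        problems.append("errorHandling must be one of: ignore, log, fail, mark")

    if "contentEncoding" in stage and stage["contentEncoding"] not in CONTENT_ENCODING:
        problems.append("contentEncoding must be one of: binary, base64")

    # An absent prefix is treated as the empty string, which the pattern allows.
    prefix = stage.get("outputFieldPrefix", "")
    if len(prefix) > PREFIX_MAX_LENGTH or not PREFIX_PATTERN.fullmatch(prefix):
        problems.append("outputFieldPrefix violates maxLength or pattern")

    # Assumption: inheritMediaTypes is a boolean. When it is false,
    # mediaTypes must contain at least one entry.
    if stage.get("inheritMediaTypes") is False and not stage.get("mediaTypes"):
        problems.append("mediaTypes needs at least one entry when inheritMediaTypes is false")

    return problems


if __name__ == "__main__":
    example = {
        "type": "tika",
        "errorHandling": "log",
        "inheritMediaTypes": False,
        "mediaTypes": ["application/pdf"],
        "outputFieldPrefix": "tika_",
    }
    print(check_tika_stage(example))  # []
```

Note that the published pattern requires a non-empty prefix to have at least two characters, so a single-letter prefix such as "x" is reported as a problem.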