Fallback Parser Stage

The Fallback parser stage handles data for which Fusion has no dedicated parser. Strictly speaking, Fallback does not parse anything: because it does not know how to interpret the data, it simply copies the raw bytes into a Solr document. If your Fusion parser configuration encounters data it does not know how to parse, such as a proprietary file format, the Fallback stage copies it as-is. If it encounters recognizable data in a more common file type, such as a PDF, Fusion parses the text and metadata using Tika.

The Fallback parser acts as the final stage, picking up any documents that earlier stages have not already parsed. When a stage that recognizes the data comes earlier in the parser, that stage parses the data and the Fallback stage has nothing left to do for that document.
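The "last stage catches everything" behavior described above can be pictured with a small sketch. This is not Fusion's implementation: the stage functions, the `%PDF` check, and the document field names (`parser`, `text`, `raw_bytes`) are hypothetical, chosen only to show a chain in which the final stage accepts whatever earlier stages declined.

```python
from typing import Callable, List, Optional

# A "document" here is just a dict; the field names are illustrative only.
Doc = dict
Stage = Callable[[str, bytes], Optional[Doc]]


def pdf_like_stage(name: str, data: bytes) -> Optional[Doc]:
    """Hypothetical format-specific stage: claims only data starting with %PDF."""
    if data.startswith(b"%PDF"):
        return {"id": name, "parser": "pdf-like", "text": "<extracted text>"}
    return None  # not recognized; leave it for a later stage


def fallback_stage(name: str, data: bytes) -> Optional[Doc]:
    """Stand-in for the fallback idea: accept anything and keep the raw bytes."""
    return {"id": name, "parser": "fallback", "raw_bytes": data}


def run_chain(name: str, data: bytes, stages: List[Stage]) -> Doc:
    """Return the document produced by the first stage that accepts the data."""
    for stage in stages:
        doc = stage(name, data)
        if doc is not None:
            return doc
    raise ValueError(f"no stage accepted {name}")


# The fallback stage goes last so it only sees what nothing else parsed.
STAGES: List[Stage] = [pdf_like_stage, fallback_stage]

recognized = run_chain("report.pdf", b"%PDF-1.7 ...", STAGES)
unrecognized = run_chain("data.xyz", b"\x00\x01\x02", STAGES)
print(recognized["parser"], unrecognized["parser"])
```

Placing the catch-all last matters in this sketch: if `fallback_stage` came first, it would accept every input and the format-specific stage would never run.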

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
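One way to read this tip: the value itself is the two-character sequence backslash plus `t`, and JSON requires the backslash to be escaped when that value is sent in a request body. The sketch below shows this with Python's standard `json` module. The property name `delimiter` is a hypothetical placeholder, not a Fallback parser setting.

```python
import json

# What you would type into a UI field: a backslash followed by the letter t.
ui_value = r"\t"

# "delimiter" is a hypothetical property name used only for illustration.
payload = json.dumps({"delimiter": ui_value})

# The backslash is doubled in the JSON text that would be sent to an API.
print(payload)  # prints {"delimiter": "\\t"}

# Decoding the JSON gives back exactly what was typed in the UI.
assert json.loads(payload)["delimiter"] == ui_value
```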

Global configuration

These configuration options apply to the parser as a whole.

Property Description

idField

A document field to use as the document ID.

enableMediaTypeDetection

Automatically detect the Content-Type of each document. When this is disabled, application/octet-stream is used instead.

maxParserDepth

Maximum number of times a stage may recurse over any document before proceeding to the next stage.
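As a rough sketch, the global options might appear in a parser definition like the one built below. The property names `idField`, `enableMediaTypeDetection`, and `maxParserDepth` come from the table above. The parser `id`, the `parserStages` key, the `type` value `fallback`, and every value shown are assumptions for illustration, so check them against your Fusion version before relying on them.

```python
import json

parser_config = {
    "id": "my-parser",                 # hypothetical parser name
    "idField": "id",                   # illustrative value, not a documented default
    "enableMediaTypeDetection": True,  # illustrative value
    "maxParserDepth": 16,              # illustrative value, not a documented default
    # Assumed layout: an ordered list of stages with the fallback stage last.
    "parserStages": [
        {"type": "fallback"},
    ],
}

# Serialize to JSON, the form a REST request body would normally take.
print(json.dumps(parser_config, indent=2))
```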

Fallback parser stage configuration

Property Description

errorHandling

One of the following:

ignore - Ignore errors, drop the current record, and continue parsing the next record or document.

log - Log errors, drop the current record, and continue parsing the next record or document.

fail - Generate an exception and stop parsing.

mark (default) - Create a marker document that is emitted instead of the bad record. The marker document contains the common metadata gathered so far, plus the error message and error class. Parsers may also add more details about the error condition.

includeImages

Include images; default false.

flattenCompound

"True" to flatten compound documents; default false.

addFailedDocs

Add failed documents; default false.

addOriginalContent

Add original document content (raw bytes); default true.

charset

Required. The default is detect, which auto-detects the character set.

returnXml

"True" to return parsed content as XML (instead of HTML); default false.

keepOriginalStructure

"True" to return original XML and HTML instead of Tika XML output.

extractHtmlLinks

Collect links explicitly declared in document structure (for example, using HTML tags, bookmarks, and so on); default true.

extractOtherLinks

Use a regex-based heuristic extractor to collect likely links from plain-text content in all fields; default false.

excludeContentTypes

An array of content types to exclude from parsing.

mediaTypes

An array of types for this parser, which must match the pattern: ^\\/[^\\/]$

pathPatterns

Specify a file name or pattern that must be matched for this parser to run. Forward slashes (/) are used to join names of files inside archives with the archive name.

syntax - One of "glob" or "regex".

pattern - The filename or pattern to match.

Glob examples: z.txt or *.md or /a/*/b/f.txt

Regex examples: z.txt$ or .*\.txt$ or ^/a/[^\/]*/b/f.txt$

inheritMediaTypes

"True" to inherit acceptable types from the parser.
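To make the stage options above more concrete, here is a minimal sketch that assembles a stage definition and rejects values the table does not list. The property names and default values are taken from the table. The `type` value `fallback`, the helper name `build_fallback_stage`, and the overall JSON shape are assumptions for illustration, not a confirmed Fusion schema.

```python
import json

# Allowed values as listed in the table above.
ALLOWED_ERROR_HANDLING = {"ignore", "log", "fail", "mark"}
ALLOWED_PATTERN_SYNTAX = {"glob", "regex"}


def build_fallback_stage(error_handling="mark", path_patterns=None):
    """Hypothetical helper: build a stage dict using the documented defaults."""
    if error_handling not in ALLOWED_ERROR_HANDLING:
        raise ValueError(f"unsupported errorHandling: {error_handling!r}")

    path_patterns = list(path_patterns or [])
    for entry in path_patterns:
        if entry.get("syntax") not in ALLOWED_PATTERN_SYNTAX:
            raise ValueError(f"unsupported pattern syntax: {entry.get('syntax')!r}")

    return {
        "type": "fallback",  # assumed stage type identifier
        "errorHandling": error_handling,
        "includeImages": False,
        "flattenCompound": False,
        "addFailedDocs": False,
        "addOriginalContent": True,
        "charset": "detect",
        "returnXml": False,
        "extractHtmlLinks": True,
        "extractOtherLinks": False,
        "pathPatterns": path_patterns,
    }


# Restrict the stage to Markdown files, using one of the glob examples above.
stage = build_fallback_stage(
    path_patterns=[{"syntax": "glob", "pattern": "*.md"}],
)
print(json.dumps(stage, indent=2))
```

Validating `errorHandling` and `syntax` locally is a design choice for this sketch: it surfaces a typo before any request is sent, rather than depending on how the server reports an invalid value.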