XML Parser Stage

By default, the XML parser stage parses whole XML documents. It can also be configured to parse only specific nodes without loading the entire document into memory, or to split an XML document into multiple documents. XPath-like expressions, such as /posts/row or /posts/record, select the nodes to parse. Nested XML elements are flattened.

To create new documents from selected elements, configure rootPaths.
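The following Python sketch illustrates the general idea of splitting on a root path while streaming. It is not the parser's implementation: the function names, the dotted field names, and the "@" prefix for attributes are assumptions made for illustration, and repeated child elements simply overwrite one another here.

```python
import io
import xml.etree.ElementTree as ET


def flatten(elem, prefix=""):
    """Flatten attributes and nested child text into dotted field names.

    The naming scheme is hypothetical; the real stage may name fields differently.
    """
    name = prefix + elem.tag
    fields = {f"{name}.@{key}": value for key, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        fields[name] = text
    for child in elem:
        fields.update(flatten(child, name + "."))
    return fields


def split_by_root_path(xml_source, root_path):
    """Yield one flat dict per element found at root_path, such as '/posts/row'."""
    target = root_path.strip("/").split("/")
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        if stack == target:
            yield flatten(elem)
            elem.clear()  # drop the subtree that was just emitted
        stack.pop()


sample = (
    b"<posts>"
    b"<row Id='1'><title>A</title><owner><name>Ann</name></owner></row>"
    b"<row Id='2'><title>B</title></row>"
    b"</posts>"
)
docs = list(split_by_root_path(io.BytesIO(sample), "/posts/row"))
```

Each row element becomes its own document, and the nested owner/name element ends up as a single flat field.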

Note
The XML Transformation index pipeline stage is deprecated in favor of this parser stage.

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
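One reading of this tip is that the API accepts JSON, where a backslash must itself be escaped. The Python sketch below shows that effect with the standard json module; the "delimiter" key is a hypothetical property name used only for illustration.

```python
import json

# What would be typed in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# Serializing to JSON doubles the backslash, giving the form used in an API request body.
payload = json.dumps({"delimiter": ui_value})
print(payload)  # {"delimiter": "\\t"}

# Decoding the JSON gives back the original two-character value.
round_trip = json.loads(payload)["delimiter"]
```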

Global configuration

These configuration options apply to the parser as a whole.

Property Description

idField

A document field to use as the document ID.

enableMediaTypeDetection

Automatically detect the Content-Type of each document; disable this to use application/octet-stream.

maxParserDepth

Maximum number of times a stage may recurse over any document before proceeding to the next stage.

XML parser stage configuration

Property Description

mediaTypes

An array of types for this parser, which must match the pattern: ^\\/[^\\/]$

pathPatterns

Specify a file name or pattern that must be matched for this parser to run. Forward slashes (/) are used to join names of files inside archives with the archive name.

syntax - One of "glob" or "regex".

pattern - The filename or pattern to match.

Glob examples: z.txt or *.md or /a/*/b/f.txt

Regex examples: z.txt$ or .*\.txt$ or ^/a/[^\/]*/b/f.txt$
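As a rough illustration of how the two syntaxes differ, the Python sketch below checks the example patterns against sample paths. It is an approximation, not the parser's matcher: the matches function is hypothetical, Python's fnmatch lets * match / (many glob implementations do not), and treating regex patterns as a search rather than a full match is an assumption.

```python
import fnmatch
import re


def matches(path, syntax, pattern):
    """Return True if path matches pattern under the given syntax."""
    if syntax == "glob":
        # Case-sensitive shell-style matching; note that * also matches "/" here.
        return fnmatch.fnmatchcase(path, pattern)
    if syntax == "regex":
        # Search semantics: the pattern may match anywhere unless anchored.
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown syntax: {syntax!r}")


glob_hits = [
    matches("z.txt", "glob", "z.txt"),
    matches("notes.md", "glob", "*.md"),
    matches("/a/x/b/f.txt", "glob", "/a/*/b/f.txt"),
]
regex_hits = [
    matches("archive.zip/z.txt", "regex", r"z.txt$"),
    matches("docs/readme.txt", "regex", r".*\.txt$"),
    matches("/a/x/b/f.txt", "regex", r"^/a/[^\/]*/b/f.txt$"),
]
# The character class [^\/]* cannot cross a slash, so an extra directory level fails.
too_deep = matches("/a/x/y/b/f.txt", "regex", r"^/a/[^\/]*/b/f.txt$")
```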

inheritMediaTypes

"True" to inherit acceptable types from the parser.

errorHandling

One of the following:

ignore - Ignore errors, drop the current record, and continue parsing the next record or document.

log - Log errors, drop the current record, and continue parsing the next record or document.

fail - Generate an exception and stop parsing.

mark (default) - Create a marker document that is emitted instead of the bad record. The marker document contains the common metadata gathered so far, plus the error message and error class. The parser may also add more details about the error condition.
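The four modes can be pictured with the Python sketch below. It only models the behavior described above; parse_records, parse_one, and the marker field names (_error, error_message, error_class) are hypothetical and not taken from the product.

```python
import logging


def parse_records(records, parse_one, error_handling="mark"):
    """Yield parsed records, handling failures according to error_handling."""
    if error_handling not in ("ignore", "log", "fail", "mark"):
        raise ValueError(f"unknown errorHandling value: {error_handling!r}")
    for raw in records:
        try:
            yield parse_one(raw)
        except Exception as exc:
            if error_handling == "fail":
                raise  # stop parsing and surface the exception
            if error_handling == "log":
                logging.warning("dropping record %r: %s", raw, exc)
            elif error_handling == "mark":
                # Emit a marker document in place of the bad record.
                yield {
                    "_error": True,
                    "error_message": str(exc),
                    "error_class": type(exc).__name__,
                }
            # "ignore" and "log" both drop the record and continue.


raw_records = ["1", "x", "3"]
ignored = list(parse_records(raw_records, int, "ignore"))
marked = list(parse_records(raw_records, int, "mark"))
```

With "ignore", the bad record disappears from the output; with "mark", it is replaced by a marker so that downstream stages can see that something was skipped.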

rootPaths

Read XML elements that can be found on specified XML paths and parse them into separate documents.

maxSize

The maximum size of a document.

listHandling

One of the following:

multivalued - Create a single multivalued field containing all items.

index_numbered - Create a separate index-numbered field per list item.
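The difference between the two options can be sketched in Python as follows. The apply_list_handling function is hypothetical, and the field naming for index_numbered (an underscore followed by a zero-based index) is an assumption; the actual naming used by the parser may differ.

```python
def apply_list_handling(field, values, mode):
    """Return the document fields produced for a list of values."""
    if mode == "multivalued":
        # One field that holds every item.
        return {field: list(values)}
    if mode == "index_numbered":
        # One field per item; the "_<index>" suffix is an assumed naming scheme.
        return {f"{field}_{index}": value for index, value in enumerate(values)}
    raise ValueError(f"unknown listHandling value: {mode!r}")


tags = ["xml", "parser"]
multivalued = apply_list_handling("tag", tags, "multivalued")
numbered = apply_list_handling("tag", tags, "index_numbered")
```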