Text Parser Stage

The Plain Text parser can split a text file by lines or consume it into a single document.

Options for handling this file type include:

  • Plain Text Parser Fields

  • Number of header rows to skip

  • Split on line end or not

  • Comment character

  • Skip empty lines

  • Charset

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
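The difference between the two forms can be illustrated with a short Python sketch. It assumes the API accepts a JSON request body; using the comment property as the carrier for a tab value is purely illustrative.

```python
import json

# What a user would type in the UI: a backslash followed by "t" (two characters).
ui_value = r"\t"

# Serializing the same value into a JSON body for an API request
# escapes the backslash, giving the \\t form.
payload = json.dumps({"comment": ui_value})
print(payload)  # {"comment": "\\t"}
```

Decoding the JSON body yields the original two-character value again, which is why the escaped form is needed only on the API side.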

Global configuration

These configuration options apply to the parser as a whole.

Property Description

idField

A document field to use as the document ID.

enableMediaTypeDetection

Automatically detect the Content-Type of each document. When this is disabled, every document is treated as application/octet-stream.

maxParserDepth

Maximum number of times a stage may recurse over any document before proceeding to the next stage.

Text parser stage configuration

Property Description

errorHandling

One of the following:

ignore - Ignore errors, drop the current record, and continue parsing the next record or document.

log - Log errors, drop the current record, and continue parsing the next record or document.

fail - Generate an exception and stop parsing.

mark (default) - Create a marker document that is emitted instead of the bad record. The marker document contains the common metadata gathered so far, plus the error message and error class. The parser may also add more details about the error condition.
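The four errorHandling modes can be sketched in Python as follows. This is only an illustration of the behavior described above, not the parser's implementation; the function handle_records, the parse callback, and the marker field names error_message and error_class are hypothetical.

```python
import logging

def handle_records(records, parse, error_handling="mark"):
    """Apply `parse` to each record, reacting to failures per `error_handling`."""
    out = []
    for record in records:
        try:
            out.append(parse(record))
        except Exception as exc:
            if error_handling == "fail":
                raise  # stop parsing entirely
            if error_handling == "log":
                logging.warning("dropping record %r: %s", record, exc)
            elif error_handling == "mark":
                # Emit a marker in place of the bad record (hypothetical field names).
                out.append({"error_message": str(exc),
                            "error_class": type(exc).__name__})
            # "ignore" and "log" drop the record and continue with the next one.
    return out
```

For example, handle_records(["1", "x", "3"], int, "ignore") returns the two parsed integers and silently drops the bad record, while "mark" keeps a placeholder at the position of the failure.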

charset

required

The character set to use when reading the text. The default is detect, which auto-detects the character set.

ignoreBOM

required

Ignore the Byte-Order Mark (BOM) if present and always use the configured character set. When set to false, the character set indicated by a valid BOM overrides the configured character set.
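A minimal Python sketch of the ignoreBOM decision follows. It assumes only the common UTF-8, UTF-16, and UTF-32 byte-order marks are recognized; the helper choose_charset and the BOMS table are hypothetical names, not part of the parser.

```python
import codecs

# UTF-32 marks are listed before UTF-16 because the UTF-32 LE mark
# begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def choose_charset(data: bytes, configured: str, ignore_bom: bool) -> str:
    """Return the charset to decode `data` with."""
    if not ignore_bom:
        for bom, name in BOMS:
            if data.startswith(bom):
                return name  # a valid BOM overrides the configured charset
    return configured
```

With ignore_bom set to True, the configured character set always wins, even when the data starts with a BOM.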

splitLines

Split text into lines to create multiple records; default false.

skipHeaderLines

Skip a number of header lines; default 0.

trimWhitespace

Trim off leading and trailing whitespace from lines; default false.

skipEmptyLines

Skip any empty lines encountered; default false.
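The interaction of splitLines, skipHeaderLines, trimWhitespace, and skipEmptyLines can be sketched in Python as below. The order in which the options are applied, and the choice to leave the text untouched when splitLines is false, are assumptions made for this illustration; the function split_text is hypothetical.

```python
def split_text(text, split_lines=False, skip_header_lines=0,
               trim_whitespace=False, skip_empty_lines=False):
    """Return the list of records produced from `text`."""
    if not split_lines:
        return [text]  # the whole file becomes a single record
    # Drop the requested number of header lines first.
    lines = text.splitlines()[skip_header_lines:]
    if trim_whitespace:
        lines = [line.strip() for line in lines]
    if skip_empty_lines:
        lines = [line for line in lines if line != ""]
    return lines
```

For instance, the input "header\n  a  \n\nb\n" with splitting enabled, one header line skipped, and both trimming and empty-line skipping turned on yields the two records "a" and "b".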

outputField

Name of the output field where text is stored; default body.

maxLength

Maximum number of characters to allow for the body, -1 for unlimited; default 1MB.

maxLineLength

Maximum number of characters to allow for any single line; default 1MB.

commentField

Name of the output field where comments are stored; default comment.

comment

Characters at start of line to indicate a comment; default # (hash).

commentHandling

How to handle comments, one of the following:

ignore - Ignore comments and remove them from the text.

include - Include comments as-is (default).

as_field - Add comments as a field.
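The three commentHandling modes can be sketched in Python as follows. Whether as_field also removes comment lines from the body, and whether the stored comment keeps its leading marker, are assumptions of this sketch; the function apply_comment_handling is hypothetical, while the body and comment keys mirror the default outputField and commentField names.

```python
def apply_comment_handling(lines, comment="#", handling="include"):
    """Split `lines` into body text and comments according to `handling`."""
    body, comments = [], []
    for line in lines:
        is_comment = line.startswith(comment)
        if is_comment and handling == "ignore":
            continue  # remove the comment from the text
        if is_comment and handling == "as_field":
            comments.append(line)  # assumed: moved out of the body
            continue
        body.append(line)  # "include" keeps comments as-is
    return {"body": body, "comment": comments}
```

With the default include mode, the output body is identical to the input lines and the comment list stays empty.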

mediaTypes

An array of types for this parser, which must match the pattern: ^\\/[^\\/]$

pathPatterns

Specify a file name or pattern that must be matched for this parser to run. Forward slashes (/) are used to join names of files inside archives with the archive name.

syntax - One of "glob" or "regex".

pattern - The filename or pattern to match.

Glob examples: z.txt or *.md or /a/*/b/f.txt

Regex examples: z.txt$ or .*\.txt$ or ^/a/[^\/]*/b/f.txt$
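The example patterns above can be tried out with a short Python sketch. It uses Python's fnmatch and re modules as stand-ins, so the matching rules may differ from the parser's own: in fnmatch, * also matches forward slashes, and the regex branch assumes find-style matching anywhere in the path. The function path_matches is hypothetical.

```python
import fnmatch
import re

def path_matches(path: str, syntax: str, pattern: str) -> bool:
    """Check `path` against `pattern` using the given syntax."""
    if syntax == "glob":
        # Case-sensitive shell-style match against the whole path.
        return fnmatch.fnmatchcase(path, pattern)
    if syntax == "regex":
        # Assumed: the pattern may match anywhere unless anchored with ^ or $.
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown syntax: {syntax!r}")

print(path_matches("/a/x/b/f.txt", "glob", "/a/*/b/f.txt"))           # True
print(path_matches("/dir/z.txt", "regex", "z.txt$"))                  # True
print(path_matches("/a/x/y/b/f.txt", "regex", r"^/a/[^\/]*/b/f.txt$"))  # False
```

The last call returns False because [^\/]* cannot span the extra directory level between /a/ and /b/.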

inheritMediaTypes

Set to true to inherit acceptable media types from the parser.