CSV Parser Stage

This parser breaks down incoming CSV files into the most efficient components for Fusion to index. It produces one new document per row from the CSV input, excluding comment rows and header rows.

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
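
As a rough illustration of the difference, the sketch below serializes a delimiter setting in Python. It assumes the API body is JSON; only the `delimiter` key comes from this page, and the rest of the request is omitted.

```python
import json

# The value as typed in the UI: a backslash followed by "t"
# (two characters), not a literal tab character.
ui_value = r"\t"

# Serializing to JSON escapes the backslash, which gives the
# \\t form used when the value is sent through the API.
api_body = json.dumps({"delimiter": ui_value})
print(api_body)  # {"delimiter": "\\t"}
```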

Global configuration

These configuration options apply to the parser as a whole.

Property Description

idField

A document field to use as the document ID.

enableMediaTypeDetection

Automatically detect the Content-Type of each document; disable this to use application/octet-stream.

maxParserDepth

Maximum number of times a stage may recurse over any document before proceeding to the next stage.

CSV parser stage configuration

Property Description

errorHandling

One of the following:

ignore - Ignore errors, drop the current record, and continue parsing the next record or document.

log - Log errors, drop the current record, and continue parsing the next record or document.

fail - Generate an exception and stop parsing.

mark (default) - Create a marker document that is emitted instead of the bad record. The error document contains common metadata gathered so far, plus error message and error class. The parser may also add more details about the error condition.
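
The four modes can be pictured with a small Python sketch. This is not Fusion's implementation: `parse_rows`, the `parse` callback, and the marker field names (`error_message`, `error_class`, `row`) are hypothetical and only mirror the descriptions above.

```python
import logging

logger = logging.getLogger("csv_sketch")
MODES = {"ignore", "log", "fail", "mark"}

def parse_rows(rows, parse, error_handling="mark"):
    """Apply `parse` to each row, reacting to failures per `error_handling`."""
    if error_handling not in MODES:
        raise ValueError(f"unknown errorHandling value: {error_handling}")
    docs = []
    for number, row in enumerate(rows, start=1):
        try:
            docs.append(parse(row))
        except ValueError as exc:
            if error_handling == "fail":
                raise  # generate an exception and stop parsing
            if error_handling == "log":
                logger.warning("row %d dropped: %s", number, exc)
            elif error_handling == "mark":
                # Emit a marker document in place of the bad record.
                docs.append({
                    "error_message": str(exc),
                    "error_class": type(exc).__name__,
                    "row": number,
                })
            # "ignore": drop the record silently and continue
    return docs

def to_doc(row):
    return {"value": int(row)}  # raises ValueError on bad input

print(parse_rows(["1", "x", "3"], to_doc, "mark"))
```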

charset

required

Character set of the input. The default is detect, which auto-detects the character set.

ignoreBOM

required

Ignore Byte-Order Mark (BOM) if present and always use the configured character set. When set to false, a valid BOM character set overrides the configured default character set.
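
The interaction between charset and ignoreBOM can be sketched in Python as follows. The `decode` helper is hypothetical, and stripping the BOM bytes before decoding is an assumption; the page only states which character set wins.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# begins with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode(data: bytes, charset: str, ignore_bom: bool = False) -> str:
    """Decode `data`, letting a BOM override `charset` unless ignored."""
    for bom, bom_charset in _BOMS:
        if data.startswith(bom):
            body = data[len(bom):]  # assumption: the BOM itself is dropped
            return body.decode(charset if ignore_bom else bom_charset)
    return data.decode(charset)  # no BOM: the configured charset applies

data = codecs.BOM_UTF8 + "é".encode("utf-8")
print(decode(data, "latin-1", ignore_bom=False))  # BOM wins: decoded as UTF-8
print(decode(data, "latin-1", ignore_bom=True))   # configured charset wins
```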

delimiter

Delimiter character between fields. The following characters are valid:

, (comma)

\\t (tab)

` ` (space)

| (pipe)

^ (caret)

Default is comma if auto-detection is disabled.

quote

Quote character; default is a double quote (") if auto-detection is disabled.

quoteEscape

Quote escape character; default is a double quote (") if auto-detection is disabled.

autoDetect

Attempt to guess the delimiter, quote, quote escape, and comment characters.
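
To give a feel for what delimiter guessing involves, here is a sketch using Python's standard `csv.Sniffer`. It is a stand-in for illustration only and says nothing about the detection algorithm Fusion uses; the sample data is made up, and the candidate list mirrors the valid delimiters listed above.

```python
import csv

# Made-up sample; every line contains the same number of pipes.
sample = "id|name|city\n1|Ada|London\n2|Linus|Helsinki\n"

# Restrict the guess to the delimiters this parser accepts:
# comma, tab, space, pipe, and caret.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t |^")

print(repr(dialect.delimiter))
print(repr(dialect.quotechar))
```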

trimWhitespace

Trim off leading and trailing whitespace from columns; default "true".

hasHeaders

Treat the first row as column headers; default "false".

headers

List of column headers; overrides file headers if present.

skipEmptyLines

Skip any empty lines encountered; default "true".

lineSeparator

Line separator character.

nullValue

A string value to replace nulls with; no default.

emptyValue

A string value to replace empty strings with; no default.

includeRowNumber

Include the row number (line number) in the emitted documents; default "true".
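
The sketch below shows, in Python, how several of these options could combine to turn rows into documents. It is a simplified model rather than Fusion's behavior: `rows_to_documents`, the fallback field names `field_N`, and the `row_number` field are hypothetical, and empty lines are always skipped.

```python
import csv
import io

def rows_to_documents(text, has_headers=False, headers=None,
                      trim_whitespace=True, include_row_number=True):
    """Return one dict per data row, loosely following the options above."""
    reader = csv.reader(io.StringIO(text))
    names = list(headers) if headers else None
    header_pending = has_headers
    docs = []
    for row in reader:
        if not row:
            continue  # empty line, as with skipEmptyLines
        if trim_whitespace:
            row = [value.strip() for value in row]
        if header_pending:
            header_pending = False
            if names is None:
                names = row  # configured headers take precedence
            continue  # the header row does not become a document
        keys = names or [f"field_{i}" for i in range(1, len(row) + 1)]
        doc = dict(zip(keys, row))
        if include_row_number:
            doc["row_number"] = reader.line_num  # physical line number
        docs.append(doc)
    return docs

text = "name, city\n\nAda, London\n"
print(rows_to_documents(text, has_headers=True))
```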

comment

Character at the start of a row that indicates a comment; default is hash (#) if auto-detection is disabled.

commentHandling

How to handle comments. One of the following:

ignore - Ignore all comments (default).

as_field - Add each comment as a field in the document.

as_document - Add each comment to a separate document.
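
One possible reading of the three modes is sketched below in Python. The page does not say which document receives a comment under as_field, so attaching it to the next data row is purely an assumption here, as are the `handle_comments` helper and the `comment` and `raw` field names.

```python
def handle_comments(lines, comment="#", comment_handling="ignore"):
    """Model the commentHandling modes over a list of raw lines."""
    docs = []
    pending = []  # comments waiting for the next data row (as_field only)
    for line in lines:
        if line.startswith(comment):
            text = line[len(comment):].strip()
            if comment_handling == "as_document":
                docs.append({"comment": text})  # separate document
            elif comment_handling == "as_field":
                pending.append(text)
            # "ignore": the comment is discarded
            continue
        doc = {"raw": line}
        if pending:
            doc["comment"] = pending  # assumption: attach to the next row
            pending = []
        docs.append(doc)
    return docs

lines = ["# exported on Monday", "1,Ada", "2,Linus"]
for mode in ("ignore", "as_field", "as_document"):
    print(mode, handle_comments(lines, comment_handling=mode))
```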

maxRowLength

Maximum number of characters to allow in a single line read from the input; default 10485760 (10MB).

maxNumColumns

Maximum number of columns to allow for a single row; default 1000.

maxColumnChars

Maximum number of characters a single column value can have; default 10485760 (10MB).

columnHandling

What to do when a row has too many or too few columns. One of the following:

error - Throw an error.

align - Align the row's columns, using fillValue as the fill string.

default - Do nothing special.

fillValue

A string value to use when aligning the columns (when columnHandling is "align").
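
As a sketch of how columnHandling and fillValue might interact, consider the Python function below. The `handle_columns` helper is hypothetical. Padding short rows with the fill value follows from the description above; truncating long rows is an assumption, since this page does not describe what happens to extra columns.

```python
def handle_columns(row, expected, column_handling="default", fill_value=""):
    """Return `row` adjusted to `expected` columns per `column_handling`."""
    if len(row) == expected or column_handling == "default":
        return row  # nothing special to do
    if column_handling == "error":
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    if column_handling == "align":
        # Pad short rows with fill_value; a negative repeat count yields
        # an empty list, so long rows are only truncated (assumption).
        padded = row + [fill_value] * (expected - len(row))
        return padded[:expected]
    raise ValueError(f"unknown columnHandling value: {column_handling}")

print(handle_columns(["1", "Ada"], 3, "align", fill_value="N/A"))
# ['1', 'Ada', 'N/A']
```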

mediaTypes

An array of types for this parser, which must match the pattern: ^\\/[^\\/]$

pathPatterns

Specify a file name or pattern that must be matched for this parser to run. Forward slashes (/) are used to join names of files inside archives with the archive name.

syntax - One of "glob" or "regex".

pattern - The filename or pattern to match.

Glob examples: z.txt or *.md or /a/*/b/f.txt

Regex examples: z.txt$ or .*\.txt$ or ^/a/[^\/]*/b/f.txt$
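
The example patterns above can be tried out with the Python sketch below. It relies on Python's `fnmatch` and `re` modules, whose semantics may differ from the matcher Fusion uses (for instance, `fnmatch` lets `*` match a forward slash). The `path_matches` helper and the sample paths are hypothetical, and treating a regex as a search rather than a full match is an assumption.

```python
import fnmatch
import re

def path_matches(path, syntax, pattern):
    """Check `path` against `pattern` using the given syntax."""
    if syntax == "glob":
        # Note: in fnmatch, "*" also matches "/" characters.
        return fnmatch.fnmatchcase(path, pattern)
    if syntax == "regex":
        # Assumption: the pattern may match anywhere unless anchored.
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown syntax: {syntax}")

# A file inside an archive is joined to the archive name with "/".
print(path_matches("archive.zip/docs/z.txt", "regex", r"z.txt$"))
print(path_matches("README.md", "glob", "*.md"))
print(path_matches("/a/x/b/f.txt", "glob", "/a/*/b/f.txt"))
```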

inheritMediaTypes

"True" to inherit acceptable types from the parser.