CSV Parser Stage


This parser splits incoming CSV files into components that Fusion can index efficiently. It produces one new document per row of the CSV input, excluding comment rows and header rows.
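
The one-document-per-row behavior can be illustrated with a short Python sketch built on the standard csv module. This is not Fusion's implementation; the function name rows_to_documents and the sample data are made up for the example, and the "#" comment prefix is simply the documented non-auto-detect default.

```python
import csv
import io

def rows_to_documents(text, comment="#"):
    """Yield one dict per data row, skipping comment rows and the header row.

    Illustrative only: comment rows are dropped line by line before parsing,
    so quoted multi-line values that start with the comment character are
    not handled.
    """
    lines = [ln for ln in io.StringIO(text) if not ln.startswith(comment)]
    reader = csv.reader(lines)
    headers = next(reader, None)  # first non-comment row is treated as the header
    if headers is None:
        return
    for row in reader:
        yield dict(zip(headers, row))

sample = "# a comment row\nid,name\n1,Ada\n2,Grace\n"
docs = list(rows_to_documents(sample))
```

Here the four input lines yield two documents: the comment row and the header row produce none.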

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
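
As a sketch of what the escaped form looks like in practice, assume the API accepts a JSON body (an assumption; the request format is not described here). A JSON serializer then produces the doubled backslash automatically when the value is the two-character sequence backslash + t. The surrounding dictionary is a hypothetical fragment; only the property names come from the table below.

```python
import json

# Hypothetical fragment of a CSV parser stage configuration.
# "\\t" in Python source is two characters: a backslash followed by "t".
stage = {"type": "csv", "delimiter": "\\t"}

body = json.dumps(stage)
print(body)  # prints {"type": "csv", "delimiter": "\\t"}
```

The value typed in the UI (\t) and the value carried in the serialized body (\\t) therefore describe the same two-character setting.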

CSV stage-specific properties

Property Description, Type
autoDetect

Auto-detect CSV Format

Attempt to guess the delimiter, quote, quote escape, and comment characters

type: boolean

default value: 'true'

charset

Character Set

required

Example: "UTF-8"

type: string

default value: 'detect'

columnHandling

Column mismatch handling

What to do when a row has too many or too few columns: throw an error, align the columns, or do nothing special (default)

type: string

default value: 'default'

enum: { error align default }

comment

Comment character

Character at the start of a row that marks the row as a comment. Default is hash (#) if auto-detection is disabled

type: string

maxLength: 1

commentHandling

Comment Handling

How to handle comments: ignore them, add them as a field to the next document, or add them as separate documents. Default is ignore

type: string

default value: 'ignore'

enum: { ignore as_field as_document }
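
The three commentHandling modes can be pictured with the following Python sketch. It is an illustration under assumptions, not the parser's code: the output field names "comment" and "raw" are hypothetical, and the exact shape of the documents Fusion emits is not specified here.

```python
def apply_comment_handling(lines, mode="ignore", comment="#"):
    """Turn raw lines into documents according to a commentHandling mode."""
    docs, pending = [], []
    for line in lines:
        if line.startswith(comment):
            text = line[len(comment):].strip()
            if mode == "as_document":
                docs.append({"comment": text})   # comment becomes its own document
            elif mode == "as_field":
                pending.append(text)             # held for the next data row
            # mode == "ignore": the comment is dropped
            continue
        doc = {"raw": line}
        if pending:
            doc["comment"] = " ".join(pending)   # attach held comments to this row
            pending = []
        docs.append(doc)
    return docs

lines = ["# note", "1,Ada"]
ignored = apply_comment_handling(lines, "ignore")
as_field = apply_comment_handling(lines, "as_field")
as_document = apply_comment_handling(lines, "as_document")
```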

delimiter

Delimiter

Delimiter character between fields. Any single character, including an escaped character, is valid, e.g. , (comma), \t (tab), or | (pipe). Default is comma if auto-detection is disabled

type: string

minLength: 1

emptyValue

Empty string replacement

A string value to replace empty strings with, no default

type: string

enabled

Enable this Parser Stage

type: boolean

default value: 'true'

errorHandling

Error Handling

type: string

default value: 'mark'

enum: { ignore log fail mark }

fillValue

Column fill value

A string value to use when aligning the columns (when Column Mismatch Handling is "align")

type: string

default value: ''
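
To make the interaction between columnHandling and fillValue concrete, here is a minimal Python sketch. It assumes that "align" pads short rows with the fill value and truncates long rows to the expected width; the source only says the columns are aligned, so the truncation in particular is a guess. The function name handle_columns is hypothetical.

```python
def handle_columns(row, expected, mode="default", fill_value=""):
    """Apply a columnHandling mode to one parsed row (a list of strings)."""
    if len(row) == expected or mode == "default":
        return row                                   # nothing special
    if mode == "error":
        raise ValueError(f"expected {expected} columns, got {len(row)}")
    if mode == "align":
        # Assumption: pad with fill_value, then cut to the expected width.
        padded = row + [fill_value] * (expected - len(row))
        return padded[:expected]
    raise ValueError(f"unknown columnHandling mode: {mode}")

short_row = handle_columns(["a"], 3, mode="align", fill_value="N/A")
long_row = handle_columns(["a", "b", "c", "d"], 3, mode="align")
```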

hasHeaders

Headers in file

Treat the first row as column headers, default true

type: boolean

default value: 'true'

headers

Header list

List of column headers, overrides file headers if present

type: array of string

id

Parser ID

type: string

default value: '5636bd33-32a1-4900-a59e-97bcd0596335'

ignoreBOM

Ignore BOM

required

Ignore the Byte-Order Mark (BOM) if present and always use the configured character set. When set to false, a valid BOM character set overrides the configured default character set.

type: boolean

default value: 'false'
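
The effect of ignoreBOM on charset selection can be sketched in Python as follows. This is an illustration only: the function name decode is hypothetical, and the sketch assumes the BOM bytes are stripped in both cases, which the description above does not state.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode(data, charset="utf-8", ignore_bom=False):
    """Decode bytes, letting a BOM override charset unless ignore_bom is set."""
    for bom, bom_charset in _BOMS:
        if data.startswith(bom):
            payload = data[len(bom):]  # assumption: the BOM itself is dropped
            return payload.decode(charset if ignore_bom else bom_charset)
    return data.decode(charset)       # no BOM: configured charset applies

data = codecs.BOM_UTF8 + "é".encode("utf-8")
honoured = decode(data, charset="latin-1", ignore_bom=False)
ignored = decode(data, charset="latin-1", ignore_bom=True)
```

With the BOM honoured, the UTF-8 bytes decode correctly even though latin-1 is configured; with the BOM ignored, the configured latin-1 charset is applied and the text is mis-decoded.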

includeRowNumber

Include row number

Include the row number (line number) in the emitted documents, default true

type: boolean

default value: 'true'

inheritMediaTypes

Use default media types for this Parser Stage

Indicates if parser stage should use the default media types. Unchecking this box means that ONLY the manually configured media types will be parsed by the parser and you then MUST provide at least one media type.

type: boolean

default value: 'true'

lineSeparator

Line Separator

Line separator character

type: string

minLength: 1

maxColumnChars

Maximum number of characters per column

Maximum number of characters a single column value can have, default 10MB

type: integer

default value: '10485760'

exclusiveMaximum: false

exclusiveMinimum: false

maximum: 2147483647

minimum: 0

maxNumColumns

Maximum number of columns

Maximum number of columns to allow for a single row, default 1000

type: integer

default value: '1000'

exclusiveMaximum: false

exclusiveMinimum: false

maximum: 2147483647

minimum: 0

maxRowLength

Maximum line length

Maximum number of characters to allow for a single read line, default 10MB

type: integer

default value: '10485760'

exclusiveMaximum: false

exclusiveMinimum: false

maximum: 2147483647

minimum: 0

mediaTypes

Media Types for this Parser Stage

type: array of string

nullValue

Null value

A string value to replace nulls with, no default

type: string

outputFieldPrefix

Prefix parsed fields with

Fields extracted by this parser will be prefixed with this string. The remainder of the field name will be as detected in the stream

type: string

maxLength: 20

pattern: ^$|^[A-Za-z_][A-Za-z0-9_\-\.]+$
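
The maxLength and pattern constraints on outputFieldPrefix can be checked ahead of time with a few lines of Python. This is only a sketch: it reuses the pattern shown above with Python's re module, whose behavior may differ in corner cases from the engine Fusion uses, and the function name is_valid_prefix is hypothetical.

```python
import re

PREFIX_PATTERN = re.compile(r"^$|^[A-Za-z_][A-Za-z0-9_\-\.]+$")
MAX_LENGTH = 20

def is_valid_prefix(prefix):
    """Return True if prefix satisfies the documented length and pattern."""
    return len(prefix) <= MAX_LENGTH and PREFIX_PATTERN.fullmatch(prefix) is not None

checks = {p: is_valid_prefix(p) for p in ["", "csv_", "a", "1abc", "x" * 21]}
```

Note that, as written, the pattern accepts the empty string or a prefix of at least two characters; a single-character prefix such as "a" does not match because the second character class requires one or more characters.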

pathPatterns

File names to parse

Specify a file name or pattern that must be matched for this parser stage to run. Forward slashes ("/") are used to join names of files inside archives with the archive name.

type: array of object

object attributes: {
  pattern : {
    display name: File name or pattern
    type: string
    description : e.g.: "z.txt" or "*.md" or "/a/*/b/f.txt" for glob; "z.txt$" or ".*\.txt$" or "^/a/[^\/]*/b/f.txt$" for regex
    }
  syntax : {
    display name: Pattern type
    type: string
    default value: 'glob'
    description : glob uses bash shell-style wildcards; regex uses Java (PCRE-style) regex
    enum: { glob regex }
    }
  }
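
The difference between the two pattern syntaxes can be tried out with the Python sketch below. It is an approximation under stated assumptions: Python's fnmatch stands in for bash-style globbing (its * also matches "/", unlike a shell path glob), Python's re stands in for Java regex, and the regex is applied as a search rather than a full match, which is inferred from the anchored examples above rather than stated. The function name path_matches is hypothetical.

```python
import fnmatch
import re

def path_matches(path, pattern, syntax="glob"):
    """Return True if path matches pattern under the given syntax."""
    if syntax == "glob":
        return fnmatch.fnmatchcase(path, pattern)
    if syntax == "regex":
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown syntax: {syntax}")

glob_hit = path_matches("notes.md", "*.md")
glob_nested = path_matches("/a/x/b/f.txt", "/a/*/b/f.txt")
regex_hit = path_matches("archive.zip/docs/z.txt", r"z.txt$", syntax="regex")
regex_miss = path_matches("archive.zip/docs/z.txt.bak", r"z.txt$", syntax="regex")
```

The path "archive.zip/docs/z.txt" follows the convention described above, where a forward slash joins the archive name with the name of a file inside it.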
quote

Quote

Quote character, default is a double quote (") if auto-detection is disabled

type: string

maxLength: 1

quoteEscape

Quote escape

Quote escape character, default is a double quote (") if auto-detection is disabled

type: string

maxLength: 1
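
How quote and quoteEscape work together can be shown with Python's csv module, used here as a stand-in for the parser. When both are the double quote, a quote inside a quoted value is written twice; when quoteEscape is a different character, that character precedes the embedded quote. The function name parse_line is hypothetical, and the mapping onto Python's doublequote and escapechar options is an assumption about equivalent behavior.

```python
import csv

def parse_line(line, quote='"', quote_escape='"'):
    """Parse one CSV line using the given quote and quote escape characters."""
    if quote_escape == quote:
        # Embedded quotes are doubled: "She said ""hi"""
        reader = csv.reader([line], quotechar=quote, doublequote=True)
    else:
        # Embedded quotes are preceded by the escape character: "She said \"hi\""
        reader = csv.reader([line], quotechar=quote,
                            escapechar=quote_escape, doublequote=False)
    return next(reader)

doubled = parse_line('1,"She said ""hi"""')
escaped = parse_line('1,"She said \\"hi\\""', quote_escape="\\")
```

Both calls return the same two values, with the second value containing the embedded quotes.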

skipEmptyLines

Skip empty lines

Skip any empty lines encountered, default true

type: boolean

default value: 'true'

trimWhitespace

Trim whitespace

Trim off leading and trailing whitespace from columns, default true

type: boolean

default value: 'true'

type

required

type: string

default value: 'csv'

enum: { csv }
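
The enum, range, and length constraints listed in this table can be checked before a configuration is submitted. The following Python sketch does that for a subset of properties. The helper validate_stage is hypothetical and not part of Fusion; the property names, allowed values, and limits are copied from the table.

```python
ENUMS = {
    "columnHandling": {"error", "align", "default"},
    "commentHandling": {"ignore", "as_field", "as_document"},
    "errorHandling": {"ignore", "log", "fail", "mark"},
    "type": {"csv"},
}
INT_PROPS = ("maxColumnChars", "maxNumColumns", "maxRowLength")
INT_MIN, INT_MAX = 0, 2147483647          # inclusive bounds
SINGLE_CHAR_PROPS = ("comment", "quote", "quoteEscape")  # maxLength: 1

def validate_stage(stage):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    for key, allowed in ENUMS.items():
        if key in stage and stage[key] not in allowed:
            problems.append(f"{key}: {stage[key]!r} is not one of {sorted(allowed)}")
    for key in INT_PROPS:
        if key in stage and not (INT_MIN <= stage[key] <= INT_MAX):
            problems.append(f"{key}: {stage[key]} is out of range")
    for key in SINGLE_CHAR_PROPS:
        if key in stage and len(stage[key]) > 1:
            problems.append(f"{key}: must be at most one character")
    return problems

good = validate_stage({"type": "csv", "columnHandling": "align",
                       "maxNumColumns": 1000, "quote": '"'})
bad = validate_stage({"type": "csv", "commentHandling": "skip",
                      "maxRowLength": -1, "comment": "//"})
```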