Parsers

Parsers provide fine-grained configuration for inbound data. You configure parsers with stages, much like index pipelines and query pipelines. Parsers can include conditional parsing and nested parsing. You can configure them through the Fusion UI or the Parsers API.

Connectors receive the inbound data, convert it into a byte stream, and send the byte stream through a parser’s configured parsing stages. The stream moves through the parser stage by stage until it has been successfully parsed, and then it proceeds to the index pipeline.

Each parsing stage evaluates whether the inbound stream matches the stage’s default media type or filename extension. The first stage that finds a match can output one or both of the following:

  • Zero or more pipeline documents for consumption by the index pipeline

  • Zero or more new input streams for re-parsing

    This recursive approach is useful for containers (for example, zip and tar files). The output of the container parsing can be another container or a stream of uncompressed content that requires its own parsing.
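The flow described above can be sketched in a few lines of Python. This is an illustrative model only, not Fusion's implementation: the stage functions, the `STAGES` list, and `run_parser` are hypothetical names, matching is done on filename extension alone, and a depth limit stands in for the maximum recursion depth setting.

```python
import gzip
import io
import json
import zipfile


def parse_zip(data, name):
    """Container stage: emit one new input stream per archive member."""
    streams = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for member in archive.namelist():
            if not member.endswith("/"):  # skip directory entries
                streams.append((archive.read(member), member))
    return [], streams


def parse_gzip(data, name):
    """Container stage: emit the decompressed content as a new input stream."""
    return [], [(gzip.decompress(data), name[: -len(".gz")])]


def parse_json(data, name):
    """Content stage: emit one document and no new streams."""
    return [json.loads(data)], []


# Ordered stages: (filename extension to match, stage function).
STAGES = [(".zip", parse_zip), (".gz", parse_gzip), (".json", parse_json)]


def run_parser(data, name, depth=0, max_depth=16):
    """Send a stream through the stages; re-parse any streams a stage emits."""
    if depth > max_depth:
        return []  # recursion limit reached; stop descending
    for extension, stage in STAGES:
        if name.endswith(extension):
            documents, new_streams = stage(data, name)
            for inner_data, inner_name in new_streams:
                documents.extend(
                    run_parser(inner_data, inner_name, depth + 1, max_depth)
                )
            return documents
    return []  # no stage matched this stream
```

Calling `run_parser(zip_bytes, "bundle.zip")` on an archive that holds `a.json` and `b.json.gz` would unpack the archive, decompress the gzip member, and return one document per JSON file.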

A few static fields impact the overall parser configuration. They are accessible when you select the parser in the Index Workbench:

Field | Description
Document ID Source Field | Field in the source file that contains the document ID.
Maximum Parser Recursion Depth | Maximum number of times the parser may recurse over the file before proceeding to the next parser. This is useful for files with hierarchical structures (for example, zip and tar files).
Enable automatic media type detection | Whether to automatically detect the media type of the source files. If disabled, the parser uses the media type application/octet-stream.
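To illustrate the effect of the media type detection setting, here is a minimal Python sketch. It is an assumption-laden stand-in: it guesses from the filename with the standard `mimetypes` module, which is not necessarily how Fusion detects media types, and `media_type_for` and `auto_detect` are hypothetical names.

```python
import mimetypes

# Media type used when detection is disabled or finds nothing.
DEFAULT_MEDIA_TYPE = "application/octet-stream"


def media_type_for(filename, auto_detect=True):
    """Return a guessed media type for filename, or the generic default."""
    if not auto_detect:
        return DEFAULT_MEDIA_TYPE
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or DEFAULT_MEDIA_TYPE
```

With detection disabled, every stream carries the same generic media type, so stages that match on media type can no longer tell the streams apart.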

Built-in parser stages

These parser stages are available for configuration:

Datasources that use connectors that retrieve fixed-structure content (such as those for Twitter and Jira) have hard-coded parsers and do not expose any configurable parser options.

Configure parsers

When you configure a datasource, you can use the Index Workbench or the Parsers API to create a parser. A parser consists of an ordered list of parser stages, some global parser parameters, and the stage-specific parameters. You can reorder the stages by dragging them up or down in the Index Workbench.

You can add the same parser stage to a parser multiple times when different occurrences need different configuration options. Fusion also parses datasources with fixed-structure data, but with default settings that do not need to be customized.

There is no limit to the number of stages that can be included in a parser. The order in which they run is also completely flexible and can be linear or recursive. When the end of the parsing sequence is reached, a default parser stage automatically attempts to parse anything that has not yet been matched.

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
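The difference comes from JSON string escaping, which a short Python sketch can make concrete. The option name `delimiter` is hypothetical and used only for illustration.

```python
import json

# What you would type into the UI field: a backslash followed by "t".
ui_value = r"\t"

# "delimiter" is a hypothetical option name, not a documented parameter.
# Serializing to JSON doubles the backslash, giving the form the API expects.
body = json.dumps({"delimiter": ui_value})
```

Here `body` holds the text `{"delimiter": "\\t"}`, and decoding it with `json.loads` gives back the two-character value entered in the UI.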

Configure a parser in the Fusion UI

To configure parser stages using the Fusion UI:

  1. In the Fusion workspace, navigate to the Index Workbench.

  2. At the upper right of the Index Workbench panel, click Load.

  3. Under Load, click the name of the index pipeline.

  4. Click the parser to open its configuration.

  5. Click a specific stage to open its configuration panel.

Configure a parser in the REST API

The Parsers API provides a programmatic interface for viewing, creating, and modifying parsers, as well as sending documents directly to a parser.

  • To get all currently defined parsers: http://localhost:8764/api/parsers/

  • To get the parser schema: http://localhost:8764/api/parsers/_schema
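As a minimal sketch, the two endpoints above could be queried from Python with the standard library. The helper names `parsers_url` and `get_json` are hypothetical, and the sketch assumes a Fusion instance on localhost; authentication is omitted and would likely need to be added for a real deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8764/api/parsers"


def parsers_url(suffix=""):
    """Build a Parsers API URL, for example the list or the _schema endpoint."""
    return f"{BASE_URL}/{suffix}"


def get_json(url, timeout=10):
    """Send a GET request and decode the JSON response body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `get_json(parsers_url())` would request the list of parsers and `get_json(parsers_url("_schema"))` the schema.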

Here’s a very simple parser example for parsing JSON input. Any parser-specific options, such as prettify, follow the type:

{ "id": "simple-json",
  "type": "json",
  "prettify": false
}

The example below shows a parser that can parse JSON input, as well as JSON that is inside zip, tar, or gzip containers, or any combination (such as .tar.gz). The order of the stages begins with the outermost containers and ends with the innermost content.

{ "id": "default-json",
  "type": "composite",
  "parsers": [
    { "id" : "zip-parser",
      "type" : "zip" },
    { "type" : "gz" },
    { "type" : "tar" },
    { "id": "json-parser",
      "type": "json",
      "prettify": false
    }
  ]
}

The id field is optional, just as in pipeline stages. Many parser stages require no configuration other than type.
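The composite parser above can also be assembled programmatically, which makes it easy to check that the JSON is well formed before sending it. The following Python sketch builds the definition and prepares a request without sending it. Using POST against the parsers endpoint to create a parser is an assumption here, as is the helper name `build_create_request`; consult the parser schema endpoint for the accepted structure.

```python
import json
import urllib.request

# Stages run from the outermost container to the innermost content.
# Only "type" is required on each stage; "id" is optional.
parser = {
    "id": "default-json",
    "type": "composite",
    "parsers": [
        {"id": "zip-parser", "type": "zip"},
        {"type": "gz"},
        {"type": "tar"},
        {"id": "json-parser", "type": "json", "prettify": False},
    ],
}


def build_create_request(definition, url="http://localhost:8764/api/parsers/"):
    """Prepare (but do not send) a request carrying the parser definition."""
    body = json.dumps(definition).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed HTTP method for creating a parser
    )


request = build_create_request(parser)
```

Passing `request` to `urllib.request.urlopen` would send it; authentication is omitted from this sketch.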

Field parser index pipeline stage

The parsers themselves only parse whole documents. Parsing of content embedded in fields is performed separately by the Field Parser index pipeline stage. This stage identifies the field that requires parsing and the parser to use.