Fusion 5.12

    Parsers

    Configuration specifications

    Parsers provide fine-grained configuration for inbound data. You configure parsers with stages, much like index pipelines and query pipelines. Parsers can include conditional parsing and nested parsing. You can configure them through the Managed Fusion UI or the Parsers API.

    Connectors receive the inbound data, convert it into a byte stream, and send the byte stream to a parser’s configured parsing stages. The parser selects a parsing stage to handle the stream. That stage parses the data and produces documents that are sent to the index pipeline.

    Each parsing stage evaluates whether the inbound stream matches the stage’s default media types or filename extensions. The first stage that finds a match processes the data and can output one or both of the following:

    • Zero or more pipeline documents for consumption by the index pipeline

    • Zero or more new input streams for re-parsing

      This recursive approach is useful for containers (for example, zip and tar files). The output of the container parsing can be another container or a stream of uncompressed content that requires its own parsing.

    Only the first matching stage processes the stream. Later stages are not used, even if they would also match.
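    The first-match selection and re-parsing behavior described above can be sketched as follows. This is a minimal illustration, not Fusion's implementation: the Stream and Stage types, the handler signature, and the max_depth guard are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


# Hypothetical types for illustration only; these are not Fusion classes.
@dataclass
class Stream:
    name: str
    media_type: str
    payload: object


@dataclass
class Stage:
    name: str
    media_types: Tuple[str, ...]
    extensions: Tuple[str, ...]
    # A handler returns (documents, new_streams) for a matched stream.
    handler: Callable[[Stream], Tuple[List[dict], List[Stream]]]

    def matches(self, stream: Stream) -> bool:
        # Match on media type or on filename extension.
        return (stream.media_type in self.media_types
                or stream.name.endswith(self.extensions))


def parse(stream, stages, max_depth=16, depth=0):
    """Return the documents for one stream, re-parsing any new streams."""
    if depth > max_depth:
        return []  # assumed guard: stop descending into nested containers
    for stage in stages:
        if stage.matches(stream):
            docs, new_streams = stage.handler(stream)
            for child in new_streams:
                docs = docs + parse(child, stages, max_depth, depth + 1)
            return docs  # first match wins; later stages are never tried
    return []  # no stage matched


def unzip(stream):
    # Toy container: the payload is a list of (name, media_type, payload).
    return [], [Stream(*entry) for entry in stream.payload]


def read_json(stream):
    return [{"id": stream.name, "body": stream.payload}], []


stages = [
    Stage("zip", ("application/zip",), (".zip",), unzip),
    Stage("json", ("application/json",), (".json",), read_json),
]

archive = Stream("bundle.zip", "application/zip", [
    ("a.json", "application/json", {"k": 1}),
    ("b.json", "application/json", {"k": 2}),
])
docs = parse(archive, stages)
print([d["id"] for d in docs])
```

    In this toy model the zip stage emits no documents itself, only new streams, and each of those streams goes back through the same ordered stage list.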

    A few static fields impact the overall parser configuration. They are accessible when you select the parser in the Index Workbench:

    • Document ID Source Field: Field in the source file that contains the document ID.

    • Maximum Parser Recursion Depth: Maximum number of times the parser may recurse over the file before proceeding to the next parser. This is useful for files with hierarchical structures (for example, zip and tar files).

    • Enable automatic media type detection: Whether to automatically detect the media type of the source files. If disabled, the parser uses the media type application/octet-stream.
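    As a rough illustration of the media type setting, the sketch below returns application/octet-stream whenever detection is disabled. The function name and the use of Python's mimetypes module are assumptions made for this example. Fusion's own detection is not described here and may inspect file content rather than file names.

```python
import mimetypes

DEFAULT_MEDIA_TYPE = "application/octet-stream"


def resolve_media_type(filename: str, auto_detect: bool = True) -> str:
    """Pick a media type for a file name (hypothetical helper)."""
    if not auto_detect:
        # Detection disabled: always use the generic binary type.
        return DEFAULT_MEDIA_TYPE
    guessed, _encoding = mimetypes.guess_type(filename)
    # Falling back when nothing is recognized is an assumption of this sketch.
    return guessed or DEFAULT_MEDIA_TYPE


print(resolve_media_type("report.json", auto_detect=False))
```

    With detection disabled, every stream carries the same generic media type, so stage matching has to rely on filename extensions instead.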

    Built-in parser stages

    The parser stages found in the sidenav are available for configuration.

    Datasources that use connectors that retrieve fixed-structure content, such as those for Twitter and Jira, have hard-coded parsers and do not expose any configurable parser options.

    Configure parsers

    When you configure a datasource, you can use the Index Workbench or the Parsers API to create a parser. A parser consists of an ordered list of parser stages, some global parser parameters, and the stage-specific parameters. You can reorder the stages by dragging them up or down in the Index Workbench.

    You can add the same parser stage to a parser more than once when different instances need different configuration options. Datasources with fixed-structure data are also parsed by Managed Fusion, but with default settings that do not need to be customized.

    There is no limit to the number of stages that can be included in a parser, and the priority order of the stages is completely flexible. In a default parser configuration, a fallback parser is provided at the end of the parsing stage list to handle streams that no other stage matches. If present, this stage is selected and attempts to parse anything that has not yet been matched.

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
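    One reading of this rule is that the API takes JSON, where a literal backslash must itself be escaped. The sketch below shows a standard JSON encoder turning the two-character UI value \t into \\t. The "csv" type and the "delimiter" option are hypothetical names used only for illustration.

```python
import json

# What you would type in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# "csv" and "delimiter" are hypothetical, chosen only to show the escaping.
payload = json.dumps({"type": "csv", "delimiter": ui_value})
print(payload)

# Decoding the payload gives back the value as typed in the UI.
round_trip = json.loads(payload)["delimiter"]
```

    Building the request body with a JSON library rather than by hand avoids having to count backslashes manually.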

    Configure a parser in the Managed Fusion UI

    To configure parser stages using the Managed Fusion UI:

    1. In the Managed Fusion workspace, navigate to the Index Workbench.

    2. At the upper right of the Index Workbench panel, click Load.

    3. Under Load, click the name of the index pipeline.

    4. Click the parser to open its configuration.

    5. Click a specific stage to open its configuration panel.

    Configure a parser in the REST API

    The Parsers API and the async-parsing service each provide a programmatic interface for viewing, creating, and modifying parsers, as well as sending documents directly to a parser.

    The Parsers API is deprecated in Fusion 5.12.

    • To get all currently-defined parsers: https://EXAMPLE_COMPANY.lucidworks.cloud/async-parsing/parsers/

    • To get the parser schema: https://EXAMPLE_COMPANY.lucidworks.cloud/async-parsing/_schema/parsers

    Replace EXAMPLE_COMPANY with the name provided by your Lucidworks representative.
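    The two endpoints above could be called from a script along these lines. This is a sketch under assumptions: the helper names are hypothetical, and authentication is not covered in this section, so any required headers are left for the caller to supply.

```python
import json
import urllib.request
from typing import Optional

BASE = "https://{company}.lucidworks.cloud/async-parsing"


def parser_urls(company: str) -> dict:
    """Build the two endpoint URLs listed above for a given company name."""
    base = BASE.format(company=company)
    return {
        "parsers": f"{base}/parsers/",
        "schema": f"{base}/_schema/parsers",
    }


def fetch_json(url: str, headers: Optional[dict] = None):
    """GET a URL and decode its JSON body.

    Any credentials your deployment requires must be passed in `headers`;
    this sketch does not assume a particular authentication scheme.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


urls = parser_urls("EXAMPLE_COMPANY")
print(urls["parsers"])
print(urls["schema"])
```

    Calling fetch_json(urls["parsers"], headers=...) would then return the decoded response, provided the placeholder is replaced and valid credentials are supplied.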

    Here is a simple parser example for parsing JSON input. Parser-specific options, such as prettify in this example, follow the type:

    { "id": "simple-json",
      "type": "json",
      "prettify": false
    }

    The example below shows a parser that can parse JSON input, as well as JSON that is inside zip, tar, or gzip containers, or any combination (such as .tar.gz). The order of the stages begins with the outermost containers and ends with the innermost content.

    { "id": "default-json",
      "type": "composite",
      "parsers": [
        { "id" : "zip-parser",
          "type" : "zip" },
        { "type" : "gz" },
        { "type" : "tar" },
        { "id": "json-parser",
          "type": "json",
          "prettify": false
        }]
    }

    The id field is optional, just as in pipeline stages. Many parser stages require no configuration other than type.
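    If you generate parser definitions from a script, a small builder keeps the optional ID and the stage ordering explicit. The sketch below reproduces the default-json example above. The stage and composite helper functions are hypothetical and are not part of any Fusion client library.

```python
import json


def stage(parser_type, stage_id=None, **options):
    """Build one parser stage; only the type is required."""
    config = {}
    if stage_id is not None:
        config["id"] = stage_id
    config["type"] = parser_type
    config.update(options)  # stage-specific options, for example prettify
    return config


def composite(parser_id, stages):
    """Wrap an ordered list of stages in a composite parser definition."""
    return {"id": parser_id, "type": "composite", "parsers": list(stages)}


# Outermost containers first, innermost content last.
default_json = composite("default-json", [
    stage("zip", "zip-parser"),
    stage("gz"),
    stage("tar"),
    stage("json", "json-parser", prettify=False),
])
print(json.dumps(default_json, indent=2))
```

    The gz and tar stages carry nothing but a type, which matches the note that many stages need no further configuration.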

    Field parser index pipeline stage

    The parsers themselves only parse whole documents. Parsing of content embedded in fields is performed separately by the Field Parser index pipeline stage. This stage identifies the field that requires parsing and the parser to use.