Parsers

Parsers were introduced in Fusion 3.0 to provide more fine-grained configuration for inbound data. Parsers are configured in stages, much like index pipelines and query pipelines. They can include conditional parsing and nested parsing, and can be configured via the Fusion UI or the Parsers API.

Connectors receive the inbound data, convert it into a byte stream, and send the byte stream through the configured parsing stages. The stream moves through the parser stage by stage until it has been successfully parsed, then proceeds to the index pipeline.

Each parsing stage evaluates whether the inbound stream matches the stage’s default media type or filename extension. The first stage that finds a match can output one or both of the following:

  • Zero or more pipeline documents for consumption by the index pipeline

  • Zero or more new input streams for re-parsing

    This recursive approach is useful for containers (zip or tar files, for example). The output of the container parsing may be another container or a stream of uncompressed content which requires its own parsing.

There are a few static fields that impact the overall configuration and are accessible whenever you have selected the parser in the Index Workbench:

  • Document ID Source Field

  • Enable Automatic Media Type Detection

  • Maximum Recursion Depth

Built-in parsing stages

The parsing stages described below are available for configuration.

Note
Datasources which use connectors that retrieve fixed-structure content (like Twitter or Jira) have hard-coded parsers and do not expose any configurable parser options.

HTML parser stage

This parser stage processes the following HTML elements:

  • <title>

  • <body> (with tags removed)

  • <meta>

  • <a> and <link>

Additionally, you can configure JSoup selectors, which use CSS-style selector syntax, to extract specific elements from a document and map them to PipelineDocument fields. For example, you could use this to process navigational DIV elements one way, then process content-bearing DIV elements another way.
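
For example, given markup like the following (the class names are illustrative):

<div class="nav"> ... site navigation links ... </div>
<div class="article-body"> ... main page content ... </div>

the JSoup selector div.nav matches the navigation block and div.article-body matches the content block, and each selector can be mapped to a different PipelineDocument field.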

See HTML parser stage for configuration details.

Note
The HTML Transformation index pipeline stage is deprecated in favor of this parser stage.

XML parser stage

The XML parser stage parses whole XML documents by default, but it can also be configured to parse only specific nodes without loading the entire document into memory, or to split an XML document into multiple documents. XPath-like expressions select the nodes to parse, such as /posts/row or /posts/record. Nested XML elements are flattened.
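
For example, given this input:

<posts>
  <row id="1"><title>First post</title></row>
  <row id="2"><title>Second post</title></row>
</posts>

the expression /posts/row selects each row element as a separate document, with nested elements such as title flattened into fields of that document.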

CSV parser stage

This parser breaks incoming CSV files into individual documents for Fusion to index, producing one new document per row of the CSV input and excluding comment rows and header rows.
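
For example, with # configured as the comment character (the character is configurable), this input:

# product export
id,name,price
1,widget,9.99
2,gadget,19.99

produces two documents, one for each data row; the comment row and header row do not become documents. Typically the header row supplies the field names, so the first document would carry id=1, name=widget, and price=9.99.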

See CSV parser stage for configuration details.

JSON parser stage

JSON parsing converts JSON content from a single document field into one or more new documents. This parser uses Solr’s JsonRecordReader to split JSON into sub-documents.
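
For example, given this input and a split path of /posts (the option name for the split path depends on the stage configuration):

{ "posts": [
    { "title": "First post" },
    { "title": "Second post" }
  ]
}

JsonRecordReader emits two sub-documents, one per element of the posts array.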

See JSON parser stage for configuration details.

Text parser stage

The Plain Text parser can split a text file by lines or consume it into a single document.

Options for treatment of this filetype include:

  • Number of header rows to skip

  • Split on line end or not

  • Comment character

  • Skip empty lines

  • Charset
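
For example, with one header row skipped, # as the comment character, empty lines skipped, and splitting on line ends enabled, this input (contents illustrative):

REPORT EXPORT 2018-05-01
# generated nightly

first record
second record

yields two documents, one for each remaining content line.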

See Text parser stage for configuration details.

Archive parser stage

The Archive parser stage can parse the majority of common archive and compressed file formats. Archives are unpacked into their constituent documents, which can then be parsed further or sent straight to the index pipeline. The following archive formats are supported:

  • tar

  • zip

  • jar

  • 7z

  • ar

  • arj

  • Unix dump

  • cpio

See Archive parser stage for configuration details.

Apache Tika parser stage

Apache Tika is a versatile parser that supports many types of unstructured document formats, such as HTML, PDF, Microsoft Office documents, OpenOffice, RTF, audio, video, images, and more. A complete list of supported formats is available at http://tika.apache.org/.

See Apache Tika parser stage for configuration details.

Fallback parser stage

The Fallback parser stage is useful for processing data for which Fusion has no specified parsing process. Fallback does not technically parse data, since it does not know what to do with it; it simply copies the raw bytes into a Solr document. If your Fusion parser configuration encounters data it does not know how to parse, such as someone's proprietary data file format, it is copied as-is, whereas recognizable data in more common file types, such as PDFs, is parsed for text and metadata using Tika.

The Fallback parser acts as the final stage, attempting to handle any documents that no earlier stage has already parsed.

See Fallback parser stage for configuration details.

Configuring parsers

When you configure a datasource, you can use the Index Workbench or the Parsers API to create a parser. A parser consists of an ordered list of parser stages, some global parser parameters, and the stage-specific parameters. You can re-order the stages list by dragging them up or down in the Index Workbench.

Any parser stage can be added to the same parser multiple times when different instances need different configuration options. Datasources with fixed-structure data are also parsed by Fusion, but with default settings that do not need to be customized.

There is no limit to the number of stages that can be included in a parser. The order in which they run is also completely flexible and can be linear or recursive. When the end of the parsing sequence is reached, a default parser stage automatically attempts to parse anything that has not yet been matched.

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
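
For example, to specify a tab character for a hypothetical delimiter setting (the option name is illustrative), you would enter it in each context like this:

In the UI:   \t
In the API:  { "delimiter": "\\t" }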

Parser configuration in the Fusion UI

In the Search context, select a collection, then navigate to Home > Index Workbench and click the parser, usually called "datasource-name_parser". Clicking a specific stage opens its configuration panel.

Parser configuration in the REST API

The Parsers API provides a programmatic interface for viewing, creating, and modifying parsers, as well as sending documents directly to a parser.

Here’s a very simple parser example, for parsing JSON input. Properties other than id and type, such as prettify here, are arbitrary parser-specific options:

{ "id": "simple-json",
  "type": "json",
  "prettify": false
}

The example below shows a parser that can parse JSON input, as well as JSON that is inside zip, tar, or gzip containers, or any combination (such as .tar.gz). The order of the stages begins with the outermost containers and ends with the innermost content.

{ "id": "default-json",
  "type": "composite",
  "parsers": [
    { "id" : "zip-parser",
      "type" : "zip" },
    { "type" : "gz" },
    { "type" : "tar" },
    { "id": "json-parser",
      "type": "json",
      "prettify": false
    }
}

ID is optional, just as in pipeline stages. Many parser stages require no configuration other than type.
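
For example, you might create the simple-json parser shown above by POSTing it to the Parsers API. The host, port, and endpoint path below are assumptions for a Fusion 3.x deployment; check the API reference for your release:

curl -u admin:password -X POST -H 'Content-Type: application/json' \
  -d '{ "id": "simple-json", "type": "json", "prettify": false }' \
  http://localhost:8764/api/apollo/parsers

A GET request to the same endpoint lists the parsers that are currently defined.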

Parser index pipeline stage

The parsers themselves only parse whole documents. Parsing of content embedded in fields is performed separately by the Parser Index Pipeline Stage. This stage identifies the field or context that requires parsing, the appropriate parser to use, and what to do with the parsed content.