HTML Parser Stage

This parser stage processes the following HTML elements:

  • <title>

  • <body> (with tags removed)

  • <meta>

  • <a> and <link>

Additionally, you can configure JSoup selectors to extract specific HTML and CSS elements from a document and map them to PipelineDocument fields. For example, you could use this to process navigational DIV elements one way and content-bearing DIV elements another way.

HTML and CSS elements can be selected for extraction into new documents or fields:

  • To create new documents from selected elements, configure recordSelector.

  • To create new fields from selected elements, configure mappings.

Both of these parameters (recordSelector and mappings) support JSoup selectors, which provide a rich syntax for selecting HTML and CSS elements. Title, body, metadata, and links are populated only in the parent document.
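As a rough sketch of how the two parameters might sit together in a stage definition, the snippet below builds a configuration object and prints it as JSON. The property names recordSelector, mappings, and keepParent come from this page; the keys inside each mapping ("selector", "field"), the field names, and the selectors themselves are assumptions made for illustration and may not match the actual schema.

```python
import json

# Hypothetical stage configuration. Only recordSelector, mappings, and
# keepParent are named on this page; the keys inside each mapping and the
# target field names are illustrative assumptions.
html_stage = {
    "recordSelector": "div.article",  # one child document per matching element
    "keepParent": True,               # keep the parent alongside the children
    "mappings": [
        {"selector": "nav a", "field": "nav_links"},        # navigational links
        {"selector": "div.content p", "field": "content"},  # main content text
    ],
}

print(json.dumps(html_stage, indent=2))
```

The selectors shown (div.article, nav a, div.content p) use ordinary CSS selector syntax, which JSoup selectors accept.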

Note
The HTML Transformation index pipeline stage is deprecated in favor of this parser stage.

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
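One way to read this tip: the value typed in the UI is a backslash followed by t, and when that same value travels in a JSON request body the backslash itself must be escaped. The sketch below shows this with Python's json module; the property name "delimiter" is hypothetical and used only to have something to serialize.

```python
import json

# The value as typed in the UI: a backslash followed by "t" (two characters).
ui_value = "\\t"

# Serializing into a JSON body escapes the backslash itself.
# "delimiter" is a hypothetical property name used for illustration.
api_body = json.dumps({"delimiter": ui_value})

print(api_body)  # {"delimiter": "\\t"}

# Decoding the JSON body recovers the same two-character value.
print(json.loads(api_body)["delimiter"] == ui_value)  # True
```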

Global configuration

These configuration options apply to the parser as a whole.

Property Description

idField

A document field to use as the document ID.

enableMediaTypeDetection

Automatically detect the Content-Type of each document; disable this to use application/octet-stream.

maxParserDepth

Maximum number of times a stage may recurse over any document before proceeding to the next stage.

HTML parser stage configuration

Property Description

mediaTypes

An array of types for this parser, which must match the pattern: ^\\/[^\\/]$

pathPatterns

Specify a file name or pattern that must be matched for this parser to run. Forward slashes (/) are used to join names of files inside archives with the archive name.

syntax - One of "glob" or "regex".

pattern - The filename or pattern to match.

Glob examples: z.txt or *.md or /a/*/b/f.txt

Regex examples: z.txt$ or .*\.txt$ or ^/a/[^\/]*/b/f.txt$
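To get a feel for how the example patterns behave, the following sketch checks a few paths against them. It is not the parser's own matching code. It assumes that a glob must match the whole path, that * in a glob does not cross a forward slash, and that regex patterns are applied with search (unanchored) semantics; the page does not confirm any of these, and the helper names are hypothetical.

```python
import re

def glob_to_regex(glob: str) -> str:
    """Translate a simple glob to a regex; '*' is assumed not to cross '/'."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append("[^/]*")
        elif ch == "?":
            parts.append("[^/]")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

def matches(syntax: str, pattern: str, path: str) -> bool:
    """Return True if path matches pattern under the given syntax."""
    if syntax == "glob":
        return re.search(glob_to_regex(pattern), path) is not None
    if syntax == "regex":
        return re.search(pattern, path) is not None
    raise ValueError(f"unknown syntax: {syntax}")

print(matches("glob", "/a/*/b/f.txt", "/a/x/b/f.txt"))           # True
print(matches("glob", "/a/*/b/f.txt", "/a/x/y/b/f.txt"))         # False
print(matches("regex", r".*\.txt$", "archive.zip/z.txt"))        # True
print(matches("regex", r"^/a/[^\/]*/b/f.txt$", "/a/x/b/f.txt"))  # True
```

The third call uses a path in which a forward slash joins an archive name with the name of a file inside it, as described above.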

inheritMediaTypes

"True" to inherit acceptable types from the parser.

errorHandling

One of the following:

ignore - Ignore errors, drop the current record, and continue parsing the next record or document.

log - Log errors, drop the current record, and continue parsing the next record or document.

fail - Generate an exception and stop parsing.

mark (default) - Create a marker document that is emitted instead of the bad record. The marker document contains the common metadata gathered so far, plus the error message and error class. The parser may also add more details about the error condition.
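The difference between the four modes can be illustrated with a toy model. This is only a sketch of the behavior described above, not the parser's implementation; the function names and the marker document's field names (error_message, error_class) are hypothetical.

```python
from typing import Callable, Iterable, List

MODES = {"ignore", "log", "fail", "mark"}

def parse_records(records: Iterable[str],
                  parse: Callable[[str], dict],
                  error_handling: str = "mark") -> List[dict]:
    """Toy model of the four errorHandling modes."""
    if error_handling not in MODES:
        raise ValueError(f"unknown errorHandling: {error_handling}")
    out = []
    for raw in records:
        try:
            out.append(parse(raw))
        except Exception as exc:
            if error_handling == "fail":
                raise  # propagate the exception and stop parsing
            if error_handling == "log":
                print(f"parse error, record dropped: {exc}")
            elif error_handling == "mark":
                # Emit a marker document in place of the bad record.
                out.append({"error_message": str(exc),
                            "error_class": type(exc).__name__})
            # "ignore": drop the record silently and continue
    return out

def to_doc(raw: str) -> dict:
    """Stand-in parser that rejects anything not starting with '<'."""
    if not raw.startswith("<"):
        raise ValueError(f"not markup: {raw!r}")
    return {"body": raw}

records = ["<p>ok</p>", "garbage", "<p>also ok</p>"]
print(len(parse_records(records, to_doc, "ignore")))  # 2
print(len(parse_records(records, to_doc, "mark")))    # 3
```

With ignore or log the bad record disappears from the output; with mark it is replaced by a marker document, so the output keeps one entry per input record.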

charset

Required. The default is detect, which auto-detects the character set.

recordSelector

Create child documents from each HTML file, using JSoup selectors. Only one selector may be configured at once, but a selector may be complex.

mappings

Extract parts of the document into new fields, using JSoup selectors.

If mappings is configured and recordSelector is not, then additional metadata (if configured; see below) is populated only in the parent document.

keepParent

"True" to keep the parent document after child documents are created. This property has no effect if recordSelector is not specified.

extractHtmlLinks

Collect links explicitly declared in document structure (for example, using HTML tags, bookmarks, and so on); default true.

extractBodyText

Extract the content of all elements in the <body> as a single text field.