HTML Transformation Index Stage

The HTML Transformation stage (called the HTML Transform stage in versions earlier than 3.0) processes HTML using a set of explicit mapping rules. It is usually used in tandem with an Apache Tika Parser stage, and it provides custom processing of HTML content in place of the Tika defaults. The stage uses the JSoup library, which has a rich CSS-style selector syntax for selecting HTML elements. JSoup selector patterns map HTML elements to PipelineDocument fields. For example, you could process navigational div elements one way and content-bearing div elements another way.
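As a rough illustration of what mapping elements to fields means, the following Python sketch collects the text of every a element into a list-valued field. It deliberately uses Python's standard html.parser module rather than JSoup, so it is only an analogy for the stage's behavior. The class name AnchorTextCollector, the sample HTML, and the links_txt field name are assumptions made for this example.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects the text content of each <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # > 0 while the parser is inside an <a> element
        self._buffer = []      # text fragments of the <a> currently open
        self.links_txt = []    # one entry per <a>, like a multivalued field

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1
            if self._depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.links_txt.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Sample input, invented for this sketch.
sample_html = (
    '<div id="main-content">'
    '<a href="/a">First</a> and <a href="/b">Second <b>link</b></a>'
    "</div>"
)

collector = AnchorTextCollector()
collector.feed(sample_html)
collector.close()

# Each matched element contributes one value to the target field.
document = {"links_txt": collector.links_txt}
print(document)
```

In the stage itself, the selector and the target field are supplied as configuration rather than written as code.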

The HTML Transformation stage can also create multiple child records from fragments of an HTML document and relate them back to the parent record through a parent ID field. The Additional Metadata property lets you add extra fields.

Required Pipeline Stages and Configuration

The pipeline must have a Tika Parser stage before the HTML Transformation stage. The Tika Parser must be configured as follows:

  • UI checkbox "Add original document content" / REST API property "addOriginalContent" set to false

  • UI checkbox "Return parsed content as XML or HTML" / REST API property "keepOriginalStructure" set to true

  • UI checkbox "Return original XML and HTML instead of Tika XML output" / REST API property "returnXml" set to true
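The three settings above can be checked mechanically before a pipeline is deployed. The Python sketch below compares a Tika Parser stage definition, represented as a plain dictionary, against the required values. The property names come from the list above; the function name find_tika_misconfigurations and the sample dictionary are hypothetical, and the sketch does not call any Fusion API.

```python
# Required REST API property values, as listed above.
REQUIRED_TIKA_SETTINGS = {
    "addOriginalContent": False,
    "keepOriginalStructure": True,
    "returnXml": True,
}


def find_tika_misconfigurations(stage_config):
    """Return {property: (expected, actual)} for every setting that differs.

    A property that is absent from stage_config is reported with actual=None.
    """
    problems = {}
    for name, expected in REQUIRED_TIKA_SETTINGS.items():
        actual = stage_config.get(name)
        if actual is not expected:
            problems[name] = (expected, actual)
    return problems


# Hypothetical stage definition with one wrong and one missing setting.
tika_stage = {"addOriginalContent": True, "keepOriginalStructure": True}
problems = find_tika_misconfigurations(tika_stage)
for name, (expected, actual) in problems.items():
    print(f"{name}: expected {expected}, found {actual}")
```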

For some versions of Fusion you may need to add a Field Mapping stage after the HTML Transformation stage to remove the following fields from the document:

  • _raw-content_

  • Content-Type

  • Content-Length

  • parsing

  • parsing_time
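To make the effect of that cleanup concrete, the Python sketch below removes the listed fields from a document represented as a dictionary. In a real pipeline the Field Mapping stage does this work; the function name drop_fields and the sample document are assumptions made purely for illustration.

```python
# Field names copied from the list above.
FIELDS_TO_REMOVE = (
    "_raw-content_",
    "Content-Type",
    "Content-Length",
    "parsing",
    "parsing_time",
)


def drop_fields(document, names=FIELDS_TO_REMOVE):
    """Return a copy of document without the named fields."""
    return {key: value for key, value in document.items() if key not in names}


# Hypothetical document as it might look after the HTML Transformation stage.
doc = {
    "id": "page-1",
    "body": "Hello",
    "Content-Type": "text/html",
    "parsing_time": 12,
}
cleaned = drop_fields(doc)
print(cleaned)
```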

HTML Stage Configuration Example

Definition of an HTML Transformation stage that selects records with #main-content and maps div elements and a (link) elements to text fields:

{
  "type": "html-transform",
  "recordSelector": "#main-content",
  "parentIdField": "page_s",
  "bodyField": "body",
  "mappings": [
    {
      "selectRule": "div",
      "attribute": "",
      "field": "main-content_txt",
      "multivalue": true
    },
    {
      "selectRule": "a",
      "attribute": "text",
      "field": "links_txt",
      "multivalue": true
    }
  ],
  "keepParent": false,
  "skip": false,
  "label": "html-main-content"
}
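A definition like the one above can be sanity-checked locally before it is sent to Fusion. The Python sketch below parses a shortened copy of the example and reports mappings that lack a selector or a target field. The helper name check_mappings is hypothetical, and the rule that every mapping needs a non-empty selectRule and field is an assumption inferred from the example, not a documented schema.

```python
import json

# Shortened copy of the example stage definition.
stage_json = """
{
  "type": "html-transform",
  "recordSelector": "#main-content",
  "mappings": [
    {"selectRule": "div", "attribute": "", "field": "main-content_txt", "multivalue": true},
    {"selectRule": "a", "attribute": "text", "field": "links_txt", "multivalue": true}
  ]
}
"""


def check_mappings(stage):
    """Return a list of human-readable problems found in stage['mappings']."""
    errors = []
    for index, mapping in enumerate(stage.get("mappings", [])):
        for key in ("selectRule", "field"):
            if not mapping.get(key):
                errors.append(f"mappings[{index}] is missing '{key}'")
    return errors


stage = json.loads(stage_json)
errors = check_mappings(stage)
print(errors if errors else "all mappings have a selector and a field")
```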

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
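The difference between the two forms comes from JSON string escaping, which the following Python sketch demonstrates with the standard json module. The property name delimiter is a placeholder chosen for this example, not a documented stage property.

```python
import json

# The value as it would be typed into the UI: two characters, backslash then t.
ui_value = r"\t"

# Serializing to JSON for an API request doubles the backslash.
api_body = json.dumps({"delimiter": ui_value})
print(api_body)  # the JSON text contains \\t

# Parsing the JSON restores the original two-character value.
round_trip = json.loads(api_body)["delimiter"]
print(round_trip == ui_value)
```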