Pipeline Configuration
The default XML processing provided by the Apache Tika Parser index stage extracts all text from an XML document into a single document field called content.
This not only flattens the document contents, it also loses all information about the containing elements in the document.
To process XML documents using an XML Transformation stage, the index pipeline must have as its
initial processing stage an Apache Tika Parser index stage which is configured to pass the
document through to the XML Transformation stage as raw XML, via the following configuration:
- UI checkbox “Add original document content” unchecked / REST API property “addOriginalContent” set to false
- UI checkbox “Return parsed content as XML or HTML” checked / REST API property “keepOriginalStructure” set to true
- UI checkbox “Return original XML and HTML instead of Tika XML output” checked / REST API property “returnXml” set to true
body
.
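As a rough illustration, the three REST API properties listed above can be written as a JSON fragment and checked before a pipeline is saved. This is a minimal sketch: only the property names and values come from the list above, while the helper name find_setting_problems and the idea of validating a stage definition locally are assumptions made for the example. The rest of the stage definition is deliberately left out.

```python
import json

# Property names and values are taken from the list above.
REQUIRED_TIKA_SETTINGS = {
    "addOriginalContent": False,
    "keepOriginalStructure": True,
    "returnXml": True,
}


def find_setting_problems(stage_config):
    """Return one message per property that differs from the required value.

    Hypothetical helper: it only inspects a plain dict and does not call Fusion.
    """
    problems = []
    for name, expected in REQUIRED_TIKA_SETTINGS.items():
        actual = stage_config.get(name)
        if actual is not expected:
            problems.append(
                f"{name} should be {json.dumps(expected)}, found {json.dumps(actual)}"
            )
    return problems


# The fragment as it would appear in a JSON request body.
print(json.dumps(REQUIRED_TIKA_SETTINGS, indent=2))

# A configuration with one wrong value and one missing property.
print(find_setting_problems({"addOriginalContent": True, "returnXml": True}))
```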
The pipeline must have a Field Mapping stage after the XML Transformation stage, before the Solr Indexer stage. The Field Mapping stage is used to remove the following fields from the document:
- raw-content
- Content-Type
- Content-Length
- parsing
- parsing_time
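The effect of that Field Mapping stage can be pictured as dropping a fixed set of keys from each document. The sketch below models a pipeline document as a plain Python dict, which is an assumption made for illustration; the function name drop_tika_fields and the sample document are hypothetical, and only the five field names come from the list above.

```python
# Field names taken from the list above.
FIELDS_TO_REMOVE = {
    "raw-content",
    "Content-Type",
    "Content-Length",
    "parsing",
    "parsing_time",
}


def drop_tika_fields(doc):
    """Return a copy of doc without the fields listed in FIELDS_TO_REMOVE."""
    return {name: value for name, value in doc.items() if name not in FIELDS_TO_REMOVE}


# Hypothetical document as it might look after the XML Transformation stage.
sample_doc = {
    "id": "doc-1",
    "title": "First title",
    "raw-content": "<xml>...</xml>",
    "Content-Type": "application/xml",
    "parsing_time": 12,
}

print(drop_tika_fields(sample_doc))
```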
XML Transforms
The XML Transformation stage uses a Solr XPathRecordReader, a streaming XML parser that supports only a limited subset of XPath selectors. It provides exact matching on element attributes, and it can extract only element text, not attribute values. Examples of allowed XPath specifications, where “a”, “b”, and “c” are any element tags and “attrName” is any attribute name:
When specifying the list of mappings, the xpath attribute of each mapping must include the full path; that is, it must begin with the rootXPath. See the example configuration below.
Example Stage Specification
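To make the full-path rule concrete, the sketch below applies a rootXPath and a list of mappings to a small XML sample. It uses Python's xml.etree.ElementTree rather than the Solr XPathRecordReader, so it only imitates the behavior described above: exact attribute matching and extraction of element text. The sample XML, the output field names, the "field" key, and the function extract_records are all hypothetical; only rootXPath, mappings, and xpath are names from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely modeled on a MEDLINE citation set.
SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>First title</ArticleTitle>
      <Abstract>
        <AbstractText Label="METHODS">Method text</AbstractText>
        <AbstractText Label="RESULTS">Result text</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>2</PMID>
    <Article><ArticleTitle>Second title</ArticleTitle></Article>
  </MedlineCitation>
</MedlineCitationSet>"""

ROOT_XPATH = "/MedlineCitationSet/MedlineCitation"

# Every xpath repeats the rootXPath in full, as the text requires.
MAPPINGS = [
    {"xpath": ROOT_XPATH + "/PMID", "field": "pmid"},
    {"xpath": ROOT_XPATH + "/Article/ArticleTitle", "field": "title"},
    {
        "xpath": ROOT_XPATH + "/Article/Abstract/AbstractText[@Label='RESULTS']",
        "field": "results",
    },
]


def extract_records(xml_text, root_xpath, mappings):
    """Return one dict per element matched by root_xpath."""
    root = ET.fromstring(xml_text)
    # ElementTree paths are relative to the root element, so strip "/<rootTag>".
    prefix = "/" + root.tag
    if root_xpath != prefix and not root_xpath.startswith(prefix + "/"):
        raise ValueError("rootXPath does not start at the document root")
    record_path = "." + root_xpath[len(prefix):]

    records = []
    for record in root.findall(record_path):
        doc = {}
        for mapping in mappings:
            xpath = mapping["xpath"]
            if not xpath.startswith(root_xpath + "/"):
                raise ValueError(f"mapping xpath must include the rootXPath: {xpath}")
            relative = "." + xpath[len(root_xpath):]
            # Collect element text only; attribute values are never extracted.
            values = [el.text for el in record.findall(relative) if el.text]
            if values:
                doc[mapping["field"]] = values
        records.append(doc)
    return records


for doc in extract_records(SAMPLE_XML, ROOT_XPATH, MAPPINGS):
    print(doc)
```

A mapping whose xpath omits the rootXPath, such as "/Article/ArticleTitle", is rejected by this sketch with a ValueError. How the real stage treats such a mapping is not stated in the text.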
Definition of an XML-Transformation stage that extracts elements from a MEDLINE/Pubmed article abstract:
Configuration
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
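One way to see why the two forms differ, assuming the API request body is JSON: a backslash must itself be escaped inside a JSON string, so the two characters typed in the UI become three characters on the wire. The sketch below uses Python's json module to show this; the property name "delimiter" is hypothetical and stands in for any configuration value.

```python
import json

# What would be typed into the UI: two characters, a backslash followed by "t".
ui_value = r"\t"

# Serializing to JSON escapes the backslash, giving the form used with the API.
api_body = json.dumps({"delimiter": ui_value})  # "delimiter" is a hypothetical property
print(api_body)

# Parsing the JSON body recovers exactly the value typed in the UI.
round_tripped = json.loads(api_body)["delimiter"]
print(round_tripped == ui_value)
```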