    Fusion 5.9

    XML Transformation Index Stage

    The XML Transformation stage (previously called the XML Transform Stage) allows you to process an XML document into one or more Solr documents and to specify mappings between elements and document fields. A common use case for an XML Transformation stage in a pipeline is when the XML document is a container-like document which contains a set of inner elements, each of which should be treated as a separate document. A parent ID field can be used to relate these multiple documents back to the containing document.
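
    The container use case can be pictured with a small Python sketch. This is not Fusion's implementation: it uses the standard library's ElementTree, and the sample XML, the field names, the parent_id_s field, and the "#index" ID scheme are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical container document; element names are made up for illustration.
CONTAINER_XML = """
<records>
  <record><title>First</title><note>a</note><note>b</note></record>
  <record><title>Second</title><note>c</note></record>
</records>
"""


def split_container(xml_text, parent_id, record_tag, field_map,
                    parent_id_field="parent_id_s"):
    """Return one dict per <record_tag> child of the root element.

    field_map maps a child tag name to an output field name.
    The ID scheme (parent ID plus "#index") is an assumption of this sketch.
    """
    root = ET.fromstring(xml_text)
    docs = []
    for index, record in enumerate(root.findall(record_tag)):
        # Every new document carries the ID of the containing document.
        doc = {"id": f"{parent_id}#{index}", parent_id_field: parent_id}
        for tag, field in field_map.items():
            values = [(el.text or "").strip() for el in record.findall(tag)]
            if values:
                # Keep a list only when there is more than one match.
                doc[field] = values if len(values) > 1 else values[0]
        docs.append(doc)
    return docs


docs = split_container(CONTAINER_XML, "container-1", "record",
                       {"title": "title_t", "note": "note_txt"})
for doc in docs:
    print(doc)
```

    Each inner record becomes its own document, and the shared parent ID field is what lets a later query group the pieces back under the containing document.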

    Pipeline Configuration

    The default XML processing provided by the Apache Tika Parser index stage extracts all text from an XML document into a single document field called content. This flattens the document contents and loses all information about the containing elements. To process XML documents with an XML Transformation stage, the index pipeline must begin with an Apache Tika Parser index stage that is configured to pass the document through to the XML Transformation stage as raw XML, using the following configuration:

    • UI checkbox "Add original document content" unchecked / REST API property "addOriginalContent" set to false

    • UI checkbox "Return parsed content as XML or HTML" checked / REST API property "keepOriginalStructure" set to true

    • UI checkbox "Return original XML and HTML instead of Tika XML output" checked / REST API property "returnXml" set to true

    With this configuration, the Tika parser stage decodes the raw input stream of bytes into a string containing the entire XML document, which is returned in the PipelineDocument field body.

    The pipeline must have a Field Mapping stage after the XML Transformation stage, before the Solr Indexer stage. The Field Mapping stage is used to remove the following fields from the document:

    • raw-content

    • Content-Type

    • Content-Length

    • parsing

    • parsing_time
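
    The ordering requirements above can be checked mechanically before a pipeline is saved. The following Python sketch is only an illustration: check_stage_order is a hypothetical helper, not part of Fusion, and it accepts both stage type spellings that appear in the examples on this page ("xml-transform" and "xml-transformation").

```python
def check_stage_order(stage_types):
    """Return a list of problems found in an ordered list of stage type names."""
    xml_types = {"xml-transform", "xml-transformation"}
    xml_positions = [i for i, t in enumerate(stage_types) if t in xml_types]
    if not xml_positions:
        return ["no XML Transformation stage found"]
    xml_at = xml_positions[0]

    problems = []
    # The Tika parser must be the initial processing stage.
    if stage_types[0] != "tika-parser":
        problems.append("first stage should be tika-parser")

    # A field-mapping stage must sit after the XML stage and before solr-index.
    try:
        solr_at = stage_types.index("solr-index")
    except ValueError:
        solr_at = len(stage_types)
    if "field-mapping" not in stage_types[xml_at + 1:solr_at]:
        problems.append(
            "no field-mapping stage between XML Transformation and solr-index")
    return problems


good = ["tika-parser", "xml-transformation", "field-mapping", "solr-index"]
bad = ["tika-parser", "xml-transformation", "solr-index"]
print(check_stage_order(good))
print(check_stage_order(bad))
```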

    XML Transforms

    The XML Transformation stage uses a Solr XPathRecordReader which is a streaming XML parser that supports only a limited subset of XPath selectors. It provides exact matching on element attributes and it can only extract the element text, not attribute values.

    Examples of allowed XPath specifications where "a", "b", "c" are any element tags, likewise "attrName" is any attribute name:

    /a/b/c
    /a/b/c[@attrName='someValue']
    /a/b/c[@attrName=]/d
    /a/b/c/@attrName
    //b//...

    For each mapping in the list, the xpath attribute must contain the full path, that is, it must begin with the rootXPath. See the example configuration below.
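
    A quick way to catch mappings that omit the root path is a prefix check over the stage definition. The sketch below is a hypothetical helper, not a Fusion API; it does a plain string comparison and assumes the stage is available as a Python dict with the rootXPath and mappings keys used on this page.

```python
def mappings_outside_root(stage):
    """Return the mapping XPaths that do not start with the stage's rootXPath."""
    root = stage["rootXPath"].rstrip("/")
    return [
        m["xpath"]
        for m in stage.get("mappings", [])
        if not (m["xpath"] == root or m["xpath"].startswith(root + "/"))
    ]


stage = {
    "rootXPath": "/ROOTS/ROOT",
    "mappings": [
        {"xpath": "/ROOTS/ROOT/element", "field": "element-field_t"},
        # Relative path: missing the rootXPath prefix, so it is reported.
        {"xpath": "element/child", "field": "child_t"},
    ],
}
print(mappings_outside_root(stage))
```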

    Example Stage Specification

    Definition of an XML-Transformation stage that extracts elements from a MEDLINE/Pubmed article abstract:

    { "type" : "xml-transform",
      "id" : "n0j2a9k9",
      "rootXPath" : "/MedlineCitationSet/MedlineCitation",
      "bodyField" : "body",
      "mappings" : [ {
          "xpath" : "/MedlineCitationSet/MedlineCitation/Article/ArticleTitle",
          "field" : "article-title_txt",
          "multivalue" : false
      }, {
          "xpath" : "/MedlineCitationSet/MedlineCitation/Article/Abstract/AbstractText",
          "field" : "article-abstract_txt",
          "multivalue" : true
      }, {
          "xpath" : "/MedlineCitationSet/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName",
          "field" : "mesh-heading_txt",
          "multivalue" : true
      }, {
          "xpath" : "/MedlineCitationSet/MedlineCitation/PMID",
          "field" : "pmid_txt",
          "multivalue" : false
      } ],
      "keepParent" : false,
      "skip" : false,
      "label" : "medline_xml_transform"
    }

    Template for a minimal index pipeline that includes an XML-Transformation stage. Replace the XPath and field names in the XML-Transformation stage according to your data.

    {
        "id" : "xml-pipeline-default",
        "stages" : [ {
            "type" : "tika-parser",
            "includeImages" : false,
            "flattenCompound" : false,
            "addFailedDocs" : false,
            "addOriginalContent" : false,
            "contentField" : "_raw_content_",
            "returnXml" : true,
            "keepOriginalStructure" : true,
            "extractHtmlLinks" : false,
            "extractOtherLinks" : false,
            "csvParsing" : false,
            "skip" : false,
            "label" : "tika",
            "sourceField" : "_raw_content_"
        }, {
            "type" : "xml-transformation",
            "rootXPath" : "/ROOTS/ROOT",
            "bodyField" : "body",
            "mappings" : [ {
                "xpath" : "/ROOTS/ROOT/element",
                "field" : "element-field_t",
                "multivalue" : false
            } ],
            "keepParent" : false,
            "skip" : false,
            "label" : "xml"
        }, {
            "type" : "field-mapping",
            "mappings" : [ {
                "source" : "parsing",
                "operation" : "delete"
            }, {
                "source" : "parsing_time",
                "operation" : "delete"
            }, {
                "source" : "Content-Type",
                "operation" : "delete"
            }, {
                "source" : "Content-Length",
                "operation" : "delete"
            } ],
            "skip" : false,
            "label" : "field mapping"
        }, {
            "type" : "solr-index",
            "enforceSchema" : true,
            "bufferDocsForSolr" : false,
            "skip" : false,
            "label" : "solr-index"
        } ]
    }

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
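
    One reading of this rule is that the API takes JSON, where a backslash inside a string must itself be escaped. The Python sketch below illustrates that reading with the standard json module; the "delimiter" property name is hypothetical and is not a setting of this stage.

```python
import json

# The value as typed in the UI: two characters, a backslash followed by "t".
ui_value = r"\t"

# Serializing to JSON escapes the backslash, giving \\t in the request body.
api_json = json.dumps({"delimiter": ui_value})
print(api_json)  # {"delimiter": "\\t"}

# Parsing the JSON again yields the same two-character value.
round_tripped = json.loads(api_json)["delimiter"]
print(round_tripped == ui_value)  # True
```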

    This stage transforms XML contained in a given field on the input document to a new pipeline document with the extracted fields that match the XPath query. Both the field and the value mapping rules may contain XPath expressions. If both the field and the value contain more than one XPath expression, the stage assumes they are paired.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must evaluate to true or false. The result determines whether the stage processes the document.

    rootXPath - string, required

    All relative XPath mappings will be made relative to this path.

    splitOnRoot - boolean

    If true and there is more than one match for the root XPath, each root match creates a new document. Defaults to true for backward compatibility.

    Default: true

    parentIdField - string

    Add the parent document's ID onto the new document under this field name.

    bodyField - string

    The field containing the XML document to process.

    Default: body

    outputXMLFragments - boolean

    If true, then XPath matches that result in a node selection will output the whole node as a String. If false, just the text content of the node will be output.

    Default: false
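
    The difference between the two output modes can be illustrated with Python's ElementTree. This is only an analogy for what the option describes: the sample element is made up, and the exact serialization Fusion produces for a fragment may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical matched node containing nested markup.
node = ET.fromstring(
    "<AbstractText>Results were <b>significant</b>.</AbstractText>")

# Analogous to outputXMLFragments = true: the whole node as a string.
as_fragment = ET.tostring(node, encoding="unicode")

# Analogous to outputXMLFragments = false: only the text content.
as_text = "".join(node.itertext())

print(as_fragment)
print(as_text)
```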

    mappings - array[object]

    The XPath rules to apply to extract content from the designated Body field. Extractions are added on to the document.

    object attributes: {
     xpath (required): {
      display name: Value Expression
      type: string
     }
     field (required): {
      display name: Field Expression
      type: string
     }
     multivalue: {
      display name: Multi Value
      type: boolean
     }
    }

    metadata - array[object]

    Pass in any additional key/value pairs to be added to the document.

    object attributes: {
     field (required): {
      display name: Field
      type: string
     }
     value (required): {
      display name: Value
      type: string
     }
    }

    keepParent - boolean

    If true, keep the parent document. If false, the content extracted from the body field will create one or more new documents.

    Default: false