OpenNLP NER Extraction Index Stage

Named Entity Recognition (NER) is the task of finding the names of persons, organizations, locations, and other things in a passage of free text. The OpenNLP NER Extraction index stage (previously called the OpenNLP NER Extractor stage) uses a set of rules to find named entities in a field of the Pipeline Document (the "source") and populates a new field (the "target") with these entities.

This stage uses the Apache OpenNLP project's Named Entity Recognition tool (the Name Finder tool). The OpenNLP documentation states:

The Name Finder tool can detect named entities and numbers in text. To be able to detect entities the Name Finder needs a model. The model is dependent on the language and entity type it was trained for. The OpenNLP project offers a number of pre-trained name finder models which are trained on various freely available corpora. They can be downloaded at our model download page. To find names in raw text the text must be segmented into tokens and sentences.

Models are available from the OpenNLP models SourceForge repository.

The Fusion directory fusion/3.0.x/data/nlp contains a set of NER models for English, as well as sentence, token, and part-of-speech models.

Before they can be used, model files must be uploaded to Fusion through the Blob Store service REST API. The following example uploads the sentence model file from the fusion/3.0.x directory using the curl command-line utility, where admin is the name of a user with admin privileges and pass is that user's password:

curl -u admin:pass -X PUT --data-binary @data/nlp/models/en-sent.bin -H 'Content-type: application/octet-stream' http://localhost:8764/api/apollo/blobs/en-sent.bin
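A stage typically needs a sentence model, a tokenizer model, and one model per entity type, so it can be convenient to upload several files in one pass. The following is a minimal sketch, not a script shipped with Fusion. It reuses the endpoint, credentials, and model directory from the command above; the helper names (blob_url, upload_model, upload_models), the FUSION_URL, FUSION_AUTH, and MODEL_DIR variables, and the DRY_RUN switch are assumptions introduced for illustration.

```shell
#!/bin/sh
# Sketch: upload several OpenNLP model files to the Fusion blob store.
# Assumptions: host, port, credentials, and model directory match the
# single-file curl example; adjust them for your installation.
FUSION_URL="${FUSION_URL:-http://localhost:8764}"
FUSION_AUTH="${FUSION_AUTH:-admin:pass}"
MODEL_DIR="${MODEL_DIR:-data/nlp/models}"

# Build the blob store URL for one model file name.
blob_url() {
    printf '%s/api/apollo/blobs/%s\n' "$FUSION_URL" "$1"
}

# Upload one model file. With DRY_RUN=1, only print the target URL.
# --fail makes curl exit non-zero when the server returns an HTTP error.
upload_model() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "PUT $(blob_url "$1")"
        return 0
    fi
    curl --fail -u "$FUSION_AUTH" -X PUT \
        --data-binary "@$MODEL_DIR/$1" \
        -H 'Content-type: application/octet-stream' \
        "$(blob_url "$1")"
}

# Upload every file name passed as an argument, stopping at the first failure.
upload_models() {
    for model in "$@"; do
        upload_model "$model" || return 1
    done
}
```

After sourcing these functions, running DRY_RUN=1 upload_models en-sent.bin en-token-1.bin prints the two PUT targets without contacting Fusion, which is a cheap way to check the URLs before a real upload.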

Example Specification

Specification of a stage that extracts the names of organizations and people from the field named 'body_s':

{
    "type" : "nlp-extraction",
    "id" : "nd",
    "rules" : [ {
      "source" : [ "body_s" ],
      "target" : "content",
      "sentenceModel" : "en-sent.bin",
      "tokenizerModel" : "en-token-1.bin",
      "entityTypes" : [ {
        "name" : "organization",
        "definition" : "en-ner-organization-1.bin"
      }, {
        "name" : "person",
        "definition" : "en-ner-person-1.bin"
      } ]
    } ],
    "skip" : false
}
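Every model file named in a rule has to be present in the blob store before the stage can use it. As a quick check, the sketch below lists the .bin file names that a stage specification references, so they can be compared against what has been uploaded. It assumes the specification is saved to a file (stage.json is a hypothetical name) and relies on a plain text match with grep -o rather than a JSON parser; list_models is an illustrative helper, not part of Fusion.

```shell
# Print each distinct model file name (*.bin) that appears as a quoted
# string in the given stage specification file, one name per line.
list_models() {
    grep -o '"[^"]*\.bin"' "$1" | tr -d '"' | LC_ALL=C sort -u
}
```

Called as list_models stage.json on the specification above, this prints the four model file names it references: the sentence model, the tokenizer model, and the two entity models.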

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.