OpenNLP NER Extraction
Index pipeline stage configuration specifications
Named Entity Recognition (NER) is the task of finding the names of persons, organizations, locations, and/or things in a passage of free text. The OpenNLP NER Extraction index stage (previously called the OpenNLP NER Extractor stage) uses a set of rules to find named entities in a field in the Pipeline Document (the "source") and populates a new field (the "target") with these entities.
This stage is deprecated in favor of SpaCy and Seldon Core functionality and removed in Managed Fusion 5.7. SpaCy offers better accuracy and detects more entities than OpenNLP.
This stage uses the Apache OpenNLP project’s Named Entity Recognition tool (the Name Finder tool). The OpenNLP documentation states:
The Name Finder tool can detect named entities and numbers in text. To be able to detect entities the Name Finder needs a model. The model is dependent on the language and entity type it was trained for. The OpenNLP projects offers a number of pre-trained name finder models which are trained on various freely available corpora. They can be downloaded at our model download page. To find names in raw text the text must be segmented into tokens and sentences.
Managed Fusion contains a common set of NER models for English that include sentence, token, and part-of-speech models. These models are:
Model | Purpose |
---|---|
en-sent.bin | Sentence model to detect sentences |
en-token.bin | Tokenizer model for tokenization of sentences |
en-ner-date.bin | Date name finder model |
en-ner-location.bin | Location name finder model |
en-ner-money.bin | Money name finder model |
en-ner-organization.bin | Organization name finder model |
en-ner-percentage.bin | Percentage name finder model |
en-ner-person.bin | Person name finder model |
en-ner-time.bin | Time name finder model |
See OpenNLP 1.5 series for additional pre-trained OpenNLP models.
To use these models, upload them to Managed Fusion using the Managed Fusion Blob Store service. Here is an example of how to upload the sentence model file using the curl command-line utility, where USERNAME is the name of a user with admin privileges and PASSWORD is that user's password:
curl -u USERNAME:PASSWORD -X PUT --data-binary @data/nlp/models/en-sent.bin -H 'Content-type: application/octet-stream' http://EXAMPLE_COMPANY.lucidworks.cloud/api/blobs/en-sent.bin
Replace EXAMPLE_COMPANY with the name provided by your Lucidworks representative.
See Natural Language Processing for more information.
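The curl upload shown above can be repeated for each model a stage needs. The following is a minimal sketch in Python, not an official client: it assumes the same placeholder host, the same local directory (data/nlp/models), and the same blob URL pattern as the curl example, and it covers only the four model files referenced in the example specification on this page. The helper names build_upload_request and upload_models are hypothetical.

```python
import base64
import urllib.request
from pathlib import Path

# Placeholder host and local directory, both copied from the curl example.
BASE_URL = "http://EXAMPLE_COMPANY.lucidworks.cloud/api/blobs"
MODEL_DIR = Path("data/nlp/models")

# Model files referenced by the example specification on this page.
MODEL_FILES = [
    "en-sent.bin",
    "en-token.bin",
    "en-ner-organization.bin",
    "en-ner-person.bin",
]


def build_upload_request(filename, username, password, data):
    """Build (but do not send) a PUT request equivalent to the curl example."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url=f"{BASE_URL}/{filename}",
        data=data,
        method="PUT",
        headers={
            "Content-Type": "application/octet-stream",
            "Authorization": f"Basic {token}",  # same effect as curl -u
        },
    )


def upload_models(username, password):
    """Read each model file from MODEL_DIR and send it to the blob store."""
    for filename in MODEL_FILES:
        data = (MODEL_DIR / filename).read_bytes()
        request = build_upload_request(filename, username, password, data)
        with urllib.request.urlopen(request) as response:
            print(filename, response.status)
```

Calling upload_models("USERNAME", "PASSWORD") would send one request per file. Building the request separately from sending it makes it possible to inspect the URL and headers before anything goes over the network.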
Example Specification
Specification of a stage that extracts the names of organizations and people from the field named 'body_t':
{
  "type":"nlp-extractor",
  "id":"iqtr",
  "rules":[
    {
      "source":[
        "body_t"
      ],
      "target":"organizations",
      "writeMode":"append",
      "sentenceModelLocation":"nlp/models/en-sent.bin",
      "tokenizerModelLocation":"nlp/models/en-token.bin",
      "entityTypes":[
        {
          "name":"organization",
          "definition":"nlp/models/en-ner-organization.bin"
        }
      ]
    },
    {
      "source":[
        "body_t"
      ],
      "target":"persons",
      "writeMode":"append",
      "sentenceModelLocation":"nlp/models/en-sent.bin",
      "tokenizerModelLocation":"nlp/models/en-token.bin",
      "entityTypes":[
        {
          "name":"person",
          "definition":"nlp/models/en-ner-person.bin"
        }
      ]
    }
  ],
  "skip":false,
  "label":"Extract Entities",
  "licensed":true,
  "secretSourceStageId":"iqtr"
}
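The two rules in the example above differ only in their target field and entity model, so a specification like this can be generated rather than written by hand. The sketch below is one possible way to do that in Python. The helper names make_rule and make_stage are hypothetical, and every property name and value is copied from the example rather than from a schema reference.

```python
import json

# Shared model locations, copied from the example specification.
SENTENCE_MODEL = "nlp/models/en-sent.bin"
TOKENIZER_MODEL = "nlp/models/en-token.bin"


def make_rule(source_field, target_field, entity_name, model_location):
    """Build one extraction rule with the same shape as the example's rules."""
    return {
        "source": [source_field],
        "target": target_field,
        "writeMode": "append",
        "sentenceModelLocation": SENTENCE_MODEL,
        "tokenizerModelLocation": TOKENIZER_MODEL,
        "entityTypes": [{"name": entity_name, "definition": model_location}],
    }


def make_stage(stage_id, rules):
    """Wrap the rules in a stage object mirroring the example's top-level keys."""
    return {
        "type": "nlp-extractor",
        "id": stage_id,
        "rules": rules,
        "skip": False,
        "label": "Extract Entities",
        "licensed": True,
        "secretSourceStageId": stage_id,
    }


stage = make_stage(
    "iqtr",
    [
        make_rule("body_t", "organizations", "organization",
                  "nlp/models/en-ner-organization.bin"),
        make_rule("body_t", "persons", "person",
                  "nlp/models/en-ner-person.bin"),
    ],
)

# Serialize to JSON text suitable for comparison with the example.
print(json.dumps(stage, indent=2))
```

Because Python dictionaries cannot hold the same key twice, building the specification this way also rules out accidentally repeated properties.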
Configuration
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
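One way to read the escaping rule above: the configuration value itself is the two-character sequence backslash plus t, and JSON requires a literal backslash to be written as \\. The short Python sketch below illustrates that reading with the standard json module. The property name "delimiter" is hypothetical and is used only for illustration.

```python
import json

# The value as typed in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# "delimiter" is a hypothetical property name, not one taken from this stage.
payload = json.dumps({"delimiter": ui_value})
print(payload)  # prints {"delimiter": "\\t"}

# Decoding the JSON text gives back the same two-character value.
decoded = json.loads(payload)["delimiter"]
print(decoded == ui_value)  # prints True
```

Letting a JSON library produce the request body avoids escaping the value by hand.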