Tag Part-of-Speech Index Stage

The Tag Part-of-Speech Index stage (previously called the Part of Speech stage) operates over one or more fields in the Pipeline Document. It marks sentences with part-of-speech information as annotations, which downstream indexing stages can use. Because it tags sentences, this stage requires a Detect Sentences stage defined over the same fields earlier in the pipeline.

This stage uses the Apache OpenNLP project’s Part of Speech Tagger to mark each token with its word type, based on the token itself and its context. The OpenNLP documentation states:

"A token might have multiple pos tags depending on the token and the context. The OpenNLP POS Tagger uses a probability model to predict the correct pos tag out of the tag set. To limit the possible tags for a token a tag dictionary can be used which increases the tagging and runtime performance of the tagger."

Fusion comes with a set of OpenNLP language models for English. These data files are found in the directory fusion/3.1.x/data/nlp/models.

More models are available from the OpenNLP models SourceForge repository. Model files must be uploaded to Fusion using the Fusion Blob Store service via the REST API.

Part-of-Speech Tagging in an NLP Pipeline

The following video shows how to use a part-of-speech indexing stage as part of an NLP pipeline:

Stage Setup

Here is an example of how to upload a part-of-speech model file to the Fusion blob store:

INPUT

curl -u user:pass -X PUT --data-binary @en-pos-maxent.bin -H 'Content-type: text/plain' http://localhost:8764/api/apollo/blobs/en-pos-maxent.bin

OUTPUT

{
  "name" : "en-pos-maxent.bin",
  "contentType" : "text/plain",
  "size" : 5696197,
  "modifiedTime" : "2015-07-15T06:57:48.636Z",
  "version" : 0,
  "md5" : "db2cd70395b9e2e4c6b9957015a10607"
}
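
One way to confirm that the upload arrived intact is to compare the "md5" value in the response with a checksum computed locally. The following is a minimal sketch, assuming that the "md5" field is the hex MD5 digest of the uploaded bytes (the output above suggests this, but it is not stated here). The function names response_md5 and verify_upload are hypothetical helpers, not part of Fusion.

```shell
#!/bin/sh
# Extract the value of the "md5" field from a blob store JSON response.
# Hypothetical helper; works on single-line or pretty-printed responses.
response_md5() {
  printf '%s\n' "$1" |
    sed -n 's/.*"md5"[[:space:]]*:[[:space:]]*"\([0-9a-f]*\)".*/\1/p'
}

# Succeed only if the local file's MD5 matches the one in the response.
# Assumes md5sum (GNU coreutils); on macOS, `md5 -q FILE` is the equivalent.
verify_upload() {
  local_md5=$(md5sum "$1" | cut -d ' ' -f 1)
  [ "$local_md5" = "$(response_md5 "$2")" ]
}

# Sample response, abbreviated from the OUTPUT above.
response='{ "name" : "en-pos-maxent.bin", "size" : 5696197, "md5" : "db2cd70395b9e2e4c6b9957015a10607" }'
response_md5 "$response"
```

In practice the response would be captured from the curl command, for example with response=$(curl ...), and then passed to verify_upload together with the path of the local model file.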

This is an example setup of this stage using the previously loaded .bin file:

INPUT

curl -u user:pass -X POST -H 'Content-type: application/json' -d '{"id":"TagPartofSpeech1", "type": "tag-part-of-speech","tokenizerModel":"en-pos-maxent.bin","posModel":"en-pos-perceptron.bin","source": ["sample","text","for","NLP"]}' http://localhost:8764/api/apollo/index-stages/instances

OUTPUT

{
  "type" : "tag-part-of-speech",
  "id" : "TagPartofSpeech1",
  "posModel" : "en-pos-perceptron.bin",
  "tokenizerModel" : "en-sent.bin",
  "source" : [ "sample", "text", "for", "NLP" ],
  "skip" : false,
  "label" : "tag-part-of-speech"
}
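
Hand-written JSON inside a single-quoted curl argument is easy to get wrong, so it can help to assemble the request body from named values first. The sketch below is one possible approach, not an official tool: stage_payload is a hypothetical helper, the property names are the ones used in the example above, and the values are assumed to contain no characters that need JSON escaping.

```shell
#!/bin/sh
# Build the JSON body for a tag-part-of-speech stage definition.
# Usage: stage_payload ID POS_MODEL TOKENIZER_MODEL FIELD [FIELD...]
# Hypothetical helper: values are inserted as-is, without JSON escaping.
stage_payload() {
  id=$1
  pos_model=$2
  tokenizer_model=$3
  shift 3
  fields=''
  for f in "$@"; do
    # Append each field name as a quoted, comma-separated array element.
    fields="${fields:+$fields,}\"$f\""
  done
  printf '{"id":"%s","type":"tag-part-of-speech","posModel":"%s","tokenizerModel":"%s","source":[%s]}\n' \
    "$id" "$pos_model" "$tokenizer_model" "$fields"
}

# Print the body so it can be inspected before it is sent.
stage_payload TagPartofSpeech1 en-pos-perceptron.bin en-sent.bin sample text for NLP

# To send it, pipe the body to curl, which reads it from stdin with -d @-:
#   stage_payload ... | curl -u user:pass -X POST \
#     -H 'Content-type: application/json' -d @- \
#     http://localhost:8764/api/apollo/index-stages/instances
```

Printing the body before posting makes it straightforward to compare the submitted values with those echoed back in the response.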

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
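
The difference between the two forms can be sketched in shell: a value typed as \t in the UI needs its backslash doubled before it is placed inside a JSON string for the API. This is only an illustration of the rule in the tip. The helper json_escape_backslashes and the property name "delimiter" are hypothetical and are not taken from this stage's configuration.

```shell
#!/bin/sh
# Double every backslash so a UI-style value can be embedded in a JSON string.
# Hypothetical helper; it handles backslashes only, not quotes or control characters.
json_escape_backslashes() {
  printf '%s' "$1" | sed 's/\\/\\\\/g'
}

ui_value='\t'                                      # as typed in the UI
api_value=$(json_escape_backslashes "$ui_value")   # as written in a JSON request body

# "delimiter" is a placeholder property name used only for illustration.
printf '{"delimiter":"%s"}\n' "$api_value"
```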