NLP Annotator Index Stage

The NLP Annotator index stage performs Natural Language Processing tasks.

You can choose from three different NLP implementations: OpenNLP, SpaCy, and Spark NLP.

Set up and behavior differ depending on the implementation.

OpenNLP

The OpenNLP implementation is ready to use out of the box. Simply specify "opennlp" as the "Model ID" property.

These annotation tasks are supported:

  • NER

  • Sentence detection

  • POS Tagging

  • Shallow Parsing (Chunking)

SpaCy

The SpaCy implementation is ready to use out of the box. The default SpaCy implementation uses the en_core_web_sm model. Specify "spacy" as the "Model ID" property.

These annotation tasks are supported:

  • NER

  • Sentence detection

  • POS tagging

The label schemes used for each annotation task are documented at https://spacy.io/api/annotation.

Spark NLP

The Spark NLP implementation requires that you first download a Spark NLP model and upload it to Fusion.

  1. Download a model from https://nlp.johnsnowlabs.com/docs/en/pipelines

  2. Upload the model to Fusion using the following curl command:

    curl -u [username]:[password] \
      -X POST \
      "https://[fusion host]/api/ai/ml-models?modelId=[desired model ID]&type=spark-nlp" \
      -F "file=@/path/to/model.zip"

    For example, if you want to use the "Explain Document ML" model:

  3. Download the latest version of the "Explain Document ML" model (explain_document_ml_en_2.1.0_2.4_1563203154682.zip at the time of this writing).

  4. Upload the model to Fusion:

    curl -u [username]:[password] \
      -X POST \
      "https://[fusion host]/api/ai/ml-models?modelId=explain_document_ml&type=spark-nlp" \
      -F "file=@/path/to/explain_document_ml_en_2.1.0_2.4_1563203154682.zip"

  5. When configuring this stage, specify "explain_document_ml" as the Model ID.
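The upload in steps 2 and 4 can also be scripted. Below is a minimal Python sketch that assembles the same endpoint URL the curl examples use; the host name is a placeholder, and actually sending the multipart upload is left as a comment since it requires credentials and an HTTP client such as `requests`:

```python
from urllib.parse import urlencode

def build_upload_url(host: str, model_id: str, model_type: str = "spark-nlp") -> str:
    """Build the Fusion model-upload endpoint URL shown in the curl examples."""
    query = urlencode({"modelId": model_id, "type": model_type})
    return f"https://{host}/api/ai/ml-models?{query}"

# Example: the "Explain Document ML" model from step 4 (hypothetical host).
url = build_upload_url("fusion.example.com", "explain_document_ml")
print(url)

# Sending the file requires a client that supports multipart uploads, e.g.:
#   import requests
#   with open("explain_document_ml_en_2.1.0_2.4_1563203154682.zip", "rb") as f:
#       requests.post(url, auth=("username", "password"), files={"file": f})
```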

For Spark NLP, the supported annotation tasks depend on the model used.

Example of how to use the NLP Annotator index stage:
  1. Add the NLP Annotator index stage.


  2. Supply the Model ID ("opennlp", "spacy", or the model ID given to the uploaded Spark NLP model).

  3. Configure the index pipeline stage.

  4. Specify the source, label pattern, and target (destination) fields:

    • source field: the raw text containing the named entities to be extracted.

    • label pattern: a regex pattern that matches the NER/POS labels. For example, PER. matches extracted named entities with the label PERSON, while NN. matches tagged nouns.

    • target field: the field that receives the extraction/tagging results.

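The label pattern is an ordinary regular expression matched against the annotation labels. A small Python sketch of how patterns such as PER. and NN. behave, assuming prefix matching as in Python's re.match (the stage's exact matching semantics may differ, and the label lists here are illustrative, not exhaustive):

```python
import re

# Illustrative label sets; see the spaCy / Penn Treebank schemes for full lists.
ner_labels = ["PERSON", "ORG", "GPE"]
pos_labels = ["NNS", "NNP", "VBD", "JJ"]

def select(pattern: str, labels: list[str]) -> list[str]:
    # Assumes prefix matching (re.match); keeps labels the pattern matches.
    return [lb for lb in labels if re.match(pattern, lb)]

print(select("PER.", ner_labels))  # ['PERSON']
print(select("NN.", pos_labels))   # ['NNS', 'NNP']
```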

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
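The difference comes from JSON's own escaping: in an API request body the backslash itself must be escaped, so the sequence you type in the UI gains an extra backslash on the wire. A small Python illustration using the standard json module:

```python
import json

# In the UI you type the two characters "\t" (backslash, then t).
ui_value = "\\t"

# In a JSON API body the backslash must itself be escaped,
# so the same value is written as \\t on the wire.
api_value = json.dumps(ui_value)
print(api_value)  # "\\t"
```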