
      Natural Language Processing

      This topic describes Fusion’s Natural Language Processing (NLP) features, available in the legacy OpenNLP NER Extraction index pipeline stage and the newer NLP Annotator index and query pipeline stages.

      OpenNLP NER Extraction pipeline stage

      The OpenNLP NER Extraction index pipeline stage performs only Named Entity Recognition (NER). This stage is available in all versions of Fusion.

      For additional NLP functionality, use the NLP Annotator pipeline stages, available in Fusion versions 4.2 and up. See below for details.

      NLP Annotator pipeline stages

      The NLP Annotator is available as both an index pipeline stage and a query pipeline stage. It performs a variety of fundamental NLP tasks, including sentence detection, Named Entity Recognition (NER), and Part-of-Speech (POS) tagging.

      When configured in an index pipeline, the NLP Annotator performs the selected NLP tasks on raw document content during indexing. When configured in a query pipeline, it performs the selected NLP tasks on the query text.

      NLP features

      Fusion’s NLP Annotator pipeline stages include the NLP features described below.

      Sentence detection

      Sentence detection is the process of analyzing text to determine sentence boundaries. It is typically the first step in any kind of natural language processing on a document. The detected sentences are commonly indexed as a multi-valued field that can be used for various purposes, as in these examples:

      • Relevancy: Boost documents whose first sentence matches the query terms.

      • Snippets: When presenting the search results, display the first few sentences of each document.
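
      The idea of storing sentences in a multi-valued field can be sketched in a few lines of Python. This is only an illustration, not Fusion's implementation: the regular-expression boundary rule, the function names, and the `sentences` field name are all assumptions made for this example, and a trained sentence model handles cases (such as abbreviations like "Dr.") that this naive rule gets wrong.

```python
import re

# Naive boundary rule (an assumption for illustration): a sentence ends at
# '.', '!' or '?' when followed by whitespace and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences in `text` as a list of stripped strings."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


def to_document(doc_id, body):
    """Build a dict with a hypothetical multi-valued `sentences` field."""
    return {"id": doc_id, "body": body, "sentences": split_sentences(body)}


doc = to_document(
    "doc-1",
    "Fusion indexes documents. It can also annotate them! Is that useful?",
)

# Relevancy use case: the first sentence is available on its own for boosting.
first_sentence = doc["sentences"][0]

# Snippet use case: show the first two sentences in search results.
snippet = " ".join(doc["sentences"][:2])
```

      Keeping the sentences as a list rather than one joined string is what makes both uses cheap: the first element serves the relevancy case, and a slice serves the snippet case.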

      Named Entity Recognition (NER)

      Named Entity Recognition is a widely used information extraction technique that identifies named entities in text and classifies them under these predefined classes:

      • person

      • organization

      • location

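      For example, in a hypothetical sentence such as "Ada Lovelace joined Acme Corporation in London", an NER model would typically tag "Ada Lovelace" as a person, "Acme Corporation" as an organization, and "London" as a location.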

      Named Entity Recognition is widely used in text mining projects. When organizations store large volumes of business documents in Fusion, the natural next step is to turn that text-centric data into some kind of knowledge base.

      Take entity linking projects, for example: The client may want to link all relevant documents with an existing list of entities of interest. One way of doing this is to extract entities from all raw text documents, then perform fuzzy matching or another kind of text pattern matching to link relevant documents with a specific entity from the given list. This is more efficient than scanning the whole document and trying to search for the entity name. In this scenario, NER extraction is an ideal tool.
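
      The linking step described above can be sketched with Python's standard `difflib` module. This is a minimal illustration under stated assumptions: the entity mentions are taken as already extracted by an NER stage, the `link_entities` helper and the 0.8 similarity threshold are hypothetical choices for this example, and a production system would likely use a more robust matching strategy.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entities(extracted, known_entities, threshold=0.8):
    """Map each extracted mention to its best-matching known entity.

    A mention maps to None when no known entity reaches `threshold`
    (a hypothetical cutoff chosen for illustration).
    """
    links = {}
    for mention in extracted:
        if not known_entities:
            links[mention] = None
            continue
        best = max(known_entities, key=lambda e: similarity(mention, e))
        links[mention] = best if similarity(mention, best) >= threshold else None
    return links


# Mentions as an NER stage might have extracted them from a document.
mentions = ["Lucidworks Inc", "Zebra"]
# The client's existing list of entities of interest.
entities_of_interest = ["Lucidworks", "Apple"]

links = link_entities(mentions, entities_of_interest)
```

      Matching a handful of extracted mentions against the entity list is far cheaper than scanning each full document for every entity name, which is the efficiency gain the paragraph above refers to.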

      Fusion has integrated NER capability into its indexing and query pipelines to enable customers to perform knowledge discovery easily.

      Part-of-Speech (POS) tagging

      POS tagging labels each word in a text with its grammatical category, such as noun or verb. One of its most important roles is word sense disambiguation. For instance, when searching for the word "present" with the intent of finding the concept of a gift, having "present" tagged as a noun helps filter out content where "present" is a verb, meaning to show or bring something before an audience.
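
      The filtering idea can be sketched as follows. The token and tag pairs here are hand-written stand-ins for the output of a real POS tagger, and the tag names, document IDs, and the `docs_with_sense` helper are assumptions made for this example rather than anything Fusion defines.

```python
# Hand-tagged tokens stand in for the output of a real POS tagger.
DOCS = {
    "doc-1": [("She", "PRON"), ("bought", "VERB"), ("a", "DET"), ("present", "NOUN")],
    "doc-2": [("They", "PRON"), ("present", "VERB"), ("the", "DET"), ("results", "NOUN")],
}


def docs_with_sense(docs, word, tag):
    """Return IDs of documents containing `word` tagged with POS `tag`."""
    word = word.lower()
    return [
        doc_id
        for doc_id, tokens in docs.items()
        if any(tok.lower() == word and pos == tag for tok, pos in tokens)
    ]


# A search for "present" in the sense of a gift keeps only the noun usage.
gift_docs = docs_with_sense(DOCS, "present", "NOUN")
```

      Both documents contain the word "present", so a plain keyword match would return both; matching on the word and its tag together is what separates the gift sense from the verb sense.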