Natural Language Processing

Table of Contents

OpenNLP NER Extraction pipeline stage
NLP Annotator pipeline stages
NLP features

This topic describes Fusion’s Natural Language Processing (NLP) features, available in the legacy OpenNLP NER Extraction index pipeline stage and NLP Annotator index and query pipeline stages.

Lucidworks offers free training to help you get started.

The Course for Natural Language Processing focuses on how to implement NLP applications to optimize search relevancy.

Visit the LucidAcademy to see the full training catalog.

OpenNLP NER Extraction pipeline stage

The OpenNLP NER Extraction index pipeline stage performs only Named Entity Recognition (NER). This stage is available in all versions of Fusion.

For additional NLP functionality, use the NLP Annotator pipeline stages, available in Fusion versions 4.2 and later. See below for details.

NLP Annotator pipeline stages

The NLP Annotator is both an index pipeline stage and a query pipeline stage. The NLP Annotator performs a variety of fundamental NLP tasks:

Sentence detection
Named Entity Recognition (NER)
Part-of-Speech (POS) Tagging

When configured in an index pipeline, the NLP Annotator performs the selected NLP tasks on raw document content during indexing. When configured in a query pipeline, it performs the selected NLP tasks on the query text.

NLP features

Fusion’s NLP Annotator pipeline stages include the NLP features described below.

Sentence detection

Sentence detection is the process of analyzing text to determine sentence boundaries. It is typically the first step in any kind of natural language processing on a document. The detected sentences are commonly indexed in a multi-valued field that can be used for various purposes, as in these examples:

Relevancy: Boost documents whose first sentence matches the query terms.
Snippets: When presenting the search results, display the first few sentences of each document.
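
The two uses above can be sketched in a few lines of Python. This is only an illustration, not how the NLP Annotator stage detects sentences: the regular-expression rule is a deliberately naive stand-in for a trained sentence model, and the field names (body_t, sentences_ss) are hypothetical.

```python
import re

# Naive boundary rule: split on whitespace that follows '.', '!' or '?'
# and precedes an uppercase letter. Trained sentence detectors handle
# abbreviations such as "Dr." far better than this rule does.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the sentences found in text, in order."""
    return [s.strip() for s in _BOUNDARY.split(text) if s.strip()]


def to_document(doc_id, body):
    """Build a dict with a multi-valued sentence field (field names are hypothetical)."""
    return {"id": doc_id, "body_t": body, "sentences_ss": split_sentences(body)}


def snippet(doc, max_sentences=2):
    """Join the first few sentences for display in search results."""
    return " ".join(doc["sentences_ss"][:max_sentences])


doc = to_document(
    "doc-1",
    "Fusion indexes documents. It can detect sentences! Is that useful? Yes.",
)
print(doc["sentences_ss"])
print(snippet(doc))
```

Storing each sentence as a separate value keeps the first sentence addressable on its own, which is what the relevancy and snippet examples rely on.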

Named Entity Recognition (NER)

Named Entity Recognition is an information extraction technique that identifies and segments the named entities in a text and classifies them under these predefined classes:

person
organization
location

For example:

Jane is the CEO of Example Company, based in San Francisco.

Person: Jane
Organization: Example Company
Location: San Francisco

Named entity recognition is widely used in today’s text mining projects. When organizations store large volumes of business documents in Fusion, the natural next step is to turn that text-centric data into some kind of knowledge base.

Take entity linking projects, for example: The client may want to link all relevant documents with an existing list of entities of interest. One way of doing this is to extract entities from all raw text documents, then perform fuzzy matching or another kind of text pattern matching to link relevant documents with a specific entity from the given list. This is more efficient than scanning the whole document and trying to search for the entity name. In this scenario, NER extraction is an ideal tool.
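
The linking step described above might look roughly like the following sketch. It assumes the entities have already been extracted per document; the extracted dictionary, the link_documents helper, and the 0.85 threshold are all hypothetical, and Python's standard difflib stands in for whatever fuzzy-matching method a real project would choose.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_documents(extracted, entities_of_interest, threshold=0.85):
    """Map each entity of interest to the IDs of documents that mention it.

    extracted: dict of document ID -> list of entity strings found by NER.
    """
    links = {entity: [] for entity in entities_of_interest}
    for doc_id, mentions in extracted.items():
        for entity in entities_of_interest:
            # Compare only the short extracted mentions, not the full text.
            if any(similarity(m, entity) >= threshold for m in mentions):
                links[entity].append(doc_id)
    return links


# Hypothetical NER output for three documents.
extracted = {
    "doc-1": ["Jane", "Example Company", "San Francisco"],
    "doc-2": ["Example Compnay"],  # misspelled mention
    "doc-3": ["Another Firm"],
}
print(link_documents(extracted, ["Example Company"]))
```

Because the comparison runs over a handful of extracted mentions per document rather than the full text, the fuzzy match stays cheap even when the documents themselves are long.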

Fusion has integrated NER capability into its indexing and query pipelines to enable customers to perform knowledge discovery easily.

Part-of-Speech (POS) tagging

POS tagging labels each word in a text with its grammatical category, such as noun, verb, or adjective. One of its most important roles is "word sense disambiguation". For instance, when searching for the word "present" with the intent of finding the concept of a gift, having "present" tagged as a noun helps filter out content in which "present" is a verb, meaning to show or introduce something to an audience.
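
A minimal sketch of that filtering idea follows. The tagged documents are hand-written, hypothetical data in the shape a POS tagger might produce, and docs_with_sense is an illustrative helper, not a Fusion API.

```python
# Hypothetical pre-tagged documents: each token is a (word, POS tag) pair.
# The tags were assigned by hand for illustration.
tagged_docs = {
    "doc-1": [("She", "PRON"), ("bought", "VERB"), ("a", "DET"), ("present", "NOUN")],
    "doc-2": [("They", "PRON"), ("present", "VERB"), ("the", "DET"), ("results", "NOUN")],
}


def docs_with_sense(tagged_docs, word, pos):
    """Return IDs of documents in which word occurs with the given POS tag."""
    word = word.lower()
    return [
        doc_id
        for doc_id, tokens in tagged_docs.items()
        if any(tok.lower() == word and tag == pos for tok, tag in tokens)
    ]


# Keep only documents where "present" is used as a noun (the gift sense).
print(docs_with_sense(tagged_docs, "present", "NOUN"))
```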