Entity Extraction

Fusion includes extensive entity extraction capabilities. Entity extraction is configured as an index pipeline stage, and there are several stage types corresponding to the different kinds of entity extraction you might want to perform on your documents. Each type is described in more detail below.

Many of the entity extraction capabilities require models or lookup files, and a number of these are provided by default. You can find the files in fusion/3.0.x/data/nlp/, but in order to use them in an index pipeline stage, you will need to load them to Solr using the Blob Store API.

Loading Models and Lookup Files to Solr

To load a file, make a PUT request to the Blob Store API, as in this example:

curl -u admin:pass -X PUT --data-binary @data/nlp/models/en-sent.bin -H 'Content-type: application/octet-stream' http://localhost:8764/api/apollo/blobs/sentenceModel.bin

Note that the last path segment of the endpoint ('sentenceModel.bin' in the example above) is the ID assigned to the file. You will use this ID in the index pipeline definition to indicate which model or lookup file to use. If the ID is omitted from the request (which is possible with a POST request), a random ID will be assigned, making it difficult to tell one stored blob from another.
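
To confirm that a blob was stored, you can also list the uploaded blobs with a GET request to the same endpoint (adjust the host, port, and credentials for your installation):

curl -u admin:pass http://localhost:8764/api/apollo/blobs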

Entity Extraction Capabilities

Lookup Lists

The lookup lists are the most numerous of the available files. Many of them are simple lists that you will want to add your own values to, though some may be robust enough for your needs as they are.

The available lists are found in fusion/3.0.x/data/nlp/gazetteer.
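
If you extend one of the supplied lists or create your own, you would upload it to the blob store in the same way as a model. As a sketch, this assumes a plain-text file with one entry per line; the file name and blob ID below are placeholders:

curl -u admin:pass -X PUT --data-binary @data/nlp/gazetteer/my_companies.lst -H 'Content-type: application/octet-stream' http://localhost:8764/api/apollo/blobs/myCompanies.lst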

To use lookup list-based entity extraction, you would configure a Gazetteer Lookup Extraction index stage as part of your pipeline.

OpenNLP Models

We have also included entity extraction models from the OpenNLP project. These models were trained on news articles and may not be suitable for all entity extraction needs. The Fusion-supplied models are located in fusion/3.0.x/data/nlp/models.

If you have created your own model for your data, you can load it to the blob store and use it in an NLP index stage as described above.
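
For example, a custom model trained with the OpenNLP tools could be uploaded like this (the local path and blob ID here are placeholders):

curl -u admin:pass -X PUT --data-binary @my-models/en-ner-custom.bin -H 'Content-type: application/octet-stream' http://localhost:8764/api/apollo/blobs/customNerModel.bin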

To use OpenNLP-based entity extraction, you would configure an OpenNLP NER Extraction index stage as part of your pipeline.

Regular Expression Extraction

Regular expression extraction uses a regular expression to find entities in your documents. The extracted entities are then copied to a field you define.
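
For example, a pattern like the following could be configured to extract email addresses (an illustrative expression only; tune it to your data):

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}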

To use regular expression extraction, you would configure a Regex Field Extraction index stage as part of your pipeline.

Regex Field Filter

The Regex Field Filter stage allows you to remove a field based on a regular expression. The entire field is removed from the document; there is not yet an option for removing only specific entities found in the field, or for excluding entire documents based on values found in a field.
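
For example, assuming the expression is matched against the field's value, a pattern such as the following would remove any field whose value consists entirely of digits (an illustrative pattern only):

^[0-9]+$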

To use Regex Field Filter, you would configure a Regex Field Filter index stage as part of your pipeline.

Exclusion Lists

An exclusion list is a list of known items that should be removed from a document. As with the Regex Field Filter stage, the entire field is removed from the document; there is not yet an option for removing only specific entities found in the field, or for excluding entire documents based on values found in a field.

Note that the list must be loaded to Solr using the Blob Store API before it can be used with a Fusion index pipeline.
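
As a sketch, assuming a plain-text file with one excluded term per line, the upload looks like any other blob request (the file name and blob ID are placeholders):

curl -u admin:pass -X PUT --data-binary @data/nlp/my_exclusions.txt -H 'Content-type: application/octet-stream' http://localhost:8764/api/apollo/blobs/exclusionList.txt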

To use exclusion list filtering, you would configure an Exclusion Filter index stage as part of your pipeline.

Filter Short Fields

The Filter Short Fields stage removes field values whose length is at or below a defined character limit. For example, with a limit of 2, the values 'a' and 'ok' would both be removed, but 'oak' would be kept.

To use the Filter Short Fields stage, you would configure a Filter Short Fields index stage as part of your pipeline.