Phrase Extraction Jobs

Use this job when you want to identify phrases in your content.

To use this job, you must first upload the OpenNLP Maxent model to the blob store.

Minimum configuration

For most use cases, the minimum configuration for this job consists of these fields:

  • id/Spark Job ID

    Give this job an arbitrary ID string.

  • trainingCollection/Training Collection

    Specify the input collection.

  • fieldToVectorize/Field to Vectorize

    Specify the field in the input collection where phrases can be found.

  • outputCollection/Output Collection

    Specify the collection in which the output documents should be indexed.

When running this job over a primary collection, be sure to set attachPhrases/Extract Key Phrases from Input Text to "true". The default is "false", which works well when running the job over a signals collection.
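
As a rough illustration, the minimum configuration described above can be assembled as a simple key/value structure. The sketch below is in Python and uses only the field names mentioned in this section. The helper name build_phrase_extraction_config and all of the example values (job ID, collection names, field name) are hypothetical. Any other fields your deployment requires are omitted, so treat this as a starting point and not a complete job definition.

```python
import json


def build_phrase_extraction_config(job_id, training_collection,
                                   field_to_vectorize, output_collection,
                                   primary_collection=False):
    """Assemble the minimum fields named in this section (hypothetical helper).

    primary_collection=True sets attachPhrases to true, as recommended when
    the job runs over a primary collection instead of a signals collection.
    """
    return {
        "id": job_id,                              # arbitrary ID string
        "trainingCollection": training_collection,  # input collection
        "fieldToVectorize": field_to_vectorize,     # field containing phrases
        "outputCollection": output_collection,      # where output docs are indexed
        "attachPhrases": primary_collection,        # default is false
    }


# Example values below are placeholders, not names from a real deployment.
config = build_phrase_extraction_config(
    job_id="phrase-extraction-demo",
    training_collection="products",
    field_to_vectorize="description_t",
    output_collection="products_phrases",
    primary_collection=True,
)

print(json.dumps(config, indent=2))
```

Keeping attachPhrases as an explicit argument makes the difference between primary and signals collections visible at the call site.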

Output documents

This job outputs one document per phrase. Each phrase is one that appears in multiple documents in the specified field of the input collection.

In each document, these fields are most useful:

  • The phrase itself is in the phrases_s field, which can be used for faceting.

  • The likelihood field gives the likelihood that the phrase is legitimate, from 0 to infinity.

    Phrases with a low likelihood are automatically trimmed from the results.

  • When a phrase’s likelihood value is ambiguous, the review field is set to "true" to indicate that the phrase should be reviewed.

  • A phrase_count field indicates the number of instances of the phrase in the input collection.
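
The fields above can be combined when post-processing the job's output. The following Python sketch assumes the output documents have already been fetched as a list of dictionaries keyed by the field names used in this section (phrases_s, likelihood, review, phrase_count). The sample documents and the helper names needs_review and top_phrases are made up for illustration, and the sketch accepts review as either a string or a boolean because the stored type is not specified here.

```python
def needs_review(doc):
    """Return True when the review field marks the phrase as ambiguous."""
    return str(doc.get("review", "false")).lower() == "true"


def top_phrases(docs, limit=10):
    """Return phrases not flagged for review, highest likelihood first."""
    trusted = [d for d in docs if not needs_review(d)]
    trusted.sort(key=lambda d: d["likelihood"], reverse=True)
    return [d["phrases_s"] for d in trusted[:limit]]


# Hypothetical output documents, for illustration only.
sample_docs = [
    {"phrases_s": "machine learning", "likelihood": 42.5,
     "review": "false", "phrase_count": 17},
    {"phrases_s": "of the", "likelihood": 1.2,
     "review": "true", "phrase_count": 90},
    {"phrases_s": "blob store", "likelihood": 8.0,
     "review": "false", "phrase_count": 5},
]

print(top_phrases(sample_docs))  # ['machine learning', 'blob store']
```

Phrases flagged for review are set aside here and not discarded, so they can still be inspected manually.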

The complete list of output fields is shown below.

Output fields


The name of the Phrase Extraction job that generated this document.


This is always key_phrases for documents generated by a Phrase Extraction job.


A unique ID for this document.


The collection used for this job’s input.


The likelihood that the value of phrases_s is a legitimate phrase, from 0 to infinity.


The number of occurrences of this phrase in the input collection.


The phrase detected by the job.


"True" indicates that this may not be a valid phrase and should be reviewed.


This is always "1".


The date and time when the document was generated.


The number of words in this phrase.


An internal Solr field used for partial updates.