Phrase Extraction Jobs

Identify multi-word phrases in signals.

This job is deprecated in Fusion 5.9.15 and will be removed in a future release. Lucidworks recommends migrating to Neural Hybrid Search, which achieves superior relevance compared to legacy machine learning methods.

Resolving Underperforming Queries

The course for Resolving Underperforming Queries focuses on tips for tuning, running, and cleaning up Fusion’s query rewrite jobs.


Default job name	`COLLECTION_NAME_phrase_extraction`
Input	Raw signals (the `COLLECTION_NAME_signals` collection by default)
Output	Extracted phrases (the `COLLECTION_NAME_query_rewrite_staging` collection by default)

	query	count_i	type	timestamp_tdt	user_id	doc_id	session_id	fusion_query_id
Required signals fields:	✅	✅	✅

This job writes to the COLLECTION_NAME_query_rewrite_staging collection. It also uses reviewed documents from that collection to improve its accuracy. You can review, edit, deploy, or delete output from this job using the Query Rewriting UI. Fusion ships with the OpenNLP Maxent model already loaded in the blob store. Output from this job and from the Token and Phrase Spell Correction job can be used as input for the Synonym Detection job.

Minimum configuration

For most use cases, the minimum configuration for this job consists of these fields:

id/Spark Job ID: Give this job an arbitrary ID string.
trainingCollection/Training Collection: Specify the input collection.
fieldToVectorize/Field to Vectorize: Specify the field in the input collection where phrases can be found.
outputCollection/Output Collection: Specify the collection in which the output documents should be indexed.

When running this job over a content document collection, be sure to set attachPhrases/Extract Key Phrases from Input Text to “true”. The default is “false”, which works well when running the job over a signals collection.
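
As a rough illustration, the minimum configuration can be written as a small Python dictionary and serialized to JSON. The property names (id, trainingCollection, fieldToVectorize, outputCollection, attachPhrases) and the default naming pattern come from this page. The helper name build_phrase_extraction_config, the collection name "products", and the choice of "query" as the field are assumptions made for the example. A real job definition may need additional properties that are not shown here.

```python
import json


def build_phrase_extraction_config(collection, field, attach_phrases=False):
    """Return a dict holding only the minimum properties described above.

    Hypothetical helper: it follows the default naming pattern from this page
    (COLLECTION_NAME_signals in, COLLECTION_NAME_query_rewrite_staging out).
    """
    if not collection or not field:
        raise ValueError("collection and field must be non-empty")
    return {
        "id": f"{collection}_phrase_extraction",
        "trainingCollection": f"{collection}_signals",
        "fieldToVectorize": field,
        "outputCollection": f"{collection}_query_rewrite_staging",
        # Leave False for a signals collection; set True for content documents.
        "attachPhrases": attach_phrases,
    }


config = build_phrase_extraction_config("products", "query")
print(json.dumps(config, indent=2))
```

Passing attach_phrases=True corresponds to the content-collection case described above.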

Output documents

By default, the job outputs only the phrases found in the input documents. In each row of the phrase output, these fields are most useful:

The phrase itself is in the phrases_s field, which can be used for faceting.
The likelihood_d field gives the likelihood that the phrase is legitimate, from 0 to infinity.
Low-probability phrases are automatically trimmed from the results.
When a phrase’s likelihood value is ambiguous, the review field is set to “true” to indicate that the phrase should be reviewed.
A phrase_count field indicates the number of instances of the phrase in the input collection.
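
The fields above lend themselves to a simple triage step once the output documents have been fetched. The sketch below is a minimal example under stated assumptions: the documents are already available as Python dictionaries, the sample values are invented, and the review field may arrive either as a boolean or as the string "true" (this page does not state its type). The function name split_for_review is hypothetical.

```python
def split_for_review(docs, min_likelihood=0.0):
    """Partition phrase documents into (accepted, needs_review) lists.

    Documents whose likelihood_d falls below min_likelihood are dropped.
    """
    accepted, needs_review = [], []
    for doc in docs:
        if doc.get("likelihood_d", 0.0) < min_likelihood:
            continue
        # Assumption: review may be a boolean or the string "true".
        if doc.get("review") in (True, "true"):
            needs_review.append(doc)
        else:
            accepted.append(doc)
    # Most frequent phrases first, using the phrase_count field.
    accepted.sort(key=lambda d: d.get("phrase_count", 0), reverse=True)
    return accepted, needs_review


# Invented sample documents, for illustration only.
sample_docs = [
    {"phrases_s": "usb c cable", "likelihood_d": 18.0, "review": "false", "phrase_count": 64},
    {"phrases_s": "shoes for", "likelihood_d": 1.2, "review": True, "phrase_count": 15},
    {"phrases_s": "running shoes", "likelihood_d": 42.5, "review": False, "phrase_count": 120},
]

accepted, needs_review = split_for_review(sample_docs)
print([d["phrases_s"] for d in accepted])
print([d["phrases_s"] for d in needs_review])
```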

The complete list of output fields is shown below.

Output fields


`aggr_id_s`	The name of the Phrase Extraction job that generated this document.
`doc_type_s`	This is always `key_phrases` for documents generated by a Phrase Extraction job.
`id`	A unique ID for this document.
`input_collection`	The collection used for this job’s input.
`likelihood_d`	The likelihood that the value of `phrases_s` is a legitimate phrase, from 0 to infinity.
`phrase_count`	The number of occurrences of this phrase in the input collection.
`phrases_s`	The phrase detected by the job.
`review`	“true” indicates that this may not be a valid phrase and should be reviewed.
`score`	This is always “1”.
`timestamp`	The date and time when the document was generated.
`word_num_i`	The number of words in this phrase.
`_version_`	An internal Solr field used for partial updates.

If the attachPhrases/Extract Key Phrases from Input Text parameter is set to “true”, then the job also outputs the original documents from the input collection, each with an appended field, phrases_extracted_tt, that lists the phrases extracted from that document. To distinguish phrase output from original-document output, check the doc_type_s field, which has one of these values:

key_phrases denotes phrases output.
original_doc_with_phrases denotes the original documents.
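
When both kinds of document land in the same output collection, grouping on doc_type_s separates them. The sketch below assumes the output has already been retrieved as a list of dictionaries. The function name group_by_doc_type and the sample values are hypothetical, and the string form shown for phrases_extracted_tt is an assumption, since this page does not describe how that field is encoded.

```python
from collections import defaultdict


def group_by_doc_type(docs):
    """Group output documents by their doc_type_s value."""
    groups = defaultdict(list)
    for doc in docs:
        # Documents without the field are kept under "unknown" rather than dropped.
        groups[doc.get("doc_type_s", "unknown")].append(doc)
    return dict(groups)


# Invented sample output, for illustration only.
output_docs = [
    {"id": "p1", "doc_type_s": "key_phrases", "phrases_s": "running shoes"},
    {"id": "d7", "doc_type_s": "original_doc_with_phrases",
     "phrases_extracted_tt": "running shoes"},  # assumed encoding
    {"id": "p2", "doc_type_s": "key_phrases", "phrases_s": "usb c cable"},
]

groups = group_by_doc_type(output_docs)
phrase_docs = groups.get("key_phrases", [])
original_docs = groups.get("original_doc_with_phrases", [])
print(len(phrase_docs), len(original_docs))
```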

Configuration properties
