Fusion 5.9

    Phrase Extraction Job configuration specifications

    Identify multi-word phrases in signals.

    Lucidworks offers free training to help you get started with Fusion. Check out the Resolving Underperforming Queries course, which focuses on tips for tuning, running, and cleaning up Fusion’s query rewrite jobs:

    Resolving Underperforming Queries

    Visit the LucidAcademy to see the full training catalog.

    Default job name

    COLLECTION_NAME_phrase_extraction

    Input

    Raw signals (the COLLECTION_NAME_signals collection by default)

    Output

    Extracted phrases (the COLLECTION_NAME_query_rewrite_staging collection by default)

    This job writes to the COLLECTION_NAME_query_rewrite_staging collection. It also uses reviewed documents from that collection to improve the accuracy of the job. You can review, edit, deploy, or delete output from this job using the Query Rewriting UI.

    Managed Fusion ships with the OpenNLP Maxent model already loaded in the blob store.

    This job’s output, and output from the Token and Phrase Spell Correction job, can be used as input for the Synonym Detection job.

    Minimum configuration

    For most use cases, the minimum configuration for this job consists of these fields:

    • id/Spark Job ID

      Give this job an arbitrary ID string.

    • trainingCollection/Training Collection

      Specify the input collection.

    • fieldToVectorize/Field to Vectorize

      Specify the field in the input collection where phrases can be found.

    • outputCollection/Output Collection

      Specify the collection in which the output documents should be indexed.

    When running this job over a content document collection, be sure to set attachPhrases/Extract Key Phrases from Input Text to "true". The default is "false", which works well when running the job over a signals collection.
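
    For example, a minimal job configuration covering these fields might look like the following sketch. The type field is always sip; the collection and field names here are hypothetical placeholders:

        {
          "id": "mystore_phrase_extraction",
          "type": "sip",
          "trainingCollection": "mystore_signals",
          "fieldToVectorize": "query_s",
          "outputCollection": "mystore_query_rewrite_staging"
        }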

    Output documents

    By default, the job outputs only the phrases found in the original documents. In each row of the phrases output, these fields are most useful (see the example after this list):

    • The phrase itself is in the phrases_s field, which can be used for faceting.

    • The likelihood_d field gives the likelihood that the phrase is legitimate, from 0 to infinity.

      Low-probability phrases are automatically trimmed from the results.

    • When a phrase’s likelihood value is ambiguous, the review field is set to "true" to indicate that the phrase should be reviewed.

    • A phrase_count field indicates the number of instances of the phrase in the input collection.
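
    For example, a single phrase document might look like this sketch, with illustrative values throughout:

        {
          "id": "phrase-doc-001",
          "aggr_id_s": "mystore_phrase_extraction",
          "doc_type_s": "key_phrases",
          "phrases_s": "red running shoes",
          "word_num_i": 3,
          "phrase_count": 137,
          "likelihood_d": 4.2,
          "review": "false",
          "score": 1
        }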

    The complete list of output fields is shown below.

    Output fields

    aggr_id_s

    The name of the Phrase Extraction job that generated this document.

    doc_type_s

    This is always key_phrases for documents generated by a Phrase Extraction job.

    id

    A unique ID for this document.

    input_collection

    The collection used for this job’s input.

    likelihood_d

    The likelihood that the value of phrases_s is a legitimate phrase, from 0 to infinity.

    phrase_count

    The number of occurrences of this phrase in the input collection.

    phrases_s

    The phrase detected by the job.

    review

    "True" indicates that this may not be a valid phrase and should be reviewed.

    score

    This is always "1".

    timestamp

    The date and time when the document was generated.

    word_num_i

    The number of words in this phrase.

    _version_

    An internal Solr field used for partial updates.

    If the attachPhrases/Extract Key Phrases from Input Text parameter is set to "true", then the job also outputs the original documents from the input collection with an appended field, phrases_extracted_tt, that lists the extracted phrases from this document.

    Distinguish the phrases output from the original document output by the doc_type_s field, which has one of these values:

    • key_phrases denotes phrases output.

    • original_doc_with_phrases denotes the original documents.
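
    For example, with attachPhrases enabled, an original document written back to the output collection might look like this sketch (body_t is a hypothetical source field):

        {
          "id": "doc-42",
          "doc_type_s": "original_doc_with_phrases",
          "body_t": "Our red running shoes are back in stock.",
          "phrases_extracted_tt": ["red running shoes"]
        }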

    Use this job when you want to identify statistically significant phrases in your content.

    id - string (required)

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes:

    • key (required) - display name: Parameter Name; type: string

    • value - display name: Parameter Value; type: string

    trainingCollection - string (required)

    Solr Collection containing labeled training data

    >= 1 characters

    fieldToVectorize - string (required)

    Solr field containing text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2 etc.

    >= 1 characters
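
    For example, to weight a hypothetical title field twice as heavily as a description field:

        { "fieldToVectorize": "title_t:2.0,description_t:1.0" }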

    dataFormat - string (required)

    Spark-compatible format that contains training data (such as 'solr', 'parquet', or 'orc')

    >= 1 characters

    Default: solr

    trainingDataFrameConfigOptions - object

    Additional spark dataframe loading configuration options

    trainingDataFilterQuery - string

    Solr query to use when loading training data if using Solr

    Default: *:*
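
    For example, assuming the signals include a type field that distinguishes signal types, you could train only on click signals:

        { "trainingDataFilterQuery": "type:click" }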

    sparkSQL - string

    Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input

    Default: SELECT * from spark_input
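
    For example, to keep only input rows whose hypothetical count_i field is greater than 1:

        { "sparkSQL": "SELECT * FROM spark_input WHERE count_i > 1" }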

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    Seed used for deterministic pseudorandom number generation

    Default: 8180

    outputCollection - string

    Solr Collection to store extracted phrases; defaults to the query_rewrite_staging collection for the associated app.

    dataOutputFormat - string

    Spark-compatible output format (such as 'solr' or 'parquet')

    >= 1 characters

    Default: solr

    sourceFields - string

    Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

    partitionCols - string

    If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes:

    • key (required) - display name: Parameter Name; type: string

    • value - display name: Parameter Value; type: string

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:

    • key (required) - display name: Parameter Name; type: string

    • value - display name: Parameter Value; type: string

    ngramSize - integer

    The number of words in the ngrams to consider as SIPs (statistically significant phrases).

    >= 2

    <= 5

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 3

    minmatch - integer

    The number of times a phrase must occur to be considered. Note: if the input is not signals data, reduce this number, for example to 5.

    >= 1

    exclusiveMinimum: false

    Default: 100

    analyzerConfig - string (required)

    The style of text analyzer you would like to use.

    Default:

        {
          "analyzers": [
            {
              "name": "StdTokLowerStop",
              "charFilters": [ { "type": "htmlstrip" } ],
              "tokenizer": { "type": "standard" },
              "filters": [ { "type": "lowercase" } ]
            }
          ],
          "fields": [
            { "regex": ".+", "analyzer": "StdTokLowerStop" }
          ]
        }

    attachPhrases - boolean

    When enabled, the job associates the extracted phrases with each source document and writes those documents back to the output collection. If the input data is signals, it is suggested to leave this option off. Also, this option currently cannot be enabled when writing to a _query_rewrite_staging collection.

    Default: false

    minLikelihood - number

    Phrases with a likelihood below this threshold are not written to the output of this job.

    enableAutoPublish - boolean

    If true, automatically publishes rewrites for rules. The default is false, to allow for initial human review.

    Default: false

    sparkPartitions - integer

    Spark will re-partition the input to have this number of partitions. Increase for greater parallelism

    Default: 200

    type - string (required)

    Default: sip

    Allowed values: sip
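
    Putting several of these parameters together, a configuration for running the job over a content collection (rather than signals) might look like this sketch. The collection and field names are hypothetical; the output collection is not a _query_rewrite_staging collection because attachPhrases is enabled, and minmatch is lowered because the input is not signals data:

        {
          "id": "products_phrase_extraction",
          "type": "sip",
          "trainingCollection": "products",
          "fieldToVectorize": "description_t",
          "outputCollection": "products_phrases",
          "attachPhrases": true,
          "minmatch": 5,
          "ngramSize": 3
        }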