Fusion 5.12

    Synonym and Similar Queries Detection Jobs

    Use this job to generate pairs of synonyms and pairs of similar queries. Two words are considered potential synonyms when they are used in a similar context in similar queries.

    For best job speed and to avoid memory issues, use aggregated signals instead of raw signals as input for this job.

    You can review, edit, deploy, or delete output from this job using the Query Rewriting UI.

    Output from the Token and Phrase Spell Correction job and the Phrase Extraction job can be used as input for this job.

    Input

    This job takes one or more of the following as input:

    Signal data

    This input is required; additional input is optional. Signal data can be either raw or aggregated. The job runs faster using aggregated signals. When raw signals are used as input, this job performs the aggregation.

    Use the trainingCollection/Input Collection parameter to specify the collection that contains the signal data.

    Misspelling job results

    Token and Phrase Spell Correction job results can be used to avoid finding mainly misspellings, or mixing synonyms with misspellings.

    Use the misspellingCollection/Misspelling Job Result Collection parameter to specify the collection that contains these results.

    Phrase detection job results

    Phrase Extraction job results can be used to find synonyms with multiple tokens, such as "lithium ion" and "ion battery".

    Use the keyPhraseCollection/Phrase Extraction Job Result Collection parameter to specify the collection that contains these results.

    Keywords

    A keywords list in the blob store can serve as a blacklist to prevent common attributes from being identified as potential synonyms.

    The list can include common attributes such as color, brand, material, and so on. For example, by including color attributes you can prevent "red" and "blue" from being identified as synonyms due to their appearance in similar queries such as "red bike" and "blue bike".

    The keywords file is in CSV format with two fields: keyword and type. You can add your custom keywords to this list with the type value "stopword". An example file is shown below:

    keyword,type
    cu,stopword
    ft,stopword
    mil,stopword
    watt,stopword
    wat,stopword
    foot,stopword
    feet,stopword
    gal,stopword
    unit,stopword
    lb,stopword
    wt,stopword
    cc,stopword
    cm,stopword
    kg,stopword
    km,stopword
    oz,stopword
    nm,stopword
    qt,stopword
    sale,stopword
    on sale,stopword
    for sale,stopword
    clearance,stopword
    gb,stopword
    gig,stopword
    color,stopword
    blue,stopword
    white,stopword
    black,stopword
    ivory,stopword
    grey,stopword
    brown,stopword
    silver,stopword
    light blue,stopword
    light ivory,stopword
    light grey,stopword
    light brown,stopword
    light silver,stopword
    light green,stopword

    Use the keywordsBlobName/Keywords Blob Store parameter to specify the name of the blob that contains this list.
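    Before uploading a keywords file, it can help to check locally that it follows the two-column format described above. The following is a minimal sketch only: the sample text and the load_keywords helper are hypothetical, and the job's own parsing of the blob may differ.

```python
import csv
import io

# Hypothetical excerpt of a keywords file in the keyword,type format.
SAMPLE_KEYWORDS_CSV = """keyword,type
cu,stopword
on sale,stopword
light blue,stopword
"""


def load_keywords(csv_text):
    """Return the set of keywords whose type is 'stopword'.

    Raises ValueError if the header is not exactly keyword,type.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["keyword", "type"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    keywords = set()
    for row in reader:
        # Short rows yield None for missing columns, so guard before strip().
        keyword = (row.get("keyword") or "").strip().lower()
        row_type = (row.get("type") or "").strip()
        if keyword and row_type == "stopword":
            keywords.add(keyword)
    return keywords


keywords = load_keywords(SAMPLE_KEYWORDS_CSV)
```

    Multi-word entries such as "on sale" are kept as single keywords because the CSV reader splits only on commas.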

    Custom Synonyms

    For some deployments there might be a need to use existing synonym definitions. You can import existing synonyms into the Synonym and Similar Queries Detection job as a text file. Upload your synonyms text file to the blob store and reference that file when creating the job.

    Output

    The output collection contains two tables distinguished by the doc_type field.

    The similar queries table

    If query leads to clicks on documents 1, 2, 3, and 4, and similar_query leads to clicks on documents 2, 3, 4, and 5, then there is sufficient overlap between the two queries to consider them similar.

    A statistic is constructed to compute similarities based on overlap counts and query counts. The resulting table consists of documents whose doc_type value is "query_rewrite" and type value is "simq".
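    The statistic the job uses is not spelled out here. As an illustration of the idea only, the sketch below measures click overlap with the Jaccard index (shared documents divided by all documents clicked for either query). The click_overlap function is hypothetical and is a stand-in, not the job's actual formula.

```python
def click_overlap(docs_a, docs_b):
    """Jaccard overlap between the sets of documents clicked for two queries."""
    a, b = set(docs_a), set(docs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# The example from the text: query clicks on documents 1-4,
# similar_query clicks on documents 2-5.
score = click_overlap([1, 2, 3, 4], [2, 3, 4, 5])
```

    For this example, three documents are shared out of five distinct documents, so the sketch gives 3/5 = 0.6.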

    The similar queries table contains similar query pairs with these fields:

    query

    The first half of the two-query pair.

    similar_query

    The second half of the two-query pair.

    similarity

    A score between 0 and 1 indicating how similar the two queries are.

    All similarity values are greater than or equal to the configured Query Similarity Threshold to ensure that only high-similarity queries are kept and used as input to find synonyms.

    query_count

    The number of clicks received by the query in the query field.

    To save computation time, only queries with at least as many clicks as the configured Query Clicks Threshold parameter are kept and used as input to find synonyms.

    similar_query_count

    The number of clicks received by the query in the similar_query field.

    The synonyms table

    The synonyms table consists of documents whose doc_type value is "query_rewrite" and type value is "synonym":

    surface_form

    The first half of the two-synonym pair.

    synonym

    The second half of the two-synonym pair.

    context

    If there are more than two words or phrases with the same meaning, such as "macbook, apple mac, mac", then this field shows the group to which this pair belongs.

    similarity

    A similarity score to measure confidence.

    count

    The number of different contexts in which this synonym pair appears.

    The bigger the number, the higher the quality of the pair.

    suggestion

    The algorithm automatically selects context, synonym words or phrases, or the synonym_group, and puts it in this field.

    Use this field as the field to review.

    category

    Whether the synonym is actually a misspelling.
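    Because both tables share one output collection, a consumer has to separate them by the doc_type and type fields described above. The following is a minimal sketch that does this over documents already fetched as dictionaries. The sample documents and their values are hypothetical, and the split_by_type helper is not part of the product.

```python
from collections import defaultdict


def split_by_type(docs):
    """Group query_rewrite documents by their type value (e.g. 'simq', 'synonym')."""
    tables = defaultdict(list)
    for doc in docs:
        if doc.get("doc_type") == "query_rewrite":
            tables[doc.get("type")].append(doc)
    return dict(tables)


# Hypothetical output documents; field names follow the tables above.
sample_docs = [
    {"doc_type": "query_rewrite", "type": "simq",
     "query": "red bike", "similar_query": "red bicycle", "similarity": 0.8},
    {"doc_type": "query_rewrite", "type": "synonym",
     "surface_form": "bike", "synonym": "bicycle", "count": 3},
    {"doc_type": "other", "type": "synonym"},  # ignored: not a query rewrite
]

tables = split_by_type(sample_docs)
similar_queries = tables.get("simq", [])
synonyms = tables.get("synonym", [])
```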

    Use this job to generate synonym and similar query pairs.

    id - string, required

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_).

    <= 128 characters

    Match pattern: ^[A-Za-z0-9_\-]+$
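    The length limit and match pattern above can be checked before a job is created. This is a minimal sketch, and the is_valid_job_id helper is hypothetical. It uses re.fullmatch, which is slightly stricter than the anchored pattern because it also rejects an ID that ends in a newline.

```python
import re

# Character class taken from the documented match pattern.
JOB_ID_PATTERN = re.compile(r"[A-Za-z0-9_\-]+")
MAX_JOB_ID_LENGTH = 128


def is_valid_job_id(job_id):
    """Return True if job_id satisfies the documented length and pattern rules."""
    if len(job_id) > MAX_JOB_ID_LENGTH:
        return False
    return JOB_ID_PATTERN.fullmatch(job_id) is not None
```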

    trainingCollection - string, required

    Collection containing queries, document IDs, and event counts. Can be either a signal aggregation collection or a raw signals collection.

    >= 1 characters

    fieldToVectorize - string, required

    Field containing queries. Change to query_s to use an aggregation collection.

    >= 1 characters

    Default: query

    dataFormat - string

    Spark-compatible format of the training data (such as 'solr', 'hdfs', 'file', or 'parquet').

    Default: solr

    Allowed values: solr, hdfs, file, parquet

    trainingDataFrameConfigOptions - object

    Additional Spark DataFrame loading configuration options.

    trainingDataFilterQuery - string

    Solr query to additionally filter the input collection.

    >= 3 characters

    Default: *:*

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    For any deterministic pseudorandom number generation

    Default: 1234

    outputCollection - string

    Collection to store synonym and similar query pairs.

    misspellingCollection - string

    Solr collection containing the reviewed results of the Token and Phrase Spell Correction job. Defaults to the query_rewrite_staging collection for the app.

    misspellingsFilterQuery - string

    Solr query to additionally filter the misspelling results. Defaults to reading all approved spell corrections.

    Default: type:spell

    keyPhraseCollection - string

    Solr collection containing the reviewed results of the Phrase Extraction job. Defaults to the query_rewrite_staging collection for the app.

    keyPhraseFilterQuery - string

    Solr query to additionally filter the phrase extraction results. Defaults to reading all approved phrases.

    Default: type:phrase

    countField - string, required

    Solr field containing the number of events (for example, the number of clicks). Change to aggr_count_i to use aggregated signals.

    Default: count_i

    docIdField - string, required

    Solr field containing the ID of the document that the user clicked. Change to doc_id_s for an aggregation collection.

    Default: doc_id

    overlapThreshold - number

    The threshold above which query pairs are considered similar. Lowering this value yields more synonym pairs, but their quality may be reduced.

    Default: 0.5

    similarityThreshold - number

    The threshold above which synonym pairs are considered similar. Lowering this value yields more synonym pairs, but their quality may be reduced.

    Default: 0.01

    minQueryCount - integer

    The minimum number of clicked documents needed for comparing queries.

    Default: 5

    keywordsBlobName - string

    Name of the keywords blob resource. Typically, this is a CSV file uploaded to the blob store in the keyword,type format described under Keywords.

    analyzerConfigQuery - string

    LuceneTextAnalyzer schema for tokenizing queries (JSON-encoded)

    >= 1 characters

    Default: { "analyzers": [ { "name": "LetterTokLowerStem","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "letter" },"filters": [{ "type": "lowercase" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "KStem" }] }],"fields": [{ "regex": ".+", "analyzer": "LetterTokLowerStem" } ]}

    enableAutoPublish - boolean

    If true, automatically publishes rewrites for rules. The default is false to allow for an initial human review.

    Default: false

    type - string, required

    Default: synonymDetection

    Allowed values: synonymDetection
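
    Taken together, the parameters above can be assembled into a job configuration. The following is a minimal sketch that builds one for aggregated signals, using the field names the parameter descriptions recommend for aggregation collections (query_s, aggr_count_i, doc_id_s) and the documented defaults elsewhere. The build_job_config helper and the collection names are hypothetical, and how the configuration is submitted to Fusion is not covered here.

```python
import json


def build_job_config(job_id, training_collection, output_collection):
    """Assemble a synonymDetection job configuration for aggregated signals."""
    return {
        "type": "synonymDetection",
        "id": job_id,
        "trainingCollection": training_collection,
        "outputCollection": output_collection,
        # Field names suggested above for aggregation collections.
        "fieldToVectorize": "query_s",
        "countField": "aggr_count_i",
        "docIdField": "doc_id_s",
        # Documented defaults, written out for clarity.
        "dataFormat": "solr",
        "overlapThreshold": 0.5,
        "similarityThreshold": 0.01,
        "minQueryCount": 5,
        "enableAutoPublish": False,
    }


# Hypothetical collection names for an app called "myapp".
config = build_job_config(
    "synonym-detection",
    "myapp_signals_aggr",
    "myapp_query_rewrite_staging",
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

    Leaving enableAutoPublish set to false keeps the output in review, which matches the default described above.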