Synonym and Similar Queries Detection Jobs

Use this job to generate pairs of synonyms and pairs of similar queries. Two words are considered potential synonyms when they are used in a similar context in similar queries.

You can review, edit, deploy, or delete output from this job using the Query Rewriting UI.

Output from the Token and Phrase Spell Correction job and the Phrase Extraction job can be used as input for this job.

Input

This job takes one or more of the following as input:

Signal data

This input is required; additional input is optional. Signal data can be either raw or aggregated. The job runs faster using aggregated signals. When raw signals are used as input, this job performs the aggregation.

Use the trainingCollection/Input Collection parameter to specify the collection that contains the signal data.
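As a rough illustration of what aggregating raw signals means, the sketch below collapses individual click records into one count per query and document pair. The record layout and the field names "query" and "doc_id" are assumptions made for this example, not the job's actual signal schema.

```python
from collections import Counter

# Hypothetical raw click signals: one record per click.
# The field names "query" and "doc_id" are illustrative only.
raw_signals = [
    {"query": "red bike", "doc_id": "doc1"},
    {"query": "red bike", "doc_id": "doc1"},
    {"query": "red bike", "doc_id": "doc2"},
    {"query": "blue bike", "doc_id": "doc2"},
]


def aggregate_clicks(signals):
    """Count clicks per (query, document) pair."""
    return Counter((s["query"], s["doc_id"]) for s in signals)


aggregated = aggregate_clicks(raw_signals)
for (query, doc_id), clicks in sorted(aggregated.items()):
    print(query, doc_id, clicks)
```

Supplying signals that are already in this aggregated form lets the job skip this step, which is why aggregated input runs faster.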

Misspelling job results

Token and Phrase Spell Correction job results can be used to keep misspellings out of the synonym results, so that the output is not dominated by misspellings and does not mix synonyms with misspellings.

Use the misspellingCollection/Misspelling Job Result Collection parameter to specify the collection that contains these results.

Phrase extraction job results

Phrase Extraction job results can be used to find synonyms with multiple tokens, such as "lithium ion" and "ion battery".

Use the keyPhraseCollection/Phrase Extraction Job Result Collection parameter to specify the collection that contains these results.

Keywords

A keywords list in the blob store can serve as a blacklist to prevent common attributes from being identified as potential synonyms.

The list can include common attributes such as color, brand, material, and so on. For example, by including color attributes you can prevent "red" and "blue" from being identified as synonyms due to their appearance in similar queries such as "red bike" and "blue bike".

The keywords file is in CSV format with two fields: keyword and type. You can add your custom keywords to this list with the type value "stopword". An example file is shown below:

keyword,type
cu,stopword
ft,stopword
mil,stopword
watt,stopword
wat,stopword
foot,stopword
feet,stopword
gal,stopword
unit,stopword
lb,stopword
wt,stopword
cc,stopword
cm,stopword
kg,stopword
km,stopword
oz,stopword
nm,stopword
qt,stopword
sale,stopword
on sale,stopword
for sale,stopword
clearance,stopword
gb,stopword
gig,stopword
color,stopword
blue,stopword
white,stopword
black,stopword
ivory,stopword
grey,stopword
brown,stopword
silver,stopword
light blue,stopword
light ivory,stopword
light grey,stopword
light brown,stopword
light silver,stopword
light green,stopword

Use the keywordsBlobName/Keywords Blob Store parameter to specify the name of the blob that contains this list.
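The following is a minimal sketch of how a keywords list in this format could be read and used to discard candidate pairs. It is not the job's implementation: the rule of dropping a pair when either term is listed is an assumption, the helper names are hypothetical, and the CSV content is a shortened, made-up list in the same format.

```python
import csv
import io

# Shortened, made-up keywords list in the keyword,type format.
KEYWORDS_CSV = """keyword,type
red,stopword
blue,stopword
light blue,stopword
"""


def load_keywords(csv_text):
    """Return the set of lowercased keywords from a keyword,type CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["keyword"].strip().lower() for row in reader}


def drop_blocked_pairs(pairs, blocked):
    """Keep pairs in which neither term is a listed keyword (assumed rule)."""
    return [
        (a, b)
        for a, b in pairs
        if a.lower() not in blocked and b.lower() not in blocked
    ]


blocked = load_keywords(KEYWORDS_CSV)
candidates = [("red", "blue"), ("laptop", "notebook")]
kept = drop_blocked_pairs(candidates, blocked)
print(kept)
```

With this list, the "red" and "blue" pair is discarded while the "laptop" and "notebook" pair is kept.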

Output

The output collection contains two tables distinguished by the doc_type_s field.

The similar queries table

If query 1 leads to clicks on documents 1, 2, 3, and 4, and query 2 leads to clicks on documents 2, 3, 4, and 5, then there is sufficient overlap between the two queries to consider them similar.

A statistic is constructed to compute similarities based on overlap counts and query counts. The resulting table consists of documents whose doc_type_s value is "similar queries".
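The exact statistic is not specified here. As a stand-in, the sketch below uses the Jaccard index (shared clicked documents divided by all clicked documents) to show how click overlap can be turned into a score between 0 and 1. The choice of Jaccard and the function name are assumptions for illustration; the job's statistic also takes query counts into account.

```python
def click_overlap(docs_a, docs_b):
    """Jaccard index of two sets of clicked documents (illustrative stand-in)."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    if not union:
        return 0.0  # No clicks on either query: treat as no overlap.
    return len(a & b) / len(union)


# The example from the text: query 1 clicks documents 1-4, query 2 clicks 2-5.
query1_docs = [1, 2, 3, 4]
query2_docs = [2, 3, 4, 5]
score = click_overlap(query1_docs, query2_docs)
print(score)
```

For the two queries in the example, three of the five distinct documents are shared, so this stand-in measure gives 3/5.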

The similar queries table contains similar query pairs with these fields:

query1_s

The first half of the two-query pair.

query2_s

The second half of the two-query pair.

similarity_d

A score between 0 and 1 indicating how similar the two queries are.

All similarity_d values are greater than or equal to the configured Query Similarity Threshold to ensure that only high-similarity queries are kept and used as input to find synonyms.

query1_count_i

The number of clicks received by the query1_s query.

To save computation time, only queries with at least as many clicks as the configured Query Clicks Threshold parameter are kept and used as input to find synonyms.

query2_count_i

The number of clicks received by the query2_s query.
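To make the two thresholds concrete, the sketch below filters rows shaped like the similar queries table. The sample rows and threshold values are invented, and applying the click threshold to both queries in a pair is an assumption about how the rule is enforced.

```python
# Invented sample rows using the field names of the similar queries table.
similar_queries = [
    {"query1_s": "laptop", "query2_s": "notebook", "similarity_d": 0.8,
     "query1_count_i": 120, "query2_count_i": 95},
    {"query1_s": "laptop", "query2_s": "tablet", "similarity_d": 0.1,
     "query1_count_i": 120, "query2_count_i": 300},
    {"query1_s": "sofa", "query2_s": "couch", "similarity_d": 0.9,
     "query1_count_i": 3, "query2_count_i": 4},
]


def keep_pairs(rows, similarity_threshold, clicks_threshold):
    """Keep rows that meet both thresholds (click rule applied to both queries)."""
    return [
        r for r in rows
        if r["similarity_d"] >= similarity_threshold
        and r["query1_count_i"] >= clicks_threshold
        and r["query2_count_i"] >= clicks_threshold
    ]


# Hypothetical threshold values, chosen only for this example.
kept_rows = keep_pairs(similar_queries, similarity_threshold=0.5, clicks_threshold=10)
print([(r["query1_s"], r["query2_s"]) for r in kept_rows])
```

Raising either threshold shrinks the set of query pairs that is passed on to synonym detection.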

The synonyms table

The synonyms table consists of documents whose doc_type_s value is "synonyms", with these fields:

synonym1_s

The first half of the two-synonym pair.

synonym2_s

The second half of the two-synonym pair.

context_s

An example of the context in which the synonym is used.

synonym_group_s

If there are more than two words or phrases with the same meaning, such as "macbook, apple mac, mac", then this field shows the group to which this pair belongs.

context_num_i

The number of different contexts in which this synonym pair appears.

Tip
The bigger the number, the higher the quality of the pair.

similarity_d

A similarity score that measures confidence in the synonym pair.

similar_spell_b

Whether the synonym is actually a misspelling.

suggested_synonym_s

The algorithm automatically selects the context_s value, the synonym words or phrases, or the synonym_group_s value, and places the selection in this field.

Tip
Use this field as the one to review when evaluating the job's output.
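One possible way to order output for review is sketched below. It assumes the output documents have already been fetched as dictionaries. The sample rows, the format of the suggested_synonym_s values, and the function name are invented for the example. The sketch keeps only rows from the synonyms table, skips pairs flagged as misspellings, and puts pairs seen in more contexts first.

```python
# Invented sample rows; real values come from the job's output collection.
rows = [
    {"doc_type_s": "synonyms", "synonym1_s": "mac", "synonym2_s": "macbook",
     "context_num_i": 7, "similar_spell_b": False,
     "suggested_synonym_s": "mac, macbook"},
    {"doc_type_s": "synonyms", "synonym1_s": "sofa", "synonym2_s": "couch",
     "context_num_i": 12, "similar_spell_b": False,
     "suggested_synonym_s": "sofa, couch"},
    {"doc_type_s": "synonyms", "synonym1_s": "labtop", "synonym2_s": "laptop",
     "context_num_i": 5, "similar_spell_b": True,
     "suggested_synonym_s": "labtop, laptop"},
    {"doc_type_s": "similar queries", "query1_s": "laptop",
     "query2_s": "notebook", "similarity_d": 0.8},
]


def review_queue(output_rows):
    """Return synonym rows that are not misspellings, most contexts first."""
    synonyms = [
        r for r in output_rows
        if r.get("doc_type_s") == "synonyms"
        and not r.get("similar_spell_b", False)
    ]
    return sorted(synonyms, key=lambda r: r["context_num_i"], reverse=True)


queue = review_queue(rows)
for row in queue:
    print(row["context_num_i"], row["suggested_synonym_s"])
```

Sorting by context_num_i follows the tip that a larger number of contexts indicates a higher-quality pair.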