Default job name | COLLECTION_NAME_synonym_detection |
Input | ● Aggregated signals (the COLLECTION_NAME_signals_aggr collection by default) ● Spell Correction job output (the COLLECTION_NAME_query_rewrite_staging collection by default) ● Phrase Extraction job output (the COLLECTION_NAME_query_rewrite_staging collection by default) |
Output | Synonyms (the COLLECTION_NAME_query_rewrite_staging collection by default) |
| | query | count_i | type | timestamp_tdt | user_id | doc_id | session_id | fusion_query_id |
---|---|---|---|---|---|---|---|---|
Required signals fields: | ✅ | ✅ | ✅ | | | ✅ | | |
Use Synonym Detection
_query_rewrite_staging collection to the _query_rewrite collection when its status has changed to “Approved” and it has been published.
_query_rewrite collection. When you have finished your review, you must click Publish to deploy your changes.
Use the trainingCollection/Input Collection parameter to specify the collection that contains the signal data.
Use the misspellingCollection/Misspelling Job Result Collection parameter to specify the collection that contains these results.
Use the keyPhraseCollection/Phrase Extraction Job Result Collection parameter to specify the collection that contains these results.
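The three collection parameters and their default values can be summarized in a small sketch. This is only an illustration of the naming pattern described in the table at the top of this page; the collection name `products` and the helper `default_synonym_job_params` are hypothetical, and the dictionary is a plain Python structure, not the job's actual configuration payload.

```python
# Hypothetical collection name, used only for illustration.
COLLECTION_NAME = "products"


def default_synonym_job_params(collection: str) -> dict:
    """Return the default collections named in the text for one base collection."""
    staging = f"{collection}_query_rewrite_staging"
    return {
        # Aggregated signals used as the input data.
        "trainingCollection": f"{collection}_signals_aggr",
        # Spell Correction job output.
        "misspellingCollection": staging,
        # Phrase Extraction job output.
        "keyPhraseCollection": staging,
    }


params = default_synonym_job_params(COLLECTION_NAME)
for name, value in params.items():
    print(f"{name} = {value}")
```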
keyword and type. You can add your custom keywords list here with the type value “stopwords”. An example file is shown below:
Use the keywordsBlobName/Keywords Blob Store parameter to specify the name of the blob that contains this list.
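As a rough illustration of a custom keywords list, the sketch below builds a two-column list with the keyword and type fields and the “stopwords” type value mentioned above. The CSV layout with a header row is an assumption, as are the sample words and the helper name `build_keywords_csv`; check the format your deployment expects before uploading a blob.

```python
import csv
import io

# Hypothetical custom keywords. Only the "keyword" and "type" field names
# and the "stopwords" type value come from the text above.
custom_stopwords = ["the", "and", "for"]


def build_keywords_csv(words):
    """Render the words as CSV text with keyword,type columns (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["keyword", "type"])
    for word in words:
        writer.writerow([word, "stopwords"])
    return buffer.getvalue()


keywords_csv = build_keywords_csv(custom_stopwords)
print(keywords_csv)
```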
doc_type field.
For example, if query leads to clicks on documents 1, 2, 3, and 4, and similar_query leads to clicks on documents 2, 3, 4, and 5, then there is sufficient overlap between the two queries to consider them similar.
A statistic is constructed to compute similarities based on overlap counts and query counts. The resulting table consists of documents whose doc_type value is “query_rewrite” and type value is “simq”.
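The click-overlap idea can be sketched with a simple set comparison. The exact statistic the job uses is not specified here, so the example below substitutes Jaccard overlap (shared documents divided by all documents clicked from either query) as a stand-in; the function name `click_overlap_similarity` is hypothetical.

```python
def click_overlap_similarity(clicks_a: set, clicks_b: set) -> float:
    """Jaccard overlap: documents clicked from both queries / documents clicked from either."""
    union = clicks_a | clicks_b
    if not union:
        return 0.0
    return len(clicks_a & clicks_b) / len(union)


# Document IDs from the example in the text.
query_clicks = {1, 2, 3, 4}
similar_query_clicks = {2, 3, 4, 5}

score = click_overlap_similarity(query_clicks, similar_query_clicks)
print(score)  # 3 shared documents out of 5 distinct documents -> 0.6
```

With this stand-in, the two queries share three of five distinct documents, which would pass any similarity threshold set at or below 0.6.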
The similar queries table contains similar query pairs with these fields:
query | The first half of the two-query pair. |
similar_query | The second half of the two-query pair. |
similarity | A score between 0 and 1 indicating how similar the two queries are. All similarity values are greater than or equal to the configured Query Similarity Threshold to ensure that only high-similarity queries are kept and used as input to find synonyms. |
query_count | The number of clicks received by the query in the query field. To save computation time, only queries with at least as many clicks as the configured Query Clicks Threshold parameter are kept and used as input to find synonyms. |
similar_query_count | The number of clicks received by the query in the similar_query field. |
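The effect of the two thresholds on this table can be sketched as a filter over records with the fields listed above. The sample pairs, the threshold values, and the helper `keep_pair` are hypothetical, and applying the clicks threshold to query_count only is an assumption based on the field description.

```python
from dataclasses import dataclass


@dataclass
class SimilarQueryPair:
    query: str
    similar_query: str
    similarity: float
    query_count: int
    similar_query_count: int


def keep_pair(pair, similarity_threshold, clicks_threshold):
    """Keep a pair only if it meets both thresholds (assumed semantics)."""
    return (
        pair.similarity >= similarity_threshold
        and pair.query_count >= clicks_threshold
    )


# Hypothetical rows for illustration only.
pairs = [
    SimilarQueryPair("macbook", "apple mac", 0.6, 120, 80),
    SimilarQueryPair("macbook", "mac mini", 0.2, 120, 40),    # similarity too low
    SimilarQueryPair("rare query", "apple mac", 0.9, 3, 80),  # too few clicks
]

kept = [p for p in pairs if keep_pair(p, similarity_threshold=0.5, clicks_threshold=5)]
for p in kept:
    print(p.query, "->", p.similar_query, p.similarity)
```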
The synonym results are documents whose doc_type value is “query_rewrite” and type value is “synonym”:
surface_form | The first half of the two-synonym pair. |
synonym | The second half of the two-synonym pair. |
context | If there are more than two words or phrases with the same meaning, such as “macbook, apple mac, mac”, then this field shows the group to which this pair belongs. |
similarity | A similarity score to measure confidence. |
count | The number of different contexts in which this synonym pair appears. The bigger the number, the higher the quality of the pair. |
suggestion | The algorithm automatically selects the context, the synonym words or phrases, or the synonym_group, and puts it in this field. Use this field as the one to review. |
category | Whether the synonym is actually a misspelling. |
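Because a higher count indicates a higher-quality pair, one way to prioritize a review is to sort the pairs by count and then by similarity. The sketch below assumes the records have been loaded as plain dictionaries with the field names from the table; the sample values and the helper `review_order` are hypothetical.

```python
# Hypothetical synonym records using the field names from the table above.
synonym_pairs = [
    {"surface_form": "macbook", "synonym": "apple mac", "similarity": 0.8, "count": 5},
    {"surface_form": "macbook", "synonym": "mac", "similarity": 0.9, "count": 2},
    {"surface_form": "laptop", "synonym": "notebook", "similarity": 0.7, "count": 5},
]


def review_order(pairs):
    """Sort by count, then similarity, both descending, so strong pairs come first."""
    return sorted(pairs, key=lambda p: (p["count"], p["similarity"]), reverse=True)


ordered = review_order(synonym_pairs)
for pair in ordered:
    print(pair["surface_form"], "=", pair["synonym"], pair["count"], pair["similarity"])
```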