Fusion 5.9
    Token and Phrase Spell Correction Job

    Detect misspellings in queries or documents using the numbers of occurrences of words and phrases.

    Lucidworks offers free training to help you get started with Fusion. Check out the Resolving Underperforming Queries course, which focuses on tips for tuning, running, and cleaning up Fusion’s query rewrite jobs:

    Resolving Underperforming Queries

    Visit the LucidAcademy to see the full training catalog.

    Default job name

    COLLECTION_NAME_spell_correction

    Input

    Raw signals (the COLLECTION_NAME_signals collection by default)

    Output

    Synonyms (the COLLECTION_NAME_query_rewrite_staging collection by default)

    Required signals fields:

    query               required
    count_i             required
    type                required
    timestamp_tdt
    user_id
    doc_id
    session_id
    fusion_query_id

    This job extracts tail tokens (one word) and phrases (two words) and finds similarly-spelled head tokens and phrases. For example, if two queries are spelled similarly, but one leads to a lot of traffic (head) and the other leads to a little or zero traffic (tail), then it is likely that the tail query is misspelled and the head query is its correction.
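    The head/tail idea can be sketched in a few lines of Python. This is a simplified illustration, not the job's actual implementation: the function names, the fixed head_min and tail_max cutoffs, and the sample counts other than the "laptop battery" pair are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def suggest_corrections(
    counts: Dict[str, int], head_min: int, tail_max: int, max_distance: int
) -> List[Tuple[str, str]]:
    """Pair each low-traffic (tail) query with similarly spelled high-traffic (head) queries."""
    heads = [q for q, n in counts.items() if n >= head_min]
    tails = [q for q, n in counts.items() if n <= tail_max]
    pairs = []
    for tail in tails:
        for head in heads:
            if 0 < edit_distance(tail, head) <= max_distance:
                pairs.append((tail, head))
    return pairs


# Hypothetical traffic counts; only the "laptop battery" pair comes from this topic.
query_counts = {"laptop battery": 68648960, "laptop baytery": 32768, "usb cable": 500000}
print(suggest_corrections(query_counts, head_min=100000, tail_max=50000, max_distance=2))
```

    In this sketch the low-traffic query "laptop baytery" is paired with the high-traffic query "laptop battery" because they are one edit apart.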

    If several matching head tokens are found for each tail token, the job can pick the best correction using multiple configurable criteria.

    You can review, edit, deploy, or delete output from this job using the Query Rewriting UI.

    Misspelled terms are completely replaced by their corrected terms. If you want to expand the query to include all alternative terms, set the synonyms to bi-directional. See Synonym Detection for more information.

    This job’s output, and output from the Phrase Extraction job, can be used as input for the Synonym Detection job.

    Solr treats spelling corrections as synonyms. See the blog post Multi-Word Synonyms: Solr Adds Query-Time Support for more details.

    1. Create a job

    Create a Token and Phrase Spell Correction job in the Jobs Manager.

    How to create a new job
    1. In the Fusion workspace, navigate to Collections > Jobs.

    2. Click Add and select the job type Token and phrase spell correction.

      The New Job Configuration panel appears.

    2. Configure the job

    Use the information in this section to configure the Token and Phrase Spell Correction job.

    Required configuration

    The configuration must specify:

    • Spark Job ID. Used in the API to reference the job. Maximum 63 characters; allowed characters are letters, hyphen (-), and underscore (_).

    • INPUT COLLECTION. The trainingCollection parameter, which can point to signal data or non-signal data. For signal data, select Input is Signal Data (signalDataIndicator). Signals can be raw (from the _signals collection) or aggregated (from the _signals_aggr collection).

    • INPUT FIELD. The fieldToVectorize parameter.

    • COUNT FIELD. The countField parameter.

      For example, if signal data follows the default Fusion setup, then count_i is the field that records the count of raw signals and aggr_count_i is the field that records the count after aggregation.

    See Configuration properties for more information.

    Event types

    The spell correction job lets you analyze query performance based on two different events:

    • The main event (the Main Event Type/mainType parameter)

    • The filtering/secondary event (the Filtering Event Type/filterType parameter)

      If you only have one event type, leave this parameter empty.

    For example, if you specify the main event type to be click with a minimum count of 0 and the filtering event type to be query with a minimum count of 20, then the job:

    • Filters on the queries that get searched at least 20 times.

    • Checks among those popular queries to determine which ones didn’t get clicked at all, or were only clicked a few times.
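    The two-step filtering described above can be sketched as follows. This is an illustration only: the function name, the few_main_events cutoff, and the sample signals are assumptions, and the actual job decides what counts as "a few" clicks with its own thresholds.

```python
from collections import Counter
from typing import Dict, Iterable, List


def underperforming_queries(
    signals: Iterable[Dict],
    main_type: str = "click",
    filter_type: str = "query",
    min_count_filter: int = 20,
    few_main_events: int = 2,  # hypothetical cutoff for "only clicked a few times"
) -> List[str]:
    """Return popular queries (by filter event) that received few main events."""
    main_counts: Counter = Counter()
    filter_counts: Counter = Counter()
    for signal in signals:
        if signal["type"] == main_type:
            main_counts[signal["query"]] += signal["count_i"]
        elif signal["type"] == filter_type:
            filter_counts[signal["query"]] += signal["count_i"]
    # Step 1: keep queries searched at least min_count_filter times.
    popular = [q for q, n in filter_counts.items() if n >= min_count_filter]
    # Step 2: among those, keep queries with no clicks or only a few clicks.
    return sorted(q for q in popular if main_counts[q] <= few_main_events)


# Hypothetical signals using the query, type, and count_i fields.
sample_signals = [
    {"query": "laptop baytery", "type": "query", "count_i": 25},
    {"query": "laptop baytery", "type": "click", "count_i": 1},
    {"query": "laptop battery", "type": "query", "count_i": 900},
    {"query": "laptop battery", "type": "click", "count_i": 400},
    {"query": "rare thing", "type": "query", "count_i": 3},
]
print(underperforming_queries(sample_signals))
```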

    Spell check documents

    If you unselect the Input is Signal Data checkbox to indicate finding misspellings from content documents rather than signals, then you do not need to specify the following parameters:

    • Count Field

    • Main Event Type

    • Filtering Event Type

    • Field Name of Signal Type

    • Minimum Main Event Count

    • Minimum Filtering Event Count

    Use a custom dictionary

    You can upload a custom dictionary of terms that are specific to your data, and specify it using the Dictionary Collection (dictionaryCollection) and Dictionary Field (dictionaryField) parameters. For example, in an e-commerce use case, you can use the catalog terms as the custom dictionary by specifying the product catalog collection as the dictionary collection and the product description field as the dictionary field.

    Example configuration

    This is an example configuration:

    [Image: Spell correction job configuration]
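    As a text alternative, a configuration can be assembled as a JSON object from the parameters listed under Configuration properties. The sketch below uses Python to build and print such an object. The collection names and the dictionary field are hypothetical, and the values shown are the documented defaults rather than tuned recommendations.

```python
import json

# Keys are the parameter names from the Configuration properties in this topic.
job_config = {
    "type": "tokenPhraseSpellCorrection",
    "id": "ecom_spell_check",
    "trainingCollection": "ecom_signals",            # hypothetical collection name
    "fieldToVectorize": "query",
    "dataFormat": "solr",
    "countField": "count_i",
    "signalDataIndicator": True,
    "mainType": "click",
    "filterType": "response",
    "signalTypeField": "type",
    "minCountMain": 1,
    "minCountFilter": 10,
    "dictionaryCollection": "ecom_catalog",          # hypothetical collection name
    "dictionaryField": "description_t",              # hypothetical field name
    "outputCollection": "ecom_query_rewrite_staging",  # hypothetical collection name
}

print(json.dumps(job_config, indent=2))
```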

    When you have configured the job, click Save to save the configuration.

    3. Run the job

    If you are finding spelling corrections in aggregated data, you need to run an aggregation job before running the Token and Phrase Spell Correction job. You do not need to run a Head/Tail Analysis job because the Token and Phrase Spell Correction job does the head/tail processing it requires.

    How to run the job
    1. In the Fusion workspace, navigate to Collections > Jobs.

    2. Select the job from the job list.

    3. Click Run.

    4. Click Start.

    4. Analyze job output

    After the job finishes, misspellings and corrections are written to the query_rewrite_staging collection by default. You can change this by setting the outputCollection parameter.

    An example record is as follows:

    correction_s                  laptop battery
    mis_string_len_i              14
    misspelling_s                 laptop baytery
    aggr_job_id_s                 162fcf94b20T3704c333
    score                         1
    collation_check_s             token correction included
    corCount_misCount_ratio_d     2095
    sound_match_b                 true
    id                            bf79c43b-fc6d-43a7-931e-185fdac5b624
    aggr_type_s                   tokenPhraseSpellCorrection
    aggr_id_s                     ecom_spell_check
    correction_types_s            phrase => phrase
    cor_count_i                   68648960
    suggested_correction_s        baytery=>battery
    cor_string_len_i              14
    token_wise_correction_s       baytery=>battery
    cor_token_size_i              2
    edit_dist_i                   1
    timestamp_tdt                 2018-04-25T13:23:40.728Z
    mis_count_i                   32768
    lastChar_match_b              true
    mis_token_size_i              2
    token_corr_for_phrase_cnt_i   1

    For easy evaluation, you can export the result output to a CSV file.

    [Image: Spellcheck output]

    5. Use spell correction results

    You can use the resulting corrections in various ways. For example:

    • Put misspellings into the synonym list to perform auto-correction.

    • Help evaluate and guide the Solr spellcheck configuration.

    • Put misspellings into typeahead or autosuggest lists.

    • Perform document cleansing (for example, clean a product catalog or medical records) by mapping misspellings to corrections.
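    For the synonym-list use case, the sketch below turns output records into Solr synonym rules, using "=>" for one-way replacement and a comma-separated pair for bi-directional synonyms. The function name is hypothetical, and this is only an illustration of the mapping; the Query Rewriting UI remains the documented way to review and deploy corrections.

```python
from typing import Dict, Iterable, List


def to_synonym_rules(records: Iterable[Dict[str, str]], bidirectional: bool = False) -> List[str]:
    """Format misspelling/correction pairs as Solr synonym rules."""
    rules = []
    for record in records:
        misspelling = record["misspelling_s"]
        correction = record["correction_s"]
        if bidirectional:
            # Equivalent synonyms: either term expands to both.
            rules.append(f"{misspelling}, {correction}")
        else:
            # Explicit mapping: the misspelling is replaced by the correction.
            rules.append(f"{misspelling} => {correction}")
    return rules


records = [{"misspelling_s": "laptop baytery", "correction_s": "laptop battery"}]
print(to_synonym_rules(records))
print(to_synonym_rules(records, bidirectional=True))
```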

    Useful output fields

    In the job output, you generally only need to analyze the suggested_correction field, which suggests whether to use token correction or whole-phrase correction. If the confidence of the correction is not high, then the job labels the pair as "review" in this field. Pay special attention to the output records with the "review" label.

    With the output in a CSV file, you can sort by mis_string_len (descending) and edit_dist (ascending) to position more probable corrections at the top. You can also sort by the ratio of correction traffic to misspelling traffic (the corCount_misCount_ratio field) and keep only the corrections that redirect queries to much higher-traffic spellings.
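    The sorting and filtering can also be scripted. The sketch below assumes the exported CSV keeps the Solr field names shown in the example record (mis_string_len_i, edit_dist_i, corCount_misCount_ratio_d). The function name, the min_ratio parameter, and all sample rows except the first are hypothetical.

```python
import csv
import io
from typing import Dict, List


def rank_corrections(csv_text: str, min_ratio: float = 0.0) -> List[Dict[str, str]]:
    """Keep rows at or above min_ratio, then sort by misspelling length (desc) and edit distance (asc)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kept = [r for r in rows if float(r["corCount_misCount_ratio_d"]) >= min_ratio]
    kept.sort(key=lambda r: (-int(r["mis_string_len_i"]), int(r["edit_dist_i"])))
    return kept


sample_csv = """misspelling_s,correction_s,mis_string_len_i,edit_dist_i,corCount_misCount_ratio_d
iphnoe,iphone,6,2,40
laptop baytery,laptop battery,14,1,2095
samsng,samsung,6,1,3
"""

for row in rank_corrections(sample_csv, min_ratio=10):
    print(row["misspelling_s"], "=>", row["correction_s"])
```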

    For phrase misspellings, the misspelled tokens are separated out and put in the token_wise_correction field. If the associated token correction is already included in the one-word correction list, then the collation_check field is labeled as "token correction included." You can choose to drop those phrase misspellings to reduce duplication.

    Fusion counts how many phrase corrections can be solved by the same token correction and puts the number into the token_corr_for_phrase_cnt field. For example, if the misspelled versions of both "outdoor surveillance" and "surveillance camera" can be solved by the same token correction to "surveillance", then this number is 2, which provides some confidence for dropping such phrase corrections and further confirms that the token correction is legitimate.

    You might also see cases where the token-wise correction is not included in the list. For example, "xbow" to "xbox" is not included in the list because it can be dangerous to allow an edit distance of 1 in a word of length 4. But if multiple phrase corrections can be made by changing this token, then you can add this token correction to the list.

    Phrase corrections with a value of 1 for token_corr_for_phrase_cnt and with collation_check labeled as "token correction not included" are potentially problematic.
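    A small check like the following can flag such records for manual review. It is a sketch that assumes the records are available as dictionaries keyed by the Solr field names from the example record; the function name and the second sample record are hypothetical.

```python
from typing import Any, Dict


def needs_review(record: Dict[str, Any]) -> bool:
    """Flag phrase corrections backed by a single, unlisted token correction."""
    return (
        record.get("token_corr_for_phrase_cnt_i") == 1
        and record.get("collation_check_s") == "token correction not included"
    )


# Values taken from the example record in this topic.
listed = {"token_corr_for_phrase_cnt_i": 1, "collation_check_s": "token correction included"}
# Hypothetical record where the token correction is not in the one-word list.
unlisted = {"token_corr_for_phrase_cnt_i": 1, "collation_check_s": "token correction not included"}

print(needs_review(listed), needs_review(unlisted))
```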

    Fusion labels misspellings caused by misplaced whitespace with "combine/break words" in the correction_types field. If there is a user-provided dictionary to check against, and the spellings with and without the whitespace are both in the dictionary, you can treat these pairs as bi-directional synonyms ("combine/break words (bi-direction)" in the correction_types field).

    The sound_match and lastChar_match fields also provide useful information.

    Job tuning

    The job’s default configuration is conservative, designed for higher accuracy and lower output. To produce a higher volume of output, consider giving more permissive values to the parameters below. Likewise, give them more restrictive values if you are getting too many results with low accuracy.

    When tuning these values, always test the new configuration in a non-production environment before deploying it in production.

    trainingDataFilterQuery/Data filter query

    See Event types above, then adjust this value to reflect the secondary event for your search application. To query all data, set this to *:*.

    minCountFilter/Minimum Filtering Event Count

    Lower this value to include less-frequent misspellings based on the data filter query.

    maxDistance/Maximum Edit Distance

    Raise this value to increase the number of potentially-related tokens and phrases detected.

    minMispellingLen/Minimum Length of Misspelling

    Lower this value to include shorter misspellings (which are harder to correct accurately).

    Query rewrite jobs post-processing cleanup

    To perform more extensive cleanup of query rewrites, complete the procedures in Query Rewrite Jobs Post-processing Cleanup.

    Use this job to compute token-level and phrase-level spell corrections that you can use in your synonym list.

    id - string, required

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
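    To check a candidate job ID before submitting a configuration, the length limit and match pattern above can be applied directly. The sketch assumes the pattern must match the entire ID; the function name is hypothetical.

```python
import re

# Pattern copied from the id property above.
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_ID_LENGTH = 63


def is_valid_job_id(job_id: str) -> bool:
    """Check the length limit and require the pattern to match the whole ID (assumption)."""
    return len(job_id) <= MAX_ID_LENGTH and ID_PATTERN.fullmatch(job_id) is not None


for candidate in ["ecom_spell_check", "1job", "my job"]:
    print(candidate, is_valid_job_id(candidate))
```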

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes: {
      key (required): { display name: Parameter Name, type: string }
      value: { display name: Parameter Value, type: string }
    }

    trainingCollection - string, required

    Collection containing search strings and event counts. Should ideally be the signals collection. If an aggregation collection is being used, update the filter query in the advanced options.

    >= 1 characters

    fieldToVectorize - string, required

    Field containing search strings.

    >= 1 characters

    Default: query

    dataFormat - string, required

    Spark-compatible format that contains training data (like 'solr', 'parquet', 'orc' etc)

    >= 1 characters

    Default: solr

    trainingDataFrameConfigOptions - object

    Additional spark dataframe loading configuration options

    trainingDataFilterQuery - string

    Solr query to use when loading training data if using Solr (e.g. type:click OR type:response), Spark SQL expression for all other data sources

    Default: *:*

    sparkSQL - string

    Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input

    Default: SELECT * from spark_input

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    For any deterministic pseudorandom number generation

    Default: 1234

    outputCollection - string

    Collection to store misspelling and correction pairs. Defaults to the query_rewrite_staging collection for the application.

    dataOutputFormat - string

    Spark-compatible output format (like 'solr', 'parquet', etc)

    >= 1 characters

    Default: solr

    partitionCols - string

    If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes: {
      key (required): { display name: Parameter Name, type: string }
      value: { display name: Parameter Value, type: string }
    }

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes: {
      key (required): { display name: Parameter Name, type: string }
      value: { display name: Parameter Value, type: string }
    }

    stopwordsBlobName - string

    Name of stopwords blob resource (.txt or .rtf file uploaded to the blob store). This field is marked for deprecation. Going forward, please specify the stopwords blob name as a luceneSchema property.

    >= 1 characters

    dictionaryCollection - string

    Solr Collection containing dictionary with correct spellings. E.g., product catalog.

    dictionaryField - string

    Solr field containing dictionary text. Multiple fields can be specified using the format: field1,field2 etc.

    countField - string

    Solr field containing query count

    Default: count_i

    mainType - string

    The main signal event type (e.g. click) that the job is based on if input is signal data. E.g., if main type is click, then head and tail tokens/phrases are defined by the number of clicks.

    Default: click

    filterType - string

    The secondary event type (e.g. response) that can be used for filtering out rare searches. Note: To use the default value `response`, make sure the input collection contains signals with type:response. If there is no need to filter on the number of searches, leave this parameter blank.

    Default: response

    signalTypeField - string

    The field name of signal type in the input collection.

    Default: type

    minCountMain - integer

    Minimum number of main events (e.g. clicks after aggregation) necessary for the query to be considered. The job only analyzes queries with a click count greater than or equal to this number.

    Default: 1

    minCountFilter - integer

    Minimum number of filtering events (e.g. searches after aggregation) necessary for the query to be considered. The job only analyzes queries that were issued at least this number of times.

    Default: 10

    dictionaryDataFilterQuery - string

    Solr query to use when loading dictionary data

    Default: *:*

    minPrefix - integer

    The minimum number of starting characters that must match. Note: Setting it to 0 may greatly increase running time.

    exclusiveMinimum: false

    Default: 1

    minMispellingLen - integer

    The minimum length of misspelling to check. A smaller number may lead to problematic corrections. For example, it is hard to find the right correction for a two- or three-character string.

    >= 1

    exclusiveMinimum: false

    Default: 5

    maxDistance - integer

    The maximum edit distance between related tokens or phrases. A larger number leads to a longer correction list but may add lower-quality corrections.

    >= 1

    exclusiveMinimum: false

    Default: 2

    lastCharMatchBoost - number

    When there are multiple possible corrections, we rank corrections based on: editDistBoost / editDist + correctionCountBoost * log(correctionCount) + lastCharMatchBoost * lastCharMatch + soundMatchBoost * soundexMatch. A larger value puts more weight on a last-character match between the misspelling and correction strings.

    Default: 1

    soundMatchBoost - number

    When there are multiple possible corrections, we rank corrections based on: editDistBoost / editDist + correctionCountBoost * log(correctionCount) + lastCharMatchBoost * lastCharMatch + soundMatchBoost * soundexMatch. A larger value puts more weight on a soundex match between the misspelling and correction strings.

    Default: 3

    correctCntBoost - number

    When there are multiple possible corrections, we rank corrections based on: editDistBoost / editDist + correctionCountBoost * log(correctionCount) + lastCharMatchBoost * lastCharMatch + soundMatchBoost * soundexMatch. A larger value puts more weight on the number of occurrences of the correction string.

    Default: 2

    editDistBoost - number

    When there are multiple possible corrections, we rank corrections based on: editDistBoost / editDist + correctionCountBoost * log(correctionCount) + lastCharMatchBoost * lastCharMatch + soundMatchBoost * soundexMatch. A larger value puts more weight on a shorter edit distance.

    Default: 2
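    The ranking formula shared by the four boost parameters can be written out as a small function. This is a sketch under stated assumptions: the logarithm is taken as the natural log, the two match flags are treated as 0 or 1, the correctionCountBoost term is assumed to correspond to the correctCntBoost parameter, and the function name and the "bakery" candidate are hypothetical.

```python
import math


def correction_score(
    edit_dist: int,
    correction_count: int,
    last_char_match: bool,
    sound_match: bool,
    edit_dist_boost: float = 2.0,        # editDistBoost default
    correct_cnt_boost: float = 2.0,      # correctCntBoost default
    last_char_match_boost: float = 1.0,  # lastCharMatchBoost default
    sound_match_boost: float = 3.0,      # soundMatchBoost default
) -> float:
    """Score one candidate correction; a higher score ranks first."""
    return (
        edit_dist_boost / edit_dist
        + correct_cnt_boost * math.log(correction_count)  # natural log is an assumption
        + last_char_match_boost * float(last_char_match)
        + sound_match_boost * float(sound_match)
    )


# Two candidate corrections for the misspelled token "baytery".
battery = correction_score(edit_dist=1, correction_count=68648960, last_char_match=True, sound_match=True)
bakery = correction_score(edit_dist=2, correction_count=1000, last_char_match=True, sound_match=False)
print(battery > bakery)
```

    Raising one boost relative to the others shifts the ranking toward that criterion. For example, a larger edit_dist_boost favors candidates that are fewer edits away.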

    signalDataIndicator - boolean

    Indicates that the input dataset for the spell checker is signal data. If the input data consists of content documents rather than signals, uncheck this option.

    Default: true

    analyzerConfigQuery - string

    LuceneTextAnalyzer schema for tokenization (JSON-encoded)

    >= 1 characters

    Default: { "analyzers": [ { "name": "LetterTokLowerStem","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "letter" },"filters": [{ "type": "lowercase" },{ "type": "KStem" }] }],"fields": [{ "regex": ".+", "analyzer": "LetterTokLowerStem" } ]}

    analyzerConfigDictionary - string

    LuceneTextAnalyzer schema for tokenization (JSON-encoded)

    >= 1 characters

    Default: { "analyzers": [ { "name": "LetterTokLowerStem","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "letter" },"filters": [{ "type": "lowercase" },{ "type": "KStem" }] }],"fields": [{ "regex": ".+", "analyzer": "LetterTokLowerStem" } ]}

    correctionThreshold - number

    The occurrence count ABOVE which tokens or phrases are likely to be correct spellings. This number can be either a fraction (<1.0) to denote a quantile based on the count distribution (shown in the log) or a number (>1.0) to denote the absolute count. A large number may cause performance issues.

    Default: 0.8

    misspellingThreshold - number

    The occurrence count BELOW which tokens or phrases are likely to be misspellings. This number can be either a fraction (<1.0) to denote a quantile based on the count distribution (shown in the log) or a number (>1.0) to denote the absolute count.

    Default: 0.8
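    The two ways of reading correctionThreshold and misspellingThreshold can be illustrated as follows. This sketch is not the job's implementation: the quantile is computed with a simple nearest-rank rule, a value of exactly 1.0 is treated as an absolute count, and the function name and sample counts are hypothetical.

```python
from typing import Sequence


def resolve_threshold(threshold: float, counts: Sequence[int]) -> float:
    """Turn a threshold setting into an absolute occurrence count."""
    if threshold >= 1.0:
        # Values above 1.0 are absolute counts (treating exactly 1.0 this way is an assumption).
        return float(threshold)
    # Fractions are read as a quantile of the observed count distribution.
    ordered = sorted(counts)
    index = min(int(threshold * len(ordered)), len(ordered) - 1)
    return float(ordered[index])


# Hypothetical occurrence counts for ten distinct tokens.
token_counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(resolve_threshold(0.8, token_counts))
print(resolve_threshold(50, token_counts))
```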

    lenScale - integer

    A scaling factor used to normalize the length of query string to compare against edit distances. The filtering is based on if edit_dist <= string_length/length_scale. A large value for this factor leads to a shorter correction list. A small value leads to a longer correction list but may add lower quality corrections.

    Default: 5
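    The length filter can be expressed as a one-line check. The sketch assumes ordinary (non-integer) division and measures length in characters of the misspelled string; the function name is hypothetical.

```python
def passes_length_filter(misspelling: str, edit_dist: int, len_scale: int = 5) -> bool:
    """Apply the documented rule edit_dist <= string_length / length_scale."""
    return edit_dist <= len(misspelling) / len_scale


print(passes_length_filter("laptop baytery", edit_dist=1))  # 1 <= 14 / 5
print(passes_length_filter("xbow", edit_dist=1))            # 1 <= 4 / 5 does not hold
```

    With the default of 5, a four-character token such as "xbow" cannot pass with an edit distance of 1, which is consistent with the "xbow" example under Useful output fields.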

    corMisRatio - number

    Ratio between the correction occurrence count and the misspelling occurrence count. Pairs with a ratio less than or equal to this number are filtered out. A larger number leads to a shorter correction list that may contain higher-quality corrections.

    Default: 3
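    The ratio filter works the same way. In this sketch the function name is hypothetical, and the counts in the first call come from the example record in this topic.

```python
def passes_ratio_filter(cor_count: int, mis_count: int, cor_mis_ratio: float = 3.0) -> bool:
    """Keep a pair only if correction count / misspelling count exceeds corMisRatio."""
    return cor_count / mis_count > cor_mis_ratio


print(passes_ratio_filter(68648960, 32768))  # ratio 2095, kept
print(passes_ratio_filter(30, 10))           # ratio 3, filtered out
```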

    enableAutoPublish - boolean

    If true, automatically publishes rewrites for rules. Default is false to allow for initial human-aided reviewing

    Default: false

    sparkPartitions - integer

    Spark will re-partition the input to have this number of partitions. Increase for greater parallelism

    Default: 200

    type - string, required

    Default: tokenPhraseSpellCorrection

    Allowed values: tokenPhraseSpellCorrection