# Query-to-Query Session-Based Similarity Jobs

export const schema = {
  "type": "object",
  "title": "Query-to-Query Session Based Similarity",
  "description": "Use this job to batch compute query-query similarities using a co-occurrence-based approach",
  "required": ["id", "trainingCollection", "fieldToVectorize", "dataFormat", "docIdField", "type"],
  "properties": {
    "id": {
      "type": "string",
      "title": "Spark Job ID",
      "description": "The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.",
      "maxLength": 63,
      "pattern": "[a-zA-Z][_\\-a-zA-Z0-9]*[a-zA-Z0-9]?"
    },
    "sparkConfig": {
      "type": "array",
      "title": "Spark Settings",
      "description": "Spark configuration settings.",
      "hints": ["advanced"],
      "items": {
        "type": "object",
        "required": ["key"],
        "properties": {
          "key": {
            "type": "string",
            "title": "Parameter Name"
          },
          "value": {
            "type": "string",
            "title": "Parameter Value"
          }
        }
      }
    },
    "trainingCollection": {
      "type": "string",
      "title": "Input Collection",
      "description": "Collection containing queries, document id and event counts. Can be either signal aggregation collection or raw signals collection."
    },
    "fieldToVectorize": {
      "type": "string",
      "title": "Query Field Name",
      "description": "Field containing queries.",
      "default": "query_s",
      "minLength": 1
    },
    "dataFormat": {
      "type": "string",
      "title": "Data format",
      "description": "Spark-compatible format that contains training data (like 'solr', 'parquet', 'orc' etc)",
      "default": "solr",
      "minLength": 1
    },
    "trainingDataFrameConfigOptions": {
      "type": "object",
      "title": "Dataframe Config Options",
      "description": "Additional spark dataframe loading configuration options",
      "properties": {},
      "additionalProperties": {
        "type": "string"
      },
      "hints": ["advanced"]
    },
    "trainingDataFilterQuery": {
      "type": "string",
      "title": "Data filter query",
      "description": "Solr query to additionally filter the input collection.",
      "default": "*:*",
      "hints": ["dummy"]
    },
    "sparkSQL": {
      "type": "string",
      "title": "Spark SQL filter query",
      "description": "Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input",
      "default": "SELECT * from spark_input",
      "hints": ["code/sql", "advanced"]
    },
    "trainingDataSamplingFraction": {
      "type": "number",
      "title": "Training data sampling fraction",
      "description": "Fraction of the training data to use",
      "default": 1,
      "hints": ["advanced"],
      "maximum": 1,
      "exclusiveMaximum": false
    },
    "randomSeed": {
      "type": "integer",
      "title": "Random seed",
      "description": "For any deterministic pseudorandom number generation",
      "default": 1234,
      "hints": ["advanced"]
    },
    "outputCollection": {
      "type": "string",
      "title": "Output collection",
      "description": "Collection to store synonym and similar query pairs.",
      "hints": ["dummy"]
    },
    "overwriteOutput": {
      "type": "boolean",
      "title": "Overwrite Output",
      "description": "Overwrite output collection",
      "default": true,
      "hints": ["hidden", "advanced"]
    },
    "dataOutputFormat": {
      "type": "string",
      "title": "Data output format",
      "description": "Spark-compatible output format (like 'solr', 'parquet', etc)",
      "default": "solr",
      "hints": ["advanced"],
      "minLength": 1
    },
    "sourceFields": {
      "type": "string",
      "title": "Fields to Load",
      "description": "Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.",
      "hints": ["dummy", "hidden"]
    },
    "partitionCols": {
      "type": "string",
      "title": "Partition fields",
      "description": "If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output",
      "hints": ["advanced"]
    },
    "writeOptions": {
      "type": "array",
      "title": "Write Options",
      "description": "Options used when writing output to Solr or other sources",
      "hints": ["advanced"],
      "items": {
        "type": "object",
        "required": ["key"],
        "properties": {
          "key": {
            "type": "string",
            "title": "Parameter Name"
          },
          "value": {
            "type": "string",
            "title": "Parameter Value"
          }
        }
      }
    },
    "readOptions": {
      "type": "array",
      "title": "Read Options",
      "description": "Options used when reading input from Solr or other sources.",
      "hints": ["advanced"],
      "items": {
        "type": "object",
        "required": ["key"],
        "properties": {
          "key": {
            "type": "string",
            "title": "Parameter Name"
          },
          "value": {
            "type": "string",
            "title": "Parameter Value"
          }
        }
      }
    },
    "specialCharsFilterString": {
      "type": "string",
      "title": "Special characters to be filtered out",
      "description": "String of special characters to be filtered from queries.",
      "default": "~!@#$^%&*\\(\\)_+={}\\[\\]|;:\"'<,>.?`/\\\\-",
      "hints": ["advanced"]
    },
    "minQueryLength": {
      "type": "integer",
      "title": "Minimum query length",
      "description": "Queries below this length (in number of characters) will not be considered for generating recommendations.",
      "default": 3,
      "minimum": 1,
      "exclusiveMinimum": false
    },
    "maxQueryLength": {
      "type": "integer",
      "title": "Maximum query length",
      "description": "Queries above this length will not be considered for generating recommendations.",
      "default": 50,
      "minimum": 1,
      "exclusiveMinimum": false
    },
    "countField": {
      "type": "string",
      "title": "Event Count Field Name",
      "description": "Solr field containing number of events (e.g., number of clicks).",
      "default": "count_i"
    },
    "docIdField": {
      "type": "string",
      "title": "Document id Field Name",
      "description": "Solr field containing document id that user clicked.",
      "default": "doc_id_s"
    },
    "overlapThreshold": {
      "type": "number",
      "title": "Query Similarity Threshold",
      "description": "The threshold above which query pairs are considered similar. Decreasing the value can fetch more pairs at the expense of quality.",
      "default": 0.3,
      "hints": ["advanced"],
      "maximum": 1,
      "exclusiveMaximum": false
    },
    "minQueryCount": {
      "type": "integer",
      "title": "Query Clicks Threshold",
      "description": "The minimum number of clicked documents needed for comparing queries.",
      "default": 1,
      "hints": ["advanced"],
      "minimum": 1,
      "exclusiveMinimum": false
    },
    "overlapEnabled": {
      "type": "boolean",
      "title": "Boost on token overlap",
      "description": "Maximize score for query pairs with overlapping tokens by setting score to 1.",
      "default": true,
      "hints": ["advanced"]
    },
    "tokenOverlapValue": {
      "type": "number",
      "title": "Minimum match for token overlap",
      "description": "Minimum amount of overlap to consider for boosting. To specify overlap in terms of ratio, specify a value in (0, 1). To specify overlap in terms of exact count, specify a value >= 1. If value is 0, boost will be applied if one query is a substring of its pair. Stopwords are ignored while counting overlaps.",
      "default": 1,
      "hints": ["advanced"]
    },
    "sessionIdField": {
      "type": "string",
      "title": "Session/User ID field",
      "description": "If session id is not available, specify user id field instead. If this field is left blank, session based recommendations will be disabled.",
      "default": "session_id_s"
    },
    "minPairOccCount": {
      "type": "integer",
      "title": "Minimum query-recommendation pair occurrence count",
      "description": "Minimum number of times a query pair must be generated to be considered valid.",
      "default": 2,
      "hints": ["advanced"],
      "minimum": 1,
      "exclusiveMinimum": false
    },
    "stopwordsBlobName": {
      "type": "string",
      "title": "Stopwords Blob Store",
      "description": "Name of the stopwords blob resource. This is a .txt file with one stopword per line. By default, the file is called stopwords/stopwords_nltk_en.txt; however, a custom file can also be used. Check the documentation for more details on format and uploading to the blob store.",
      "default": "stopwords/stopwords_en.txt",
      "reference": "blob",
      "blobType": "file:spark"
    },
    "type": {
      "type": "string",
      "title": "Spark Job Type",
      "enum": ["similar_queries"],
      "default": "similar_queries",
      "hints": ["readonly"]
    }
  },
  "additionalProperties": true,
  "category": "Other",
  "categoryPriority": 1,
  "propertyGroups": [{
    "label": "Input/Output Parameters",
    "properties": ["trainingCollection", "outputCollection", "dataFormat", "trainingDataFilterQuery", "readOptions", "writeOptions", "trainingDataFrameConfigOptions", "trainingDataSamplingFraction", "randomSeed"]
  }, {
    "label": "Field Parameters",
    "properties": ["fieldToVectorize", "sourceFields", "countField", "docIdField", "sessionIdField"]
  }, {
    "label": "Model Tuning Parameters",
    "properties": ["minQueryLength", "maxQueryLength", "specialCharsFilterString", "stopwordsBlobName", "overlapThreshold", "overlapEnabled", "tokenOverlapValue", "minQueryCount", "minPairOccCount"]
  }]
};

export const SchemaParamFields = ({schema}) => {
  const sanitize = str => {
    if (typeof str !== "string") return str;
    return str.replace(/^"(.*)"$/s, "$1").replace(/\\/g, "").replace(/"/g, "'");
  };
  const formatDescription = str => {
    const s = sanitize(str);
    return (/[.!?]\)*$/).test(s) ? s : `${s}.`;
  };
  const {description, properties = {}, required: requiredProps = []} = schema;
  const visibleProps = useMemo(() => Object.entries(properties).filter(([, prop]) => !prop.hints?.includes("hidden")), [properties]);
  return <div>
      {description && <p>{formatDescription(description)}</p>}

      {visibleProps.map(([name, prop]) => {
    const isRequired = requiredProps.includes(name);
    const hasDefault = prop.default !== undefined;
    const rawDefault = prop.default;
    const isComplexDefault = hasDefault && (typeof rawDefault === "object" || typeof rawDefault === "string" && (rawDefault.length > 20 || rawDefault.includes('"')));
    const fieldProps = {
      key: name,
      body: prop.title || name,
      type: prop.type,
      ...prop.title && ({
        post: [<><span className="text-stone-400 dark:text-stone-500">API property: </span>{name}</>]
      }),
      ...isRequired && ({
        required: true
      }),
      ...!isComplexDefault && hasDefault ? {
        default: sanitize(String(rawDefault))
      } : {}
    };
    const isObject = prop.type === "object" && prop.properties;
    const isArrayOfObjects = prop.type === "array" && prop.items?.type === "object" && prop.items.properties;
    return <ParamField {...fieldProps}>
            {prop.description && <p>{formatDescription(prop.description)}</p>}

            {isComplexDefault && <div className="flex">
                <p>
                  <strong>Default:</strong>
                </p>
                <pre className="!my-0">
                  <code>
                    {JSON.stringify(rawDefault, null, 2)}
                  </code>
                </pre>
              </div>}

            {isArrayOfObjects && <div className="flex">
              <p>
                <strong>Object attributes:</strong>
              </p>
              <pre className="!my-0">
                <code>
                  {'{\n'}
                  {Object.entries(prop.items.properties).map(([iname, iprop]) => <>
                      {`  ${iname}`}
                      {prop.items?.required?.includes(iname) && <span style={{
      color: 'red'
    }}> required</span>}
                      {`: {\n    display name: ${sanitize(iprop.title || '')}\n    type: ${iprop.type}\n  }\n`}
                    </>)}
                  {'}'}
                </code>
              </pre>
              </div>}

            {isObject && <Expandable title="properties">
                <SchemaParamFields schema={{
      properties: prop.properties,
      required: prop.required
    }} />
              </Expandable>}
          </ParamField>;
  })}
    </div>;
};

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};

This recommender is based on the co-occurrence of queries in the context of clicked documents and sessions. It is useful when your data shows that users tend to search for similar items in a single search session. This method of generating query-to-query recommendations is faster and more reliable than the Query-to-Query Similarity recommender job, and it is session-based, unlike the similar queries previously generated as part of the [Synonym Detection job](/docs/5/fusion/reference/config-ref/jobs/synonym-detection).

|                      |                                                                                                                                                                                           |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Default job name** | `COLLECTION_NAME_query_recs`                                                                                                                                                              |
| **Input**            | Raw signals (the `COLLECTION_NAME_signals` collection by default).                                                                                                                        |
| **Output**           | [Queries-for-query recommendations](/docs/5/fusion/getting-data-out/query-enhancement/recommendations/queries-for-query) (the `COLLECTION_NAME_queries_query_recs` collection by default) |

|                          | query | count\_i | type              | timestamp\_tdt    | user\_id          | doc\_id | session\_id | fusion\_query\_id |
| ------------------------ | ----- | -------- | ----------------- | ----------------- | ----------------- | ------- | ----------- | ----------------- |
| Required signals fields: | ✅     | ✅        | See note 1 below. | See note 2 below. | See note 3 below. | ✅       | ✅           |                   |

**Note 1:** Required if you want to weight types differently.

**Note 2:** Required if you want to use time decay.

**Note 3:** Required if no `session_id` field is available. Either `user_id` or `session_id` is needed on response signals if doing any click path analysis from signals.

The job generates two sets of recommendations based on the two approaches described below, then merges and de-duplicates them to present unique query-recommendation pairs.

| Similar queries based on documents clicked                                                                                                                                                                                                                                                                                                              | Similar queries based on co-occurrence in sessions                                                                                                                                                                                                                                                                |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Queries are considered for recommendation if two queries have similar sets of document IDs clicked according to the signals data. This is directly implemented from the similar queries portion of the [Synonym Detection job](/docs/5/fusion/reference/config-ref/jobs/synonym-detection).  This approach can work on both raw and aggregated signals. | Queries are considered for recommendation if two queries have co-occurred in the same session based on the assumption that users search for similar items in a single search session (this may or may not hold true depending on the data).  This approach, based on session/user IDs, needs raw signals to work. |
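
As a rough illustration of the two approaches above, the following Python sketch pairs queries by clicked-document overlap and by session co-occurrence, then merges the results. It is not the job's implementation: the Jaccard overlap measure, the function names, the unordered treatment of pairs, and the sample signals are all assumptions made for the example. Only the default thresholds (0.3 for the query similarity threshold, 2 for the minimum pair occurrence count) and the source labels come from this page.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical raw signals: (session_id, query, clicked_doc_id).
SIGNALS = [
    ("s1", "red polo", "d1"),
    ("s1", "polo shirt", "d1"),
    ("s1", "polo shirt", "d2"),
    ("s2", "red polo", "d2"),
    ("s2", "polo shirt", "d3"),
    ("s3", "laptop bag", "d9"),
]


def click_based_pairs(signals, overlap_threshold=0.3):
    """Pair queries whose clicked-document sets overlap.

    Jaccard overlap is an assumption; the page does not name the measure.
    """
    docs = defaultdict(set)
    for _, query, doc_id in signals:
        docs[query].add(doc_id)
    pairs = {}
    for q1, q2 in combinations(sorted(docs), 2):
        score = len(docs[q1] & docs[q2]) / len(docs[q1] | docs[q2])
        if score > overlap_threshold:
            pairs[(q1, q2)] = "ClickedDocumentBased"
    return pairs


def session_based_pairs(signals, min_pair_occ_count=2):
    """Pair queries that co-occur in at least min_pair_occ_count sessions."""
    sessions = defaultdict(set)
    for session_id, query, _ in signals:
        sessions[session_id].add(query)
    counts = Counter()
    for queries in sessions.values():
        counts.update(combinations(sorted(queries), 2))
    return {pair: "SessionBased" for pair, n in counts.items() if n >= min_pair_occ_count}


def merged_pairs(signals):
    """Merge both result sets, keeping one entry per unique pair."""
    merged = session_based_pairs(signals)
    for pair, source in click_based_pairs(signals).items():
        merged.setdefault(pair, source)
    return merged


print(merged_pairs(SIGNALS))
```

In this sample, the pair of "polo shirt" and "red polo" is found by both approaches and appears once in the merged result. Which source label survives de-duplication in the real job is not specified here; the sketch simply keeps the session-based one.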

<Frame caption="Query-to-Query Session-Based Similarity job dataflow">
  <img src="https://mintcdn.com/lucidworks/iN-DD0xMOO3PKUmX/assets/images/5.2/similar-queries-rec-dataflow-diagram.png?fit=max&auto=format&n=iN-DD0xMOO3PKUmX&q=85&s=d34f9feb81587b6d46097d4548d569b3" alt="Query-to-Query Session-Based Similarity job dataflow" width="841" height="89" data-path="assets/images/5.2/similar-queries-rec-dataflow-diagram.png" />
</Frame>

A default Query-to-Query Session-Based Similarity job (`COLLECTION_NAME_query_recs`) and a dedicated [collection](#the-similar-queries-collection) and [pipeline](#the-query-pipeline) are created when you [enable recommendations for a collection](/docs/5/fusion/getting-data-out/query-enhancement/recommendations/getting-started).

At a minimum, you must configure these:

* an ID for this job
* the input collection containing the signals data, usually `COLLECTION_NAME_signals`
* the data format, usually `solr`
* the query field name, usually `query_s`
* the document ID field name, usually `doc_id_s`
* optionally, the user session ID or user ID field name

  <Note>
    If this field is not specified, then the job generates click-based recommendations only, without session-based recommendations.
  </Note>
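
The minimum configuration above can be sketched as a Python dictionary that is checked and then serialized to JSON. The property names, the required list, the ID pattern, and the defaults are taken from the configuration schema on this page; the job ID and collection name are placeholders, and the `validate` helper is hypothetical. How the configuration is submitted to Fusion is outside the scope of this sketch.

```python
import json
import re

# Property names and defaults come from the schema on this page.
# The job ID and collection name are placeholders to replace with your own.
job_config = {
    "type": "similar_queries",
    "id": "COLLECTION_NAME_query_recs",
    "trainingCollection": "COLLECTION_NAME_signals",
    "dataFormat": "solr",
    "fieldToVectorize": "query_s",
    "docIdField": "doc_id_s",
    # Optional: leave blank to disable session-based recommendations.
    "sessionIdField": "session_id_s",
}

REQUIRED = ["id", "trainingCollection", "fieldToVectorize", "dataFormat", "docIdField", "type"]
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")


def validate(config):
    """Hypothetical helper: return a list of problems found in a job config."""
    problems = [f"missing required property: {name}" for name in REQUIRED if not config.get(name)]
    job_id = config.get("id", "")
    if job_id and (len(job_id) > 63 or not ID_PATTERN.fullmatch(job_id)):
        problems.append("invalid id")
    return problems


print(validate(job_config))
print(json.dumps(job_config, indent=2))
```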

<LwTemplate />

## Data tips

* Running the job on data other than signals is not recommended and may yield unexpected results.
* To get about 90% query coverage with the query pipeline, we recommend a raw signals dataset with about 170k unique queries. More signals will generally improve coverage.
* On a raw signals dataset of about 3 million records, the job finishes in about 7-8 minutes on two executor pods with one CPU and about 3 GB of memory each. Your performance may vary depending on your configuration.

## Boosting recommendations

Generally, if a query and a recommendation have some token overlap, they are closely related and worth highlighting. Therefore, the similarity scores of query-recommendation pairs can be boosted based on token overlap, measured as the number or fraction of tokens that overlap.

For example, consider the pair (“a red polo shirt”, “red polo”). If the minimum match parameter is set to 1, the pair must have at least 1 token in common. This pair has two tokens in common (“red” and “polo”), so it is boosted. If the parameter is set to 0.5, at least half of the tokens in the shorter string (counted as space-separated tokens) must match. Here, the shorter string is “red polo”, which is 2 tokens long, so at least 1 token must match to satisfy the boosting requirement.
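
The boosting rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the job's code: the function names are hypothetical, tokens are treated as a set (so repeated tokens count once), and the length of the shorter string is measured after stopwords are removed. The three cases (a count of 1 or more, a ratio between 0 and 1, and 0 for substring matching) follow the description of the minimum match parameter in the configuration properties.

```python
def tokens(text, stopwords=frozenset()):
    """Split on spaces and drop stopwords; duplicates collapse into a set."""
    return {t for t in text.split(" ") if t and t not in stopwords}


def should_boost(query, recommendation, min_match=1, stopwords=frozenset()):
    """Return True if the pair meets the minimum token overlap."""
    if min_match == 0:
        # A value of 0 boosts when one query is a substring of the other.
        return query in recommendation or recommendation in query
    q = tokens(query, stopwords)
    r = tokens(recommendation, stopwords)
    shorter = min(len(q), len(r))
    if shorter == 0:
        return False
    overlap = len(q & r)
    if min_match >= 1:
        # Exact count: at least min_match tokens in common.
        return overlap >= min_match
    # Ratio: fraction of the shorter side's tokens that must match.
    return overlap / shorter >= min_match


print(should_boost("a red polo shirt", "red polo", 1, stopwords={"a"}))
print(should_boost("a red polo shirt", "red polo", 0.5, stopwords={"a"}))
```

Both calls return `True` for the pair discussed above, because two tokens overlap and the shorter string has two tokens.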

## Tuning tips

These tuning tips pertain to the advanced Model Tuning Parameters:

* **Special characters to be filtered out.** Special characters can cause problems with matching queries and are therefore removed in the job.

  <Note>
    Only the characters are removed, not the queries, so a query like `ps3$` becomes `ps3`.
  </Note>

* **Query similarity threshold.** This is for use by the similar queries portion of the job and is the same as that used in the Synonym and Similar Queries Detection job.

* **Boost on token overlap.** This enables or disables boosting of query-recommendation pairs in which all or some tokens match. The amount of overlap required is set by the next parameter.\
  For example, if this is enabled, then a query-recommendation pair like `(playstation 3, playstation console)` is boosted to a similarity score of 1, provided the minimum match is set to 1 token or 0.5.

* **Minimum match for token overlap.** Similar to the `mm` parameter in Solr, this defines the number or fraction of tokens that must overlap for a pair to be boosted. Queries and recommendations are split on spaces, and each part is considered a token. If using a less-than sign (\<), it must be escaped using a backslash.\
  The value can be an integer, such as 1, in which case at least that many tokens must match. In the previous example, the pair is boosted because the token “playstation” is common to both and the minimum match is set to 1.\
  The value can also be a fraction, in which case that fraction of the tokens in the shorter member of the pair must match. For example, if the value is 0.5, the query is 4 tokens long, and the recommendation is 6 tokens long, then the query and recommendation must have at least 2 tokens in common.\
  Stopwords specified in the [list of stopwords](/docs/5/fusion/getting-data-out/query-enhancement/stopwords-files) are ignored when calculating the overlap.

* **Query clicks threshold.** The minimum number of clicked documents needed for comparing queries.

* **Minimum query-recommendation pair occurrence count.** The minimum number of times a query-recommendation pair must be generated to make it into the final list of similar query recommendations. The default is 2. A higher value improves quality but reduces coverage.
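
To make the query clean-up concrete, here is a minimal Python sketch of special-character filtering followed by the query length check. The character set mirrors the default special characters string and the length limits mirror the defaults for minimum and maximum query length from the configuration properties. The function names are hypothetical, and the use of `str.translate` is an assumption: the job may apply the filter differently, for example as a regular expression.

```python
# Characters from the default special characters string, written as a plain set.
SPECIAL_CHARS = "~!@#$^%&*()_+={}[]|;:\"'<,>.?`/\\-"


def clean_query(query, special_chars=SPECIAL_CHARS):
    """Remove special characters from a query; the query itself is kept."""
    table = str.maketrans("", "", special_chars)
    return query.translate(table).strip()


def keep_query(query, min_len=3, max_len=50):
    """Keep queries whose length in characters is within the limits."""
    return min_len <= len(query) <= max_len


for raw in ["ps3$", "tv", "red polo!"]:
    cleaned = clean_query(raw)
    print(repr(raw), "->", repr(cleaned), "kept" if keep_query(cleaned) else "dropped")
```

As in the note above, `ps3$` becomes `ps3` rather than being discarded; only queries outside the length limits are dropped.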

## The similar queries collection

The following fields are stored in the `COLLECTION_NAME_queries_query_recs` collection:

* `query_t`
* `recommendation_t`
* `similarity_d`, the similarity score
* `source_s`, the approach that generated this pair, one of the following: `SessionBased` or `ClickedDocumentBased`
* `query_count_l`, the number of times the query occurred in signals
* `recommendation_count_l`, the number of times recommendations occurred in signals
* `pair_count_l`, the number of instances of the pair generated in the final recommendations using either of the approaches
* `type_s`, always set to `similar_queries`
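
For client code that reads this collection, the fields above could be mapped onto a typed record as in the sketch below. The class name, attribute names, and sample values are hypothetical. The Python types are inferred from the Solr dynamic-field suffixes (`_t` and `_s` as text, `_d` as a double, `_l` as a long), which is an assumption about how the collection's schema is set up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimilarQueryRec:
    """One query-recommendation pair read from the similar queries collection."""

    query: str
    recommendation: str
    similarity: float
    source: str  # "SessionBased" or "ClickedDocumentBased"
    query_count: int
    recommendation_count: int
    pair_count: int
    type: str = "similar_queries"

    @classmethod
    def from_doc(cls, doc):
        """Build a record from a document dictionary keyed by field name."""
        return cls(
            query=doc["query_t"],
            recommendation=doc["recommendation_t"],
            similarity=float(doc["similarity_d"]),
            source=doc["source_s"],
            query_count=int(doc["query_count_l"]),
            recommendation_count=int(doc["recommendation_count_l"]),
            pair_count=int(doc["pair_count_l"]),
            type=doc.get("type_s", "similar_queries"),
        )


# Made-up document, for illustration only.
sample_doc = {
    "query_t": "red polo",
    "recommendation_t": "polo shirt",
    "similarity_d": 1.0,
    "source_s": "SessionBased",
    "query_count_l": 12,
    "recommendation_count_l": 30,
    "pair_count_l": 4,
    "type_s": "similar_queries",
}
rec = SimilarQueryRec.from_doc(sample_doc)
print(rec)
```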

## The query pipeline

When you enable recommendations, a default query pipeline, `COLLECTION_NAME_queries_query_recs`, is created.

## Configuration properties

<SchemaParamFields schema={schema} />
