# Use case training data

> Lucidworks AI custom embedding model training

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};

Training data for custom models requires both the index file and query file. The `dataset_config` contains the index and query files, which must:

* Be in Parquet format.
* Contain specific column names, or model training fails. You can change the default column names, but some column names must be identical in the index and query files. The requirements are detailed in the sections for those files.
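
As a rough sketch of the column-name requirement, the following Python compares the column names a `dataset_config` maps to against the columns found in each file. The dictionary layout, the `missing_columns` helper, and the sample column lists are assumptions for illustration. Only the `*_col_name` keys and their defaults come from this page, and the sketch assumes the description column is the content column supplied in the index file.

```python
# Hypothetical dataset_config fragment: the keys come from this page,
# the surrounding dictionary layout is an assumption.
dataset_config = {
    "pkid_col_name": "pkid",
    "index_desc_col_name": "text",
    "query_col_name": "query",
}


def missing_columns(config, index_columns, query_columns):
    """Return the configured column names that are absent from each file."""
    index_required = [config["pkid_col_name"], config["index_desc_col_name"]]
    query_required = [config["pkid_col_name"], config["query_col_name"]]
    return {
        "index": [c for c in index_required if c not in index_columns],
        "query": [c for c in query_required if c not in query_columns],
    }


# Column lists as they might be read from the two Parquet files.
result = missing_columns(dataset_config, ["pkid", "text"], ["pkid", "search_term"])
print(result)
```

In this example the query file uses `search_term` instead of the configured `query` column, so either the file or `query_col_name` would need to change before training.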

<LwTemplate />

## Index file

The index file, also referred to as the catalog index file, contains documents that will be searched during training. The file is stored in Google Cloud Storage (GCS).

The index file content format is different based on the model type to be trained. Each use case describes the format to use. For more information, see [Training use cases](#training-use-cases).

## Query file

The query file, also referred to as the signals file, contains query data associated with the index file entries. The file is stored in Google Cloud Storage (GCS).

For best results, include at least 500 rows with unique values in the query column. A useful custom model can be trained with fewer rows, but in that case a [pre-trained model](/docs/lw-platform/lw-ai/lw-ai-pre-trained-embedding-models) may be the better option.

Each use case describes the format to use. For more information, see [Training use cases](#training-use-cases).

The query file must have a `pkid` column which refers to the relevant document or product ID. The file may contain multiple duplicates of any `pkid` because each document could be associated with several relevant queries.
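
The relationship between the two files can be checked before training. The sketch below is illustrative only: `unmatched_pkids` is a hypothetical helper, and the sample IDs are invented. It reports query-file `pkid` values with no counterpart in the index file, and it treats repeated `pkid` values as valid.

```python
def unmatched_pkids(index_pkids, query_pkids):
    """Return query-file pkid values that have no entry in the index file."""
    known = set(index_pkids)
    # Duplicates in the query file are expected, so report each missing pkid once.
    return sorted({p for p in query_pkids if p not in known})


index_pkids = ["A1", "B2", "C3"]
query_pkids = ["A1", "A1", "B2", "Z9"]  # "A1" appears for two different queries
missing = unmatched_pkids(index_pkids, query_pkids)
print(missing)
```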

<Note>
  For evaluation purposes, 10% of unique queries (50 minimum and 5000 maximum) are automatically sampled into a validation set from the training query file.
</Note>
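
To estimate how many queries are held out, the sampling rule in the note can be approximated as below. This is a sketch, not the service's implementation: the function name is hypothetical, rounding down is an assumption, and the behavior when the query file has fewer than 50 unique queries is not described on this page, so the sketch does not model it.

```python
def validation_set_size(unique_queries):
    """Estimate the validation sample: 10% of unique queries, clamped to 50..5000."""
    ten_percent = unique_queries // 10  # rounding down is an assumption
    return max(50, min(5000, ten_percent))


for n in (300, 2000, 80000):
    print(n, validation_set_size(n))
```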

## Training use cases

Lucidworks AI supports four types of training use cases, each targeted at a different use. The use case types are:

* [General training use case type](#general-training-use-case-type)
* [Ecommerce training use case type](#ecommerce-training-use-case-type)
* [Classification training use case type](#classification-training-use-case-type)
* [Part Number Classification training use case type](#part-number-classification-training-use-case-type)

### General training use case type

Click your organization type for information about the benefits of using the general training use case type:

<Tabs>
  <Tab title="Business-to-Consumer" icon="cart-shopping" iconType="sharp-solid">
    This use case captures user intent and enhances relevance for natural language queries, especially in large content libraries such as catalogs and knowledge databases.
  </Tab>

  <Tab title="Business-to-Business" icon="briefcase" iconType="sharp-solid">
    This use case improves product and information discovery and retrieval by interpreting semantic relationships between existing document content components such as titles, descriptions, and body text.
  </Tab>

  <Tab title="Knowledge Management" icon="lightbulb" iconType="sharp-solid">
    This use case indexes and interprets large volumes of content such as articles, studies, manuals, and knowledge databases to ground results and return precise, highly relevant responses to queries.
  </Tab>
</Tabs>

#### General RNN index file format

The general recurrent neural network (RNN) index file format contains:

<ResponseField name="dataset_config.index_title_col_name" type="string">
  A column specifying the title or headline for a document.
</ResponseField>

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid" required>
  The unique product key ID. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used for vocabulary creation and evaluation during model training.
</ResponseField>

<ResponseField name="dataset_config.index_desc_col_name" type="string" default="text">
  A freeform text field for the associated `pkid`.
</ResponseField>

<ResponseField name="dataset_config.index_body_col_name" type="string">
  An optional column that contains additional text data used for vocabulary creation in case of word-based model training.
</ResponseField>

Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends these two columns be concatenated at index time for encoding.
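
One way to apply the same concatenation at index time is sketched below. The `build_content` helper, the title-then-description order, and the single-space separator are assumptions. This page does not state how the training job joins the two columns.

```python
def build_content(title=None, desc=None, separator=" "):
    """Join the title and description into one content string for encoding."""
    parts = [p.strip() for p in (title, desc) if p and p.strip()]
    if not parts:
        raise ValueError("at least one of title or description is required")
    return separator.join(parts)


content = build_content("Trail running shoe", "Lightweight shoe with a grippy outsole.")
print(content)
```

Using one helper for both the training data and the indexing pipeline keeps the encoded text consistent between the two.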

#### General RNN query file format

The general RNN query file format contains:

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid" required>
  The unique product key ID. This must match an entry in the index file.
</ResponseField>

<ResponseField name="dataset_config.query_col_name" type="string" default="query">
  A freeform text field.
</ResponseField>

<ResponseField name="dataset_config.weight_col_name" type="string">
  A column specifying the weight or importance of each query.
</ResponseField>

#### Evaluation metric training use case

The custom model feature monitors the `mrr@3` metric for the general use case. To view the metrics, navigate to the [Model Details screen](https://doc.lucidworks.com/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-model-training-user-interface#metrics).
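
For reference, mean reciprocal rank at 3 averages `1 / rank` of the first relevant document within the top three results, counting zero when none appears. The sketch below shows the standard definition with invented sample data. It is not necessarily the exact computation the training job performs.

```python
def mrr_at_k(results, k=3):
    """Mean reciprocal rank at k.

    results: list of (ranked_pkids, relevant_pkids) tuples, one per query.
    """
    total = 0.0
    for ranked, relevant in results:
        for rank, pkid in enumerate(ranked[:k], start=1):
            if pkid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)


results = [
    (["A1", "B2", "C3"], {"A1"}),        # first relevant hit at rank 1
    (["B2", "C3", "A1"], {"C3"}),        # first relevant hit at rank 2
    (["B2", "C3", "D4", "A1"], {"A1"}),  # relevant hit at rank 4, outside the top 3
]
score = mrr_at_k(results)
print(score)
```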

#### Recommended model type to use

The following model types are recommended. For more information, see:

* [Transformer RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-transformer-rnn-models)
* [RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models)

### Ecommerce training use case type

Click your organization type for information about the benefits of using the ecommerce training use case type:

<Tabs>
  <Tab title="Business-to-Consumer" icon="cart-shopping" iconType="sharp-solid">
    This use case interprets signals (such as click and purchase) to weight relevance, which is used to assess large product catalogs and better match queries to products.
  </Tab>

  <Tab title="Business-to-Business" icon="briefcase" iconType="sharp-solid">
    This use case combines aggregated user behavior signals and product metadata to enhance product discoverability using ranking and relevant retrieval. Ultimately, this can improve relevance and conversions.
  </Tab>

  <Tab title="Knowledge Management" icon="lightbulb" iconType="sharp-solid">
    For knowledge management organizations where the content consists of more product-related information, this use case indexes and interprets that content and enhances weighted relevance in results.
  </Tab>
</Tabs>

#### Ecommerce RNN index file format

The ecommerce RNN index file format contains:

<ResponseField name="dataset_config.index_title_col_name" type="string">
  An optional freeform text field containing the product name. Training data may be higher quality when this field is included. To determine which option gives the best results for your organization, train with `dataset_config.pkid_col_name` (`pkid`) both with and without this field.
</ResponseField>

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid" required>
  The unique product key ID. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used for vocabulary creation and evaluation during model training.
</ResponseField>

<ResponseField name="dataset_config.index_desc_col_name" type="string">
  A freeform text field for the associated `pkid`.
</ResponseField>

<ResponseField name="dataset_config.index_body_col_name" type="string">
  An optional column that contains additional text data used for vocabulary creation in case of word-based model training.
</ResponseField>

Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends these two columns be concatenated at index time for encoding.

#### Ecommerce RNN query file format

The ecommerce RNN query file format contains:

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid" required>
  The unique product key ID. This must match an entry in the index file.
</ResponseField>

<ResponseField name="dataset_config.query_col_name" type="string" default="query">
  A freeform text field.
</ResponseField>

<ResponseField name="dataset_config.weight_col_name" type="string" default="aggr_count">
  The `aggr_count` is the number of documents that match the query criteria, which is the weight of the query in relation to the document. In most cases, this value is used as a weight and must be greater than zero (0). If you do not use weights or there is no value, set this value to **1**. The weight is used for training pairs sampling and to compute normalized discounted cumulative gain (NDCG) metrics. If all values are **1.0**, binary NDCG is computed.
</ResponseField>

#### Evaluation metric training use case

The custom model feature monitors the `ndcg@5` metric for the ecommerce use case, which can use weight to provide signal aggregation information. To view the metrics, navigate to the [Model Details screen](https://doc.lucidworks.com/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-model-training-user-interface#metrics).

If your ecommerce data does not have a weight value set, or if the value is 1.0, binary NDCG is calculated.
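
The effect of weights on the metric can be seen in a small worked example. The sketch below uses the common NDCG formulation with linear gains and a `log2` discount. The gain function and the `ndcg_at_k` helper are assumptions, since this page does not specify the exact formula used during training.

```python
import math


def ndcg_at_k(ranked_pkids, weights, k=5):
    """NDCG at k with linear gains; weights maps each relevant pkid to its weight."""

    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [weights.get(p, 0.0) for p in ranked_pkids[:k]]
    ideal = sorted(weights.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


ranked = ["B2", "A1", "C3"]
# Weighted: "A1" has the larger aggr_count but is ranked second, so the score drops.
weighted = ndcg_at_k(ranked, {"A1": 3.0, "B2": 1.0})
# Binary: with all weights equal to 1.0, the order of the two relevant items does not matter.
binary = ndcg_at_k(ranked, {"A1": 1.0, "B2": 1.0})
print(weighted, binary)
```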

#### Recommended model type to use

The ecommerce model type is recommended. For more information, see [RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models).

### Classification training use case type

Click your organization type for information about the benefits of using the classification training use case type:

<Tabs>
  <Tab title="Business-to-Consumer" icon="cart-shopping" iconType="sharp-solid">
    This use case categorizes customer queries, reviews, and content to enhance the efficiency and accuracy of search results and improve the efficiency of organizational operations.
  </Tab>

  <Tab title="Business-to-Business" icon="briefcase" iconType="sharp-solid">
    This use case enhances categorization of documents, including technical documentation, support tickets, and product information. When these structured categories and labels are used, functions such as tagging, routing, and downstream automation are improved.
  </Tab>

  <Tab title="Knowledge Management" icon="lightbulb" iconType="sharp-solid">
    This use case categorizes large volumes of content to enhance filtering and accuracy of search results.
  </Tab>
</Tabs>

#### Classification index file format

The classification index file format contains:

<ResponseField name="dataset_config.index_title_col_name" type="string" default="label">
  A column specifying the label for a document. This is optional if the `dataset_config.pkid_col_name` is the classification label.
</ResponseField>

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid or label" required>
  The unique product key ID or unique classification label. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used for vocabulary creation and evaluation during model training.
</ResponseField>

<ResponseField name="dataset_config.index_desc_col_name" type="string">
  A freeform text field for the associated `pkid`.
</ResponseField>

<ResponseField name="dataset_config.index_body_col_name" type="string">
  An optional column that contains additional text data used for vocabulary creation in case of word-based model training.
</ResponseField>

Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends these two columns be concatenated at index time for encoding.

#### Classification RNN query file format

The classification query file format contains:

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid or label" required>
  The unique product key ID or unique classification label. This must match an entry in the index file.
</ResponseField>

<ResponseField name="dataset_config.query_col_name" type="string" default="text">
  A freeform text field.
</ResponseField>

<ResponseField name="dataset_config.weight_col_name" type="string">
  A column specifying the weight or importance of each query.
</ResponseField>

#### Recommended model type to use

The following model types are recommended for general classification. For more information, see:

* [Transformer RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-transformer-rnn-models)
* [General RNN model](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models)

The ecommerce model type is recommended for ecommerce classification. For more information, see [Ecommerce RNN model](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models).

#### See also

For more information, see [Using the Lucidworks AI embeddings and side-car collection in the Classification use case](/docs/lw-platform/lw-ai/lw-ai-apis/lw-ai-prediction-api/classification-prediction#using-the-lucidworks-ai-embeddings-and-side-car-collection).

### Part Number Classification training use case type

Click your organization type for information about the benefits of using the part number classification training use case type, which is also referred to as the Part Number Detection use case of the Lucidworks AI Prediction API:

<Tabs>
  <Tab title="Business-to-Consumer" icon="cart-shopping" iconType="sharp-solid">
    This use case helps customers find the correct replacement parts and products
    by automatically identifying and categorizing part numbers from photos,
    descriptions, or partial information. This reduces order errors, improves
    search accuracy for products such as automotive parts, appliances, and electronics, and
    enhances the customer purchasing experience.
  </Tab>

  <Tab title="Business-to-Business" icon="briefcase" iconType="sharp-solid">
    This use case streamlines procurement and supply chain operations by
    automatically classifying and cross-referencing manufacturer part numbers
    across different suppliers and systems. This enables faster parts sourcing,
    standardizes part identification across catalogs, improves warranty claim
    processing, and enhances inventory management accuracy.
  </Tab>

  <Tab title="Knowledge Management" icon="lightbulb" iconType="sharp-solid">
    This use case organizes technical documentation, service manuals, and
    maintenance records by automatically extracting and categorizing part
    numbers. This improves discoverability of parts-related content, enables
    tracking of part availability and updates, and enhances search across
    large technical knowledge bases.
  </Tab>
</Tabs>

#### Part Number Classification index file format

The part number classification index file format contains:

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid" required>
  The unique product key ID, stored as a freeform text field. This field contains the part number values that must always be classified as part numbers. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used for vocabulary creation and evaluation during model training.
</ResponseField>

<ResponseField name="dataset_config.index_text" type="string" required>
  This field contains content that is searched for, but does not include part numbers. This can be any query field content that is most commonly used.
</ResponseField>

#### Part Number Classification query file format

The part number classification query file is optional. When it is provided, it contains:

<ResponseField name="dataset_config.pkid_col_name" type="string" default="pkid" required>
  The unique product key ID. This is a freeform text field that must match a part number entry in the index file.
</ResponseField>

<ResponseField name="dataset_config.query_col_name" type="string" default="text" required>
  The values classified as the most commonly entered queries. This is a freeform text field that combines part number and non-part number content.
</ResponseField>

<ResponseField name="dataset_config.aggr_count" type="string">
  A column specifying the weight or importance of each query. The field should only contain positive integer values.
</ResponseField>

## How to acquire training data

To acquire training data, you can complete the following actions:

* Extract from signals, which are user interactions on your site. Examples include:

  * Clicks in documents after the search query on knowledge management sites
  * Add-to-cart and purchase complete signals in a specific query on ecommerce sites
  * Interactions after abandoning search results or rephrasing the query

    Signals can be aggregated by query and `pkid` pair, with an aggregation count for each pair.

    <Tip>
      An effective method for ecommerce queries is to use validation based on a time series. Before aggregation, split signals based on a timestamp and then aggregate both splits. This yields training and test sets based on a time series. For example, `N-1` months of signals can be used for training and `Nth` (last) month for testing and evaluation.
    </Tip>

* Generate from your site’s client forum and call center logs with real user questions and answers

* Build from frequently asked questions posed in queries, with answers contained in the documents from the index

* Use information from documents in the index, such as titles, descriptions, and body text

* Use datasets, labels returned in queries, and manual labels on your website
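
The signal aggregation and time-based split described above can be sketched as follows. The signal field names (`query`, `pkid`, `timestamp`), the helper functions, and the sample events are hypothetical. Only the `aggr_count` column name and the split-before-aggregating approach come from this page.

```python
from collections import Counter
from datetime import datetime


def aggregate(signals):
    """Count signals per (query, pkid) pair and emit rows with aggr_count."""
    counts = Counter((s["query"], s["pkid"]) for s in signals)
    return [
        {"query": q, "pkid": p, "aggr_count": n}
        for (q, p), n in sorted(counts.items())
    ]


def time_split(signals, cutoff):
    """Split raw signals at a timestamp, then aggregate each side separately."""
    train = [s for s in signals if s["timestamp"] < cutoff]
    test = [s for s in signals if s["timestamp"] >= cutoff]
    return aggregate(train), aggregate(test)


signals = [
    {"query": "red shoes", "pkid": "A1", "timestamp": datetime(2024, 1, 5)},
    {"query": "red shoes", "pkid": "A1", "timestamp": datetime(2024, 2, 9)},
    {"query": "red shoes", "pkid": "B2", "timestamp": datetime(2024, 2, 20)},
    {"query": "red shoes", "pkid": "A1", "timestamp": datetime(2024, 3, 2)},
]
# The first two months are used for training, the last month for testing.
train_rows, test_rows = time_split(signals, cutoff=datetime(2024, 3, 1))
print(train_rows)
print(test_rows)
```

Splitting before aggregation keeps later signals out of the training counts, so the test set reflects behavior the model has not seen.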
