# Use case training data

> Lucidworks AI custom embedding model training

Training data for custom models requires both the index file and query file. The `dataset_config` contains the index and query files, which must:

* Be in Parquet format.
* Contain specific column names or the model training will fail. You have the option to change the default value of the column name, but some of the column names must be the same in the index and query files. The requirements are detailed in the sections for those files.

## Index file

The index file, also referred to as the catalog index file, contains documents that will be searched during training. The file is stored in Google Cloud Storage (GCS).

The index file content format is different based on the model type to be trained. Each use case describes the format to use. For more information, see [Training use cases](#training-use-cases).

## Query file

The query file, also referred to as the signals file, contains query data associated with the index file entries. The file is stored in Google Cloud Storage (GCS).

For optimal results, the recommended practice is to include at least 500 rows of unique query column items in the query file. While it is possible to train a useful custom model with fewer rows, a [pre-trained model](/docs/lw-platform/lw-ai/lw-ai-pre-trained-embedding-models) may be the best option.

Each use case describes the format to use. For more information, see [Training use cases](#training-use-cases).

The query file must have a `pkid` column which refers to the relevant document or product ID. The file may contain multiple duplicates of any `pkid` because each document could be associated with several relevant queries.
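
The `pkid` relationship described above can be checked before uploading the files. The following is a minimal sketch, not part of Lucidworks AI: it assumes the index and query rows have already been loaded from Parquet into lists of dictionaries, and the function name `check_query_rows` and the `query` column name are hypothetical. Only the `pkid` column name and the 500-row recommendation come from this page.

```python
def check_query_rows(index_rows, query_rows, pkid_col="pkid",
                     query_col="query", min_unique_queries=500):
    """Return a list of problems found in the query rows (empty if none).

    index_rows and query_rows are lists of dicts, one dict per file row.
    query_col is a hypothetical column name used only for this sketch.
    """
    problems = []

    # Every query row needs a pkid value.
    missing = [i for i, row in enumerate(query_rows) if pkid_col not in row]
    if missing:
        problems.append(f"{len(missing)} query rows have no '{pkid_col}' value")

    # Each pkid in the query rows should refer to a document in the index.
    index_ids = {row.get(pkid_col) for row in index_rows}
    query_ids = {row[pkid_col] for row in query_rows if pkid_col in row}
    unknown = sorted(query_ids - index_ids)
    if unknown:
        problems.append(f"pkid values not found in the index rows: {unknown}")

    # Duplicate pkid values are fine; only unique query text is counted.
    unique_queries = {row[query_col] for row in query_rows if query_col in row}
    if len(unique_queries) < min_unique_queries:
        problems.append(
            f"only {len(unique_queries)} unique queries; "
            f"{min_unique_queries} or more are recommended"
        )
    return problems
```

A non-empty result does not necessarily mean training fails. The unique-query count, for example, is a recommendation rather than a hard limit.
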
For evaluation purposes, 10% of unique queries (50 minimum and 5000 maximum) are automatically sampled into a validation set from the training query file.

## Training use cases

Lucidworks AI supports four types of training use cases that are targeted for different uses. The use case types are:

* [General training use case type](#general-training-use-case-type)
* [Ecommerce training use case type](#ecommerce-training-use-case-type)
* [Classification training use case type](#classification-training-use-case-type)
* [Part Number Classification training use case type](#part-number-classification-training-use-case-type)

### General training use case type

The benefits of using the general training use case type depend on your organization type:

This use case captures user intent and enhances relevance for natural language queries, especially in large content libraries such as catalogs and knowledge databases.

This use case improves product and information discovery and retrieval by interpreting semantic relationships between existing document content components such as titles, descriptions, and body text.

This use case indexes and interprets large volumes of content such as articles, studies, manuals, and knowledge databases to ground results and return precise, highly relevant responses to queries.

#### General RNN index file format

The general recurrent neural network (RNN) index file format contains:

* A column specifying the title or headline for a document.
* The unique product key ID (`pkid`). If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used in vocabulary creation and evaluation purposes in model training.
* A freeform text field for the associated `pkid`.
* An optional column that contains additional text data used for vocabulary creation in case of word-based model training.

Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends concatenating these two columns at index time for encoding.

#### General RNN query file format

The general RNN query file format contains:

* The unique product key ID (`pkid`). This must match an entry in the index file.
* A freeform text field.
* A column specifying the weight or importance of each query.

#### Evaluation metric training use case

The custom model feature monitors the `mrr@3` metric for the general use case. To view the metrics, navigate to the [Model Details screen](https://doc.lucidworks.com/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-model-training-user-interface#metrics).

#### Recommended model type to use

The following model types are recommended. For more information, see:

* [Transformer RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-transformer-rnn-models)
* [RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models)

### Ecommerce training use case type

The benefits of using the ecommerce training use case type depend on your organization type:

This use case interprets signals (such as click and purchase) to weight relevance, which is used to assess large product catalogs and better match queries to products.

This use case combines aggregated user behavior signals and product metadata to enhance product discoverability using ranking and relevant retrieval. Ultimately, this can improve relevance and conversions.

For knowledge management organizations where the content consists of more product-related information, this use case indexes and interprets that content and enhances weighted relevance in results.
#### Ecommerce RNN index file format

The ecommerce RNN index file format contains:

* An optional field that may produce higher-quality training data. To determine the highest quality results for your organization, perform training using `dataset_config.pkid_col_name` (`pkid`) with and without including this field.
* The unique product key ID (`pkid`). If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used in vocabulary creation and evaluation purposes in model training.
* A freeform text field for the associated `pkid`.
* An optional column that contains additional text data used for vocabulary creation in case of word-based model training.

Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends these two columns be concatenated at index time for encoding.

#### Ecommerce RNN query file format

The ecommerce RNN query file format contains:

* The unique product key ID (`pkid`). This must match an entry in the index file.
* A freeform text field.
* The `aggr_count`, which is the number of documents that match the query criteria and serves as the weight of the query in relation to the document. In most cases, this value is used as a weight and must be greater than zero (0). If you do not use weights or there is no value, set this value to **1**. The weight is used for training pairs sampling and to compute normalized discounted cumulative gain (NDCG) metrics. If all values are **1.0**, binary NDCG is computed.

#### Evaluation metric training use case

The custom model feature monitors the `ndcg@5` metric for the ecommerce use case, which can use weight to provide signal aggregation information.
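
To make the role of the weight concrete, the following is a minimal sketch of NDCG at a cutoff `k` for a single query. It uses the common linear-gain, log2-discount form of the metric. The exact formula that Lucidworks AI applies is not stated on this page, so treat the function names and the formula as assumptions for illustration only.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_pkids, weights, k=5):
    """NDCG@k for one query.

    ranked_pkids: pkid values in the order the model returned them.
    weights: dict mapping each relevant pkid to its weight
             (for example, the aggr_count for this query).
    """
    # Documents that are not relevant to the query contribute no gain.
    gains = [weights.get(pkid, 0.0) for pkid in ranked_pkids]
    # The ideal ranking lists relevant documents by descending weight.
    ideal_gains = sorted(weights.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

When every weight is 1.0, each relevant document contributes the same gain, so the score depends only on where the relevant documents appear in the ranking. This corresponds to the binary case.
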
To view the metrics, navigate to the [Model Details screen](https://doc.lucidworks.com/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-model-training-user-interface#metrics). If your ecommerce data does not have a weight value set, or if the value is 1.0, binary NDCG is calculated.

#### Recommended model type to use

The ecommerce model type is recommended. For more information, see [RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models).

### Classification training use case type

The benefits of using the classification training use case type depend on your organization type:

This use case categorizes customer queries, reviews, and content to enhance the efficiency and accuracy of search results and improve the efficiency of organizational operations.

This use case enhances categorization of documents, including technical documentation, support tickets, and product information. When these structured categories and labels are used, functions such as tagging, routing, and downstream automation are improved.

This use case categorizes large volumes of content to enhance filtering and accuracy of search results.

#### Classification index file format

The classification index file format contains:

* A column specifying the label for a document. This is optional if the `dataset_config.pkid_col_name` is the classification label.
* The unique product key ID or unique classification label. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used in vocabulary creation and evaluation purposes in model training.
* A freeform text field for the associated `pkid`.
* An optional column that contains additional text data used for vocabulary creation in case of word-based model training.

Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends concatenating these two columns at index time for encoding.

#### Classification RNN query file format

The classification query file format contains:

* The unique product key ID or unique classification label. This must match an entry in the index file.
* A freeform text field.
* A column specifying the weight or importance of each query.

#### Recommended model type to use

The following model types are recommended for general classification. For more information, see:

* [Transformer RNN models](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-transformer-rnn-models)
* [General RNN model](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models)

The ecommerce model type is recommended for ecommerce classification. For more information, see [Ecommerce RNN model](/docs/lw-platform/lw-ai/lw-ai-custom-embedding-model-training/custom-embedding-rnn-models).

#### See also

For more information, see [Using the Lucidworks AI embeddings and side-car collection in the Classification use case](/docs/lw-platform/lw-ai/lw-ai-apis/lw-ai-prediction-api/classification-prediction#using-the-lucidworks-ai-embeddings-and-side-car-collection).

### Part Number Classification training use case type

The part number classification training use case type is also referred to as the Part Number Detection use case of the Lucidworks AI Prediction API. The benefits of using it depend on your organization type:

This use case helps customers find the correct replacement parts and products by automatically identifying and categorizing part numbers from photos, descriptions, or partial information. This reduces order errors, improves search accuracy for products such as automotive parts, appliances, and electronics, and enhances the customer purchasing experience.

This use case streamlines procurement and supply chain operations by automatically classifying and cross-referencing manufacturer part numbers across different suppliers and systems. This enables faster parts sourcing, standardizes part identification across catalogs, improves warranty claim processing, and enhances inventory management accuracy.

This use case organizes technical documentation, service manuals, and maintenance records by automatically extracting and categorizing part numbers. This improves discoverability of parts-related content, enables tracking of part availability and updates, and enhances search across large technical knowledge bases.

#### Part Number Classification index file format

The part number classification index file format contains:

* The unique product key ID. This field contains the part number values that always need to be classified as part numbers. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used in vocabulary creation and evaluation purposes in model training.
* A freeform text field for the associated `pkid`. This field contains content that is searched for, but does not include part numbers. This can be any query field content that is most commonly used.

#### Part Number Classification query file format

The part number classification query file is optional, but when it exists, it contains:

* The unique product key ID. This is a freeform text field that must match a part number entry in the index file.
* The values that are classified as the most commonly entered queries. This is a freeform text field that is a combination of part number and non-part number content.
* A column specifying the weight or importance of each query. The field should only contain positive integer values.

## How to acquire training data

To acquire training data, you can complete the following actions:

* Extract from signals, which are user interactions on your site. Examples include:
  * Clicks in documents after the search query on knowledge management sites
  * Add-to-cart and purchase complete signals in a specific query on ecommerce sites
  * Interactions after abandoning search results or rephrasing the query

  Signals can be aggregated by query and `pkid` pair, with an aggregation count for each pair. An effective method for ecommerce queries is to use validation based on a time series. Before aggregation, split signals based on a timestamp and then aggregate both splits. This yields training and test sets based on a time series. For example, `N-1` months of signals can be used for training and the `Nth` (last) month for testing and evaluation.
* Generate from your site’s client forum and call center logs with real user questions and answers
* Build from frequently asked questions posed in queries, with answers contained in the documents from the index
* Use information from documents in the index, such as titles, descriptions, and body text
* Use datasets, labels returned in queries, and manual labels on your website
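
The time-series split described for signals can be sketched as follows. This is an illustrative example, not a Lucidworks AI utility: the function name `split_and_aggregate`, the `(timestamp, query, pkid)` signal layout, and the `query` column name are assumptions. The `pkid` and `aggr_count` names are the ones used on this page.

```python
from collections import Counter
from datetime import datetime


def split_and_aggregate(signals, cutoff):
    """Split raw signals at a timestamp, then aggregate each split.

    signals: iterable of (timestamp, query, pkid) tuples (assumed layout).
    cutoff: datetime; signals before it go to training, the rest to testing.
    Returns (train_rows, test_rows), each a list of dicts with the
    query, pkid, and aggr_count for one (query, pkid) pair.
    """
    train_counts, test_counts = Counter(), Counter()
    for timestamp, query, pkid in signals:
        # Split first, so no test-period signal leaks into training counts.
        target = train_counts if timestamp < cutoff else test_counts
        target[(query, pkid)] += 1

    def to_rows(counts):
        return [
            {"query": query, "pkid": pkid, "aggr_count": count}
            for (query, pkid), count in sorted(counts.items())
        ]

    return to_rows(train_counts), to_rows(test_counts)
```

For example, with three months of signals, passing the first day of the third month as `cutoff` places the first two months in the training rows and the last month in the test rows. The resulting rows would still need to be written to Parquet with a library of your choice.
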