Training data for Lucidworks AI custom embedding model training
This feature is currently only available to clients who have contracted with Lucidworks for features related to Neural Hybrid Search and Lucidworks AI.
Training data for custom models requires both an index file and a query file. The dataset_config specifies the index and query files, which must:

- Be in Parquet format.
- Contain specific column names, or the model training will fail. You can change the default value of a column name, but some column names must be the same in the index and query files. The requirements are detailed in the sections for those files.
Index file
The index file, also referred to as the catalog index file, contains documents that will be searched during training. The file is stored in Google Cloud Storage (GCS).
The index file content format differs based on the model type being trained. Model types include general and eCommerce.
- The general recurrent neural network (RNN) index file format contains:

  - `dataset_config.index_title_col_name`, where `null` is the default value for this model type.
  - `dataset_config.pkid_col_name`, where `pkid` is the default value. The unique product key ID. Required field. If a document is associated with a query, there is at least one matching entry in the query file. If a document is not associated with a query, it is still used for vocabulary creation and evaluation purposes in model training.
  - `dataset_config.index_desc_col_name`, where `text` is the default value. A freeform text field for the associated `pkid`.
  - `dataset_config.index_body_col_name`, where `null` is the default value for this model type. An optional column that contains additional text data used for vocabulary creation in the case of word-based model training.

  Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during model training. Because of this, Lucidworks recommends concatenating these two columns at index time for encoding.
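The concatenation behavior above can be sketched with pandas. The column names `title` and `description` here are hypothetical stand-ins for whatever columns you map to the title and description parameters; missing values fall back to the other column.

```python
import pandas as pd

df = pd.DataFrame({
    "pkid": ["p1", "p2"],
    "title": ["Trail shoe", None],       # title may be absent for some rows
    "description": ["Lightweight mesh upper", "Waterproof hiking boot"],
})

# When both columns are provided, training concatenates them into one
# content column; empty values simply contribute nothing.
df["content"] = (
    df["title"].fillna("")
    .str.cat(df["description"].fillna(""), sep=" ")
    .str.strip()
)
```

Concatenating the same columns at index time keeps the encoded document text consistent with what the model saw during training.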
- The eCommerce RNN index file format contains:

  - `dataset_config.index_title_col_name`, where the freeform text field that contains the product name is the default value for this model type. Training data may be higher quality when this optional field is included. To determine the highest quality results for your organization, perform training using `dataset_config.pkid_col_name` (`pkid`) both with and without this field.
  - `dataset_config.pkid_col_name`, where `pkid` is the default value. The unique product key ID. Required field. If a document is associated with a query, there is at least one matching entry in the query file. If a document is not associated with a query, it is still used for vocabulary creation and evaluation purposes in model training.
  - `dataset_config.index_desc_col_name`, where `null` is the default value. A freeform text field for the associated `pkid`.
  - `dataset_config.index_body_col_name`, where `null` is the default value for this model type. An optional column that contains additional text data used for vocabulary creation in the case of word-based model training.

  Training requires at least one entry for `dataset_config.index_title_col_name` or `dataset_config.index_desc_col_name`. If both are provided, they are concatenated into one content column that is used during model training. Because of this, Lucidworks recommends concatenating these two columns at index time for encoding.
Query file
The query file, also referred to as the signals file, contains query data associated with the index file entries. The file is stored in Google Cloud Storage (GCS).
For optimal results, include at least 500 rows of unique query column items in the query file. While it is possible to train a useful custom model with fewer rows, a pre-trained model may be the better option in that case.
The query file must have a `pkid` (product key ID) column that refers to the relevant document or product ID. The file may contain multiple duplicates of any `pkid` because each document can be associated with several relevant queries.
For evaluation purposes, 10% of unique queries (50 minimum and 5000 maximum) are automatically sampled into a validation set from the training query file.
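The validation sampling rule above (10% of unique queries, clamped between 50 and 5000) can be expressed as:

```python
def validation_sample_size(unique_queries: int) -> int:
    """Number of unique queries sampled into the validation set:
    10% of the total, with a floor of 50 and a cap of 5000."""
    return min(max(unique_queries // 10, 50), 5000)
```

For example, 200 unique queries yield a validation set of 50 (the floor applies), 10,000 yield 1,000, and 100,000 yield 5,000 (the cap applies).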
- The general RNN query file format contains:

  - `dataset_config.pkid_col_name`, where `pkid` is the default value. The unique product key ID. Required field. This must match an entry in the index file.
  - `dataset_config.query_col_name`, where `query` is the default value. A freeform text field.
  - `dataset_config.weight_col_name`, where `null` is the default value.
- The eCommerce RNN query file format contains:

  - `dataset_config.pkid_col_name`, where `pkid` is the default value. The unique product key ID. Required field. This must match an entry in the index file.
  - `dataset_config.query_col_name`, where `query` is the default value. A freeform text field.
  - `dataset_config.weight_col_name`, where `aggr_count` is the default value. The `aggr_count` is the number of documents that match the query criteria, which is the weight of the query in relation to the document. In most cases, this value is used as a weight and must be greater than zero (0). If you do not use weights or there is no value, set this value to 1. The weight is used for training pairs sampling and to compute normalized discounted cumulative gain (NDCG) metrics. If all values are 1.0, binary NDCG is computed.
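A small pandas sketch of the weight rule above: weights must be greater than zero, and rows with a missing or unusable weight default to 1 (if every weight ends up as 1.0, binary NDCG is computed). The sample values are hypothetical.

```python
import pandas as pd

signals = pd.DataFrame({
    "pkid": ["p1", "p2", "p3"],
    "query": ["boots", "boots", "sandals"],
    "aggr_count": [7, None, 0],  # None and 0 are not valid weights
})

# Keep counts that are > 0; replace missing or non-positive counts with 1.
signals["aggr_count"] = signals["aggr_count"].where(signals["aggr_count"] > 0, 1)
```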
Examples
General RNN
This example configures a general RNN with `"dataset_config.pkid_col_name": "doc_id"`, `"dataset_config.index_desc_col_name": "summary"`, and `"dataset_config.weight_col_name": "weight"`.
The values in this example for `dataset_config.index_title_col_name`, `dataset_config.index_body_col_name`, and `dataset_config.query_col_name` are the default values. When you want to use a default value, you do not have to include that parameter in the config because it is applied even if it is not included. This example includes those parameters and values for clarity.
```json
{
  "dataset_config": "mlp_general_rnn",
  "trainer_config": "mlp_general_rnn",
  "dataset_config.pkid_col_name": "doc_id",
  "dataset_config.index_title_col_name": null,
  "dataset_config.index_desc_col_name": "summary",
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": "weight"
}
```
eCommerce RNN
This example configures an eCommerce RNN with `"dataset_config.index_title_col_name": "product_name"` and `"dataset_config.index_desc_col_name": "BRAND"`.
The values in this example for `dataset_config.index_body_col_name`, `dataset_config.query_col_name`, and `dataset_config.weight_col_name` are the default values. When you want to use a default value, you do not have to include that parameter in the config because it is applied even if it is not included. This example includes those parameters and values for clarity.
In addition, the `dataset_config.index_title_col_name` and `dataset_config.index_desc_col_name` fields are provided when using the trained model for embeddings. To enable the query-side vector search to take full advantage of model training, concatenate the `product_name` and `BRAND` values before encoding at index time.
```json
{
  "dataset_config": "mlp_ecommerce",
  "trainer_config": "mlp_ecommerce",
  "dataset_config.pkid_col_name": "pkid",
  "dataset_config.index_title_col_name": "product_name",
  "dataset_config.index_desc_col_name": "BRAND",
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": "aggr_count"
}
```
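The index-time concatenation recommended for this example can be sketched with pandas, using the example's `product_name` and `BRAND` columns; the catalog rows are hypothetical.

```python
import pandas as pd

catalog = pd.DataFrame({
    "pkid": ["sku-1", "sku-2"],
    "product_name": ["Trail Runner 2", "City Loafer"],
    "BRAND": ["Acme", "Lucid"],
})

# Concatenate title and description columns (here product_name and BRAND)
# so the text encoded at index time matches what the model saw in training.
catalog["embed_text"] = catalog["product_name"] + " " + catalog["BRAND"]
```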
For more information about custom configuration parameters, see Custom configuration.
How to acquire training data
To acquire training data, you can:
- Extract it from signals, which are user interactions on your site. Examples include:

  - Clicks on documents after a search query on knowledge management sites
  - Add-to-cart and purchase signals for a specific query on eCommerce sites
  - Interactions after abandoning search results or rephrasing the query

  Signals can be aggregated into (query, `pkid`) pairs with an aggregation count.

  An effective method for eCommerce queries is to use validation based on a time series. Before aggregation, split signals based on a timestamp and then aggregate both splits. This yields training and test sets based on a time series. For example, `N-1` months of signals can be used for training and the `Nth` (last) month for testing and evaluation.

- Generate it from your site’s client forum and call center logs with real user questions and answers.
- Build it from frequently asked questions posed in queries, with answers contained in the documents from the index.
- Use information from documents in the index, such as titles, descriptions, and body text.
- Use datasets, labels returned in queries, and manual labels on your website.
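The signal aggregation and time-series split described in the first item above can be sketched with pandas. The raw signals table and its `timestamp` column are hypothetical; the key point is to split on a timestamp first and aggregate each split separately.

```python
import pandas as pd

# Hypothetical raw signals: one row per user interaction.
signals = pd.DataFrame({
    "query": ["boots", "boots", "boots", "sandals"],
    "pkid": ["p1", "p1", "p2", "p3"],
    "timestamp": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-02-11", "2024-03-01"]
    ),
})

# Split before aggregating: the last month becomes the test set,
# and everything earlier becomes the training set.
cutoff = pd.Timestamp("2024-03-01")
train_raw = signals[signals["timestamp"] < cutoff]
test_raw = signals[signals["timestamp"] >= cutoff]

def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    # aggr_count is the number of signals for each (query, pkid) pair.
    return df.groupby(["query", "pkid"]).size().reset_index(name="aggr_count")

train = aggregate(train_raw)
test = aggregate(test_raw)
```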