Fusion 5.9

    Training data

    Lucidworks AI custom embedding model training

    Training data for custom models requires both an index file and a query file. The dataset_config contains the index and query files, which must:

    • Be in Parquet format.

    • Contain specific column names, or model training fails. You can change the default column names, but some column names must be the same in the index and query files. The requirements are detailed in the sections for those files.

    Index file

    The index file, also referred to as the catalog index file, contains documents that will be searched during training. The file is stored in Google Cloud Storage (GCS).

    The format of the index file content depends on the model type to be trained, such as general or eCommerce.

    The general recurrent neural network (RNN) index file format contains:

    Field: dataset_config.index_title_col_name
    Default value: null
    Description: A column specifying the title or headline for a document.

    Field: dataset_config.pkid_col_name
    Default value: pkid
    Description: The unique product key ID. Required field. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used for vocabulary creation and evaluation in model training.

    Field: dataset_config.index_desc_col_name
    Default value: text
    Description: A freeform text field for the associated pkid.

    Field: dataset_config.index_body_col_name
    Default value: null
    Description: An optional column that contains additional text data used for vocabulary creation in word-based model training.

    Training requires at least one entry for dataset_config.index_title_col_name or dataset_config.index_desc_col_name. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends these two columns be concatenated at index time for encoding.
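
    As a rough illustration of the concatenation described above, the following sketch joins a title and a description into one content string. The function name build_content, the sample rows, and the single-space separator are assumptions made for this example; the trainer's actual joining behavior is not documented here.

```python
from typing import Optional


def build_content(title: Optional[str], desc: Optional[str], sep: str = " ") -> str:
    """Join the title and description into one content string.

    The single-space separator is an assumption; the source does not
    say how the trainer joins the two columns.
    """
    # Keep only values that are present and not blank.
    parts = [p.strip() for p in (title, desc) if p and p.strip()]
    if not parts:
        raise ValueError("at least one of title or description is required")
    return sep.join(parts)


# Hypothetical index rows; None stands for a missing column value.
rows = [
    {"pkid": "1", "title": "Reset a password", "text": "Steps to reset a forgotten password."},
    {"pkid": "2", "title": None, "text": "How to export search results."},
]
contents = [build_content(r["title"], r["text"]) for r in rows]
```

    Applying the same function to documents at index time keeps the text that is encoded for search consistent with the text the model saw during training.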

    The eCommerce RNN index file format contains:

    Field: dataset_config.index_title_col_name
    Description: The freeform text field that contains the product name is the default value for this model type. Including this optional field may improve the quality of the training data. To find the best results for your organization, run training using dataset_config.pkid_col_name (pkid) both with and without this field.

    Field: dataset_config.pkid_col_name
    Default value: pkid
    Description: The unique product key ID. Required field. If this document is associated with a query, there is at least one matching entry in the query file. If this document is not associated with a query, it is still used for vocabulary creation and evaluation in model training.

    Field: dataset_config.index_desc_col_name
    Default value: null
    Description: A freeform text field for the associated pkid.

    Field: dataset_config.index_body_col_name
    Default value: null
    Description: An optional column that contains additional text data used for vocabulary creation in word-based model training.

    Training requires at least one entry for dataset_config.index_title_col_name or dataset_config.index_desc_col_name. If both are provided, they are concatenated into one content column that is used during the model training. Because of this, Lucidworks recommends these two columns be concatenated at index time for encoding.

    Query file

    The query file, also referred to as the signals file, contains query data associated with the index file entries. The file is stored in Google Cloud Storage (GCS).

    For optimal results, include at least 500 rows with unique values in the query column of the query file. While it is possible to train a useful custom model with fewer rows, a pre-trained model may be the better option in that case.

    The query file must have a pkid (product key ID) column that refers to the relevant document or product ID. The same pkid may appear in the file multiple times because each document can be associated with several relevant queries.
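
    Before uploading the files, it can help to check these two points locally. The sketch below works on rows already loaded into memory as dictionaries; the function name check_query_rows and the sample rows are hypothetical, and the default column names are the ones listed in this topic.

```python
RECOMMENDED_MIN_UNIQUE_QUERIES = 500  # recommendation stated above


def check_query_rows(index_rows, query_rows, pkid_col="pkid", query_col="query"):
    """Return (pkids missing from the index, number of unique queries)."""
    index_pkids = {row[pkid_col] for row in index_rows}
    unmatched = sorted({row[pkid_col] for row in query_rows} - index_pkids)
    unique_queries = len({row[query_col] for row in query_rows})
    return unmatched, unique_queries


# Hypothetical sample data.
index_rows = [{"pkid": "A1"}, {"pkid": "B2"}, {"pkid": "C3"}]
query_rows = [
    {"pkid": "A1", "query": "red shoes"},
    {"pkid": "A1", "query": "running shoes"},  # same pkid, different query
    {"pkid": "B2", "query": "red shoes"},
    {"pkid": "Z9", "query": "blue hat"},       # no such pkid in the index
]
unmatched, unique_queries = check_query_rows(index_rows, query_rows)
below_recommended = unique_queries < RECOMMENDED_MIN_UNIQUE_QUERIES
```

    Repeated pkid values are expected and are not reported; only pkid values with no entry in the index file are.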

    For evaluation purposes, 10% of unique queries (50 minimum and 5000 maximum) are automatically sampled into a validation set from the training query file.
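
    The sampling rule above can be written as simple arithmetic. This is only a sketch: the rounding direction and the behavior when fewer than 50 unique queries exist are assumptions, because neither is specified here.

```python
def validation_set_size(unique_queries: int) -> int:
    """10% of unique queries, clamped to the range 50 to 5000."""
    target = unique_queries // 10  # assumption: round down
    clamped = max(50, min(5000, target))
    # Assumption: the sample can never exceed the queries available.
    return min(clamped, unique_queries)


sizes = {n: validation_set_size(n) for n in (200, 1_000, 100_000)}
```

    With these assumptions, 200 unique queries give a validation set of 50, 1,000 give 100, and 100,000 give 5,000.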

    The general RNN query file format contains:

    Field: dataset_config.pkid_col_name
    Default value: pkid
    Description: The unique product key ID. Required field. This must match an entry in the index file.

    Field: dataset_config.query_col_name
    Default value: query
    Description: A freeform text field.

    Field: dataset_config.weight_col_name
    Default value: null
    Description: A column specifying the weight or importance of each query.

    The eCommerce RNN query file format contains:

    Field: dataset_config.pkid_col_name
    Default value: pkid
    Description: The unique product key ID. Required field. This must match an entry in the index file.

    Field: dataset_config.query_col_name
    Default value: query
    Description: A freeform text field.

    Field: dataset_config.weight_col_name
    Default value: aggr_count
    Description: The aggr_count is the number of documents that match the query criteria, which is the weight of the query in relation to the document. In most cases, this value is used as a weight and must be greater than zero (0). If you do not use weights or there is no value, set this value to 1. The weight is used for sampling training pairs and for computing normalized discounted cumulative gain (NDCG) metrics. If all values are 1.0, binary NDCG is computed.
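
    To show how weights affect the metric, here is a minimal NDCG sketch. It uses one common formulation (linear gain with a log2 position discount) and computes the ideal ranking from the same list of gains; which formulation the trainer uses is not stated here, so treat this as an assumption.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_gains):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


# Weights of the documents in the order the model ranked them (hypothetical).
weighted = ndcg([3.0, 0.0, 1.0])  # graded relevance from weights
binary = ndcg([1.0, 0.0, 1.0])    # all weights 1.0: relevant or not relevant
```

    When every weight is 1.0, the gains carry no grading, so the score reflects only whether relevant documents are ranked near the top.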

    Examples

    General RNN

    This example configures a general RNN:

    {
      "dataset_config.pkid_col_name": "doc_id",
      "dataset_config.index_desc_col_name": "summary",
      "dataset_config.weight_col_name": "weight"
    }

    In this example, dataset_config.index_title_col_name, dataset_config.index_body_col_name, and dataset_config.query_col_name keep their default values. When you want to use a default, you do not have to include that parameter in the config because the default is applied even if the parameter is not included. The following version of the config includes those parameters and values for clarity.

    {
      "dataset_config": "mlp_general_rnn",
      "trainer_config": "mlp_general_rnn",
      "dataset_config.pkid_col_name": "doc_id",
      "dataset_config.index_title_col_name": null,
      "dataset_config.index_desc_col_name": "summary",
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": "weight"
    }
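
    One way to picture how omitted parameters fall back to their defaults is a dictionary merge. The sketch below is illustrative only and is not how the service is implemented; the default values are the ones listed for the general RNN files in this topic, and the helper name effective_config is hypothetical.

```python
import json

# Defaults for the general RNN column parameters, as listed in this topic.
GENERAL_RNN_DEFAULTS = {
    "dataset_config.pkid_col_name": "pkid",
    "dataset_config.index_title_col_name": None,
    "dataset_config.index_desc_col_name": "text",
    "dataset_config.index_body_col_name": None,
    "dataset_config.query_col_name": "query",
    "dataset_config.weight_col_name": None,
}


def effective_config(overrides):
    """Return the defaults with any overrides applied on top."""
    unknown = set(overrides) - set(GENERAL_RNN_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**GENERAL_RNN_DEFAULTS, **overrides}


config = effective_config({
    "dataset_config.pkid_col_name": "doc_id",
    "dataset_config.index_desc_col_name": "summary",
    "dataset_config.weight_col_name": "weight",
})
print(json.dumps(config, indent=2))  # None is serialized as JSON null
```

    The merged result carries the same column parameters as the expanded config, without the dataset_config and trainer_config entries.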

    eCommerce RNN

    This example configures an eCommerce RNN:

    {
      "dataset_config.index_title_col_name": "product_name",
      "dataset_config.index_desc_col_name": "BRAND"
    }

    In this example, dataset_config.index_body_col_name, dataset_config.query_col_name, and dataset_config.weight_col_name keep their default values. When you want to use a default, you do not have to include that parameter in the config because the default is applied even if the parameter is not included. The following version of the config includes those parameters and values for clarity.

    In addition, both dataset_config.index_title_col_name and dataset_config.index_desc_col_name are provided when using the trained model for embeddings. To let query-side vector search take full advantage of model training, concatenate product_name and BRAND before encoding at index time.

    {
      "dataset_config": "mlp_ecommerce",
      "trainer_config": "mlp_ecommerce",
      "dataset_config.pkid_col_name": "pkid",
      "dataset_config.index_title_col_name": "product_name",
      "dataset_config.index_desc_col_name": "BRAND",
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": "aggr_count"
    }

    For more information about custom configuration parameters, see Custom configuration.

    How to acquire training data

    To acquire training data, you can:

    • Extract from signals, which are user interactions on your site. Examples include:

      • Clicks in documents after the search query on knowledge management sites

      • Add-to-cart and purchase signals in a specific query on eCommerce sites

      • Interactions after abandoning search results or rephrasing the query

      Signals can be aggregated by query and pkid pair, with an aggregation count for each pair.

      An effective method for eCommerce queries is to use validation based on a time series. Before aggregation, split the signals on a timestamp and then aggregate both splits. This yields training and test sets based on a time series. For example, N-1 months of signals can be used for training and the Nth (last) month for testing and evaluation.

    • Generate from your site’s client forum and call center logs with real user questions and answers

    • Build from frequently asked questions posed in queries, with answers contained in the documents from the index

    • Use information from documents in the index, such as titles, descriptions, and body text

    • Use datasets, labels returned in queries, and manual labels on your website
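
    The time-series split described for signals above can be sketched as follows. The signal field names (timestamp, query, pkid), the sample data, and the helper names are assumptions for illustration; real signal records will have a different schema.

```python
from collections import Counter
from datetime import datetime


def split_and_aggregate(signals, cutoff):
    """Split raw signals at `cutoff`, then count each (query, pkid) pair."""
    train, test = Counter(), Counter()
    for s in signals:
        # Signals before the cutoff go to training, the rest to testing.
        target = train if s["timestamp"] < cutoff else test
        target[(s["query"], s["pkid"])] += 1
    return train, test


def to_rows(counts):
    """Turn pair counts into rows with query, pkid, and aggr_count columns."""
    return [
        {"query": q, "pkid": p, "aggr_count": n}
        for (q, p), n in sorted(counts.items())
    ]


# Hypothetical click signals spanning three months.
signals = [
    {"timestamp": datetime(2024, 1, 5), "query": "red shoes", "pkid": "A1"},
    {"timestamp": datetime(2024, 1, 9), "query": "red shoes", "pkid": "A1"},
    {"timestamp": datetime(2024, 2, 2), "query": "blue hat", "pkid": "B2"},
    {"timestamp": datetime(2024, 3, 1), "query": "red shoes", "pkid": "A1"},
]
# The first two months train the model; the last month is held out.
train, test = split_and_aggregate(signals, cutoff=datetime(2024, 3, 1))
train_rows = to_rows(train)
test_rows = to_rows(test)
```

    Splitting before aggregating keeps each count limited to its own time window, so the held-out month does not leak into the training weights.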