Fusion 5.9

    Lucidworks AI custom embedding model training

    Custom configuration

    Custom configuration is used to train models with advanced parameters in the Lucidworks AI custom model training user interface.

    Custom configuration parameters

    This section contains the most commonly used and important configuration parameters. You can use these parameters in the Lucidworks AI custom model training user interface > Train a new model > Manual entry > Custom Config field.

    Model type

    To set the dataset and training defaults, enter the appropriate value in the dataset_config and trainer_config fields:

    • mlp_general_rnn. Use this for the general recurrent neural network (RNN) model type.

    • mlp_ecommerce_rnn. This is used for an eCommerce RNN model type.

    For example, to set the general RNN model type, use:

    {
      "dataset_config": "mlp_general_rnn",
      "trainer_config": "mlp_general_rnn"
    }

    To set the eCommerce RNN model type, use:

    {
      "dataset_config": "mlp_ecommerce_rnn",
      "trainer_config": "mlp_ecommerce_rnn"
    }
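
    As a minimal sketch, the two base keys can be generated and serialized with Python before pasting the result into the Custom Config field. The helper name base_config is hypothetical and not part of any Lucidworks library; it only covers the two model types listed in this section.

```python
import json

# Model types named in this section; other values exist elsewhere in the docs.
RNN_MODEL_TYPES = {"mlp_general_rnn", "mlp_ecommerce_rnn"}


def base_config(model_type: str) -> dict:
    """Return the two base keys that select dataset and training defaults."""
    if model_type not in RNN_MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    return {"dataset_config": model_type, "trainer_config": model_type}


# Serialize the config so it can be pasted into the Custom Config field.
custom_config_json = json.dumps(base_config("mlp_general_rnn"), indent=2)
print(custom_config_json)
```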

    Model parameters

    The parameters nested in trainer_config let you set training and model encoder parameters. The most important parameters are:

    • "trainer_config/text_processor_config": "word_en". Determines which type of tokenization and embedding is used as the base for the recurrent neural network (RNN) model. For example, word or byte-pair encoding (BPE). For information about values, see Text processor.

      The text_processor_config parameter name must be joined to trainer_config with a forward slash (/), not a period (.), because it selects a group of parameters by group name. Other nested parameters use a period (.).

    • "trainer_config.encoder_config.rnn_names_list": ["gru"]. Determines which bi-directional recurrent neural network (RNN) layers are used. Options include gru and lstm.

    • "trainer_config.encoder_config.rnn_units_list": [128]. The number of units for each recurrent neural network (RNN) layer.

      Because this is a bi-directional RNN, the encoder’s vector size is two times larger than the number of units in the last layer. For example, if one layer is 128 units, the output vector size is 256.

    The trainer_config.encoder_config.rnn_units_list and trainer_config.encoder_config.rnn_names_list lists must be the same length: specify exactly one unit count for each named RNN layer. For example, ["gru", "gru"] requires two unit counts, such as [128, 64].
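
    The two rules above (matching list lengths, and an output vector twice the size of the last layer) can be checked before submitting a configuration. This is an illustrative sketch; encoder_output_size is a hypothetical helper, not a Lucidworks API.

```python
def encoder_output_size(rnn_names_list, rnn_units_list):
    """Validate the two RNN lists and return the encoder's output vector size."""
    if len(rnn_names_list) != len(rnn_units_list):
        raise ValueError(
            "rnn_names_list and rnn_units_list must have the same length, "
            f"got {len(rnn_names_list)} and {len(rnn_units_list)}"
        )
    for name in rnn_names_list:
        if name not in ("gru", "lstm"):
            raise ValueError(f"unsupported RNN layer: {name!r}")
    # Bi-directional RNN: the output is twice the unit count of the last layer.
    return 2 * rnn_units_list[-1]


print(encoder_output_size(["gru"], [128]))             # 256
print(encoder_output_size(["gru", "gru"], [128, 64]))  # 128
```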

    Advanced custom configuration parameters

    This section describes the most common advanced custom configuration parameters you can alter. Modifying the values does not typically provide a significant boost in quality. However, setting values incorrectly may cause serious quality degradation.

    Advanced model parameters

    • "trainer_config.trn_batch_size": null. The batch size used for a single model training update. The default is null, which means an appropriate batch size is determined automatically based on the dataset size.

    • "trainer_config.num_epochs": 64. The number of training epochs to run, unless training stops earlier because of monitor_patience. An epoch is one full cycle in which the training data passes through the model. During one epoch, the model processes all the training data examples (queries and index documents) at least one time.

    • "trainer_config.monitor_patience": 8. The number of epochs training continues without an improvement in the monitored validation metric before it stops. The best model state based on the monitor validation metric is used as the final model.

      • For the general RNN, the mrr@3 metric is monitored and the monitor_patience default value is 8.

      • For the eCommerce RNN, the ndcg@5 metric is monitored and the monitor_patience default value is 16.

    • "trainer_config.encoder_config.emb_spdp": 0.3. This field provides a regularization effect, which helps reduce overfitting. The regularization is applied between the token embeddings layer and the first recurrent neural network (RNN) layer.

    • "trainer_config.encoder_config.emb_trainable". This field determines whether fine-tuning of the token embeddings, such as word or byte pair encoding (BPE) token vectors, is enabled. Enabling it can improve model quality when queries contain little natural language, which would otherwise negatively impact training. Because the embeddings layer is the largest layer in the network, fine-tuning it requires enough training data to prevent overfitting.

      The default values are:

      • "trainer_config.encoder_config.emb_trainable": false. For mlp_general_rnn models.

      • "trainer_config.encoder_config.emb_trainable": true. For mlp_ecommerce_rnn models.
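
    The interaction between num_epochs and monitor_patience can be sketched as a generic early-stopping loop. This is an illustration of the behavior described above, not the actual Lucidworks AI training code; the function name and the assumption that the patience counter resets after each improvement are hypothetical.

```python
def simulate_early_stopping(metric_per_epoch, num_epochs=64, monitor_patience=8):
    """Return (best_epoch, epochs_run) for a sequence of validation metric values."""
    best_value = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, value in enumerate(metric_per_epoch[:num_epochs], start=1):
        epochs_run = epoch
        if value > best_value:
            # New best model state: remember it and reset the patience counter.
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= monitor_patience:
                break  # No improvement for monitor_patience epochs: stop.
    return best_epoch, epochs_run


# The metric peaks at epoch 3 and never improves again.
scores = [0.10, 0.20, 0.30] + [0.25] * 20
print(simulate_early_stopping(scores, monitor_patience=8))   # (3, 11)
print(simulate_early_stopping(scores, monitor_patience=16))  # (3, 19)
```

    In both runs the model state from epoch 3 would be kept as the final model; a larger patience only delays the stop.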

    Custom configuration examples

    To create a custom configuration, set dataset_config and trainer_config. To avoid diminished model quality, specify only the parameters whose values you want to differ from the defaults.

    For detailed information about dataset_config for the index and query files, see Training data.

    General configuration

    This configuration uses all of the defaults for general RNN training, since no values deviating from the defaults are specified.

    In most cases, this configuration is sufficient if all of these apply:

    • Index Parquet file contains the pkid and text columns

    • Query Parquet file contains the pkid and query columns

    {
      "dataset_config": "mlp_general_rnn",
      "trainer_config": "mlp_general_rnn"
    }

    Advanced configuration

    {
      "dataset_config": "mlp_general_rnn",
      "dataset_config.pkid_col_name": "pkid",
      "dataset_config.index_title_col_name": null,
      "dataset_config.index_desc_col_name": "text",
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": null,
      "dataset_config.metrics_config.monitor_metric": "mrr@3",
      "trainer_config": "mlp_general_rnn",
      "trainer_config/text_processor_config": "word_en",
      "trainer_config.encoder_config.emb_trainable": false,
      "trainer_config.encoder_config.emb_spdp": 0.3,
      "trainer_config.encoder_config.rnn_names_list": ["gru"],
      "trainer_config.encoder_config.rnn_units_list": [128],
      "trainer_config.num_epochs": 64,
      "trainer_config.monitor_patience": 8,
      "trainer_config.trn_batch_size": null
    }

    General configuration with multilingual BPE embeddings

    This configuration uses all of the defaults for general RNN training except "trainer_config/text_processor_config": "bpe_multi". No other values deviate from the defaults.

    This configuration is sufficient if all of these apply:

    • Index Parquet file contains the pkid and text columns

    • Query Parquet file contains the pkid and query columns

    • Index and query data are composed of multilingual text

    {
      "dataset_config": "mlp_general_rnn",
      "trainer_config": "mlp_general_rnn",
      "trainer_config/text_processor_config": "bpe_multi"
    }

    Advanced configuration

    {
      "dataset_config": "mlp_general_rnn",
      "dataset_config.pkid_col_name": "pkid",
      "dataset_config.index_title_col_name": null,
      "dataset_config.index_desc_col_name": "text",
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": null,
      "dataset_config.metrics_config.monitor_metric": "mrr@3",
      "trainer_config": "mlp_general_rnn",
      "trainer_config/text_processor_config": "bpe_multi",
      "trainer_config.encoder_config.emb_trainable": false,
      "trainer_config.encoder_config.emb_spdp": 0.3,
      "trainer_config.encoder_config.rnn_names_list": ["gru"],
      "trainer_config.encoder_config.rnn_units_list": [128],
      "trainer_config.num_epochs": 64,
      "trainer_config.monitor_patience": 8,
      "trainer_config.trn_batch_size": null
    }

    General configuration with token embeddings fine-tuning

    This configuration uses all of the defaults for general RNN training except "trainer_config.encoder_config.emb_trainable": true, which enables token embeddings fine-tuning. No other values deviate from the defaults.

    This configuration is sufficient if all of these apply:

    • Index Parquet file contains the pkid and text columns

    • Query Parquet file contains the pkid and query columns

    • Your data contains a significant number of business-specific or misspelled words, such as in eCommerce use cases

    {
      "dataset_config": "mlp_general_rnn",
      "trainer_config": "mlp_general_rnn",
      "trainer_config.encoder_config.emb_trainable": true
    }

    Advanced configuration

    {
      "dataset_config": "mlp_general_rnn",
      "dataset_config.pkid_col_name": "pkid",
      "dataset_config.index_title_col_name": null,
      "dataset_config.index_desc_col_name": "text",
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": null,
      "dataset_config.metrics_config.monitor_metric": "mrr@3",
      "trainer_config": "mlp_general_rnn",
      "trainer_config/text_processor_config": "word_en",
      "trainer_config.encoder_config.emb_trainable": true,
      "trainer_config.encoder_config.emb_spdp": 0.3,
      "trainer_config.encoder_config.rnn_names_list": ["gru"],
      "trainer_config.encoder_config.rnn_units_list": [128],
      "trainer_config.num_epochs": 64,
      "trainer_config.monitor_patience": 8,
      "trainer_config.trn_batch_size": null
    }
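
    Each short configuration and its advanced counterpart differ only in how many defaults are written out. As a hedged sketch, the following Python reduces a fully spelled-out general RNN configuration to the keys that deviate from the general defaults listed in this section. The names GENERAL_DEFAULTS and minimal_config are hypothetical.

```python
import json

# General RNN defaults as listed in the advanced configuration above.
GENERAL_DEFAULTS = {
    "dataset_config.pkid_col_name": "pkid",
    "dataset_config.index_title_col_name": None,
    "dataset_config.index_desc_col_name": "text",
    "dataset_config.index_body_col_name": None,
    "dataset_config.query_col_name": "query",
    "dataset_config.weight_col_name": None,
    "dataset_config.metrics_config.monitor_metric": "mrr@3",
    "trainer_config/text_processor_config": "word_en",
    "trainer_config.encoder_config.emb_trainable": False,
    "trainer_config.encoder_config.emb_spdp": 0.3,
    "trainer_config.encoder_config.rnn_names_list": ["gru"],
    "trainer_config.encoder_config.rnn_units_list": [128],
    "trainer_config.num_epochs": 64,
    "trainer_config.monitor_patience": 8,
    "trainer_config.trn_batch_size": None,
}


def minimal_config(full_config, defaults):
    """Keep the two base keys plus any key whose value differs from the defaults."""
    base_keys = ("dataset_config", "trainer_config")
    minimal = {key: full_config[key] for key in base_keys}
    for key, value in full_config.items():
        if key in base_keys:
            continue
        if key not in defaults or defaults[key] != value:
            minimal[key] = value
    return minimal


full = {
    "dataset_config": "mlp_general_rnn",
    "trainer_config": "mlp_general_rnn",
    **GENERAL_DEFAULTS,
}
full["trainer_config.encoder_config.emb_trainable"] = True
print(json.dumps(minimal_config(full, GENERAL_DEFAULTS), indent=2))
```

    For the fine-tuning example, the result contains only dataset_config, trainer_config, and the emb_trainable override.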

    Classification configuration for custom embedding model

    Custom embedding models for classification use the following parameters:

    • The Index file contains the:

      • dataset_config.pkid_col_name where label is the default value.

      • dataset_config.index_title_col_name where label is the default value.

      • dataset_config.index_desc_col_name where null is the default value.

      • dataset_config.index_body_col_name where null is the default value.

    • The Query file contains the:

      • dataset_config.pkid_col_name where pkid is the default value and is used for class values.

      • dataset_config.query_col_name where freeform text is the default value.

      • dataset_config.weight_col_name where null is the default value in this positive numeric field.

    The basic eCommerce example uses:

    {
      "dataset_config": "mlp_classification_rnn",
      "trainer_config": "mlp_ecommerce_rnn"
    }

    The basic general example uses:

    {
      "dataset_config": "mlp_classification_rnn",
      "trainer_config": "mlp_general_rnn"
    }

    The basic transformer example uses:

    {
      "dataset_config": "mlp_classification_rnn",
      "trainer_config": "mlp_transformer"
    }

    For more information about the configuration parameters, see Configuration.

    eCommerce configuration

    This configuration uses all of the defaults for eCommerce RNN training, since no values deviating from the defaults are specified.

    In most cases, this configuration is sufficient if all of these apply:

    • Index Parquet file contains the pkid and name columns

    • Query Parquet file contains the pkid, query, and weight columns

    {
      "dataset_config": "mlp_ecommerce_rnn",
      "trainer_config": "mlp_ecommerce_rnn"
    }

    Advanced configuration

    {
      "dataset_config": "mlp_ecommerce_rnn",
      "dataset_config.pkid_col_name": "pkid",
      "dataset_config.index_title_col_name": "name",
      "dataset_config.index_desc_col_name": null,
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": "aggr_count",
      "dataset_config.metrics_config.monitor_metric": "ndcg@5",
      "trainer_config": "mlp_ecommerce_rnn",
      "trainer_config/text_processor_config": "word_en",
      "trainer_config.encoder_config.emb_trainable": true,
      "trainer_config.encoder_config.emb_spdp": 0.3,
      "trainer_config.encoder_config.rnn_names_list": ["gru"],
      "trainer_config.encoder_config.rnn_units_list": [128],
      "trainer_config.num_epochs": 64,
      "trainer_config.monitor_patience": 16,
      "trainer_config.trn_batch_size": null
    }

    eCommerce configuration with Japanese small BPE embeddings

    This configuration uses all of the defaults for eCommerce RNN training except "trainer_config/text_processor_config": "bpe_ja_small". No other values deviate from the defaults.

    This configuration is sufficient if all of these apply:

    • Index Parquet file contains the pkid and text columns

    • Query Parquet file contains the pkid and query columns

    • Index and query data are composed of Japanese text

    {
      "dataset_config": "mlp_ecommerce_rnn",
      "trainer_config": "mlp_ecommerce_rnn",
      "trainer_config/text_processor_config": "bpe_ja_small"
    }

    Advanced configuration

    {
      "dataset_config": "mlp_ecommerce_rnn",
      "dataset_config.pkid_col_name": "pkid",
      "dataset_config.index_title_col_name": "name",
      "dataset_config.index_desc_col_name": null,
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": "aggr_count",
      "dataset_config.metrics_config.monitor_metric": "ndcg@5",
      "trainer_config": "mlp_ecommerce_rnn",
      "trainer_config/text_processor_config": "bpe_ja_small",
      "trainer_config.encoder_config.emb_trainable": true,
      "trainer_config.encoder_config.emb_spdp": 0.3,
      "trainer_config.encoder_config.rnn_names_list": ["gru"],
      "trainer_config.encoder_config.rnn_units_list": [128],
      "trainer_config.num_epochs": 64,
      "trainer_config.monitor_patience": 16,
      "trainer_config.trn_batch_size": null
    }

    eCommerce configuration with 2 RNN layers and 128 output vector size

    This configuration uses all of the defaults for eCommerce RNN training except two: "trainer_config.encoder_config.rnn_names_list": ["gru", "gru"], which adds a second gru layer, and "trainer_config.encoder_config.rnn_units_list": [128, 64], which sets the second gru layer to 64 units. No other values deviate from the defaults.

    This configuration is sufficient if all of these apply:

    • Index Parquet file contains the pkid and text columns

    • Query Parquet file contains the pkid and query columns

    • The model output must be a 128-dimensional vector (the 64 units of the last layer, doubled because the RNN is bi-directional)

    {
      "dataset_config": "mlp_ecommerce_rnn",
      "trainer_config": "mlp_ecommerce_rnn",
      "trainer_config.encoder_config.rnn_names_list": ["gru", "gru"],
      "trainer_config.encoder_config.rnn_units_list": [128, 64]
    }

    Advanced configuration

    {
      "dataset_config": "mlp_ecommerce_rnn",
      "dataset_config.pkid_col_name": "pkid",
      "dataset_config.index_title_col_name": "name",
      "dataset_config.index_desc_col_name": null,
      "dataset_config.index_body_col_name": null,
      "dataset_config.query_col_name": "query",
      "dataset_config.weight_col_name": "aggr_count",
      "dataset_config.metrics_config.monitor_metric": "ndcg@5",
      "trainer_config": "mlp_ecommerce_rnn",
      "trainer_config/text_processor_config": "word_en",
      "trainer_config.encoder_config.emb_trainable": true,
      "trainer_config.encoder_config.emb_spdp": 0.3,
      "trainer_config.encoder_config.rnn_names_list": ["gru", "gru"],
      "trainer_config.encoder_config.rnn_units_list": [128, 64],
      "trainer_config.num_epochs": 64,
      "trainer_config.monitor_patience": 16,
      "trainer_config.trn_batch_size": null
    }

    The following reference describes the configuration parameters for training a custom embedding model through Lucidworks AI.

    dataset_config - dataset_config

    This field is a parent config that sets defaults for what can be used for training and evaluation, along with dataset-specific parameters such as where the dataset is located, which fields to use, and the monitor metric.

    dataset_config.pkid_col_name - string

    This field allows the pkid (primary key ID) column to be mapped to another column name if `pkid` is not present in the columns. The pkid is a unique value for each document. Entries with a duplicate pkid are filtered out. Because not every pkid entry is associated with a query, the catalog index file may contain entries that have no associated query file entry. This field is required if the column name is not the default.

    Default: pkid

    Allowed values: any string

    dataset_config.index_title_col_name - string

    This field allows title to be mapped to another column name if `title` is not present in the columns. If title and desc (description) are both provided in your config, they will need to be concatenated into a single text field at indexing. This is because title+desc are concatenated into a single text during model training. If only one is provided, then it doesn’t matter which field is used.

    Default: eCommerce='name', general=null

    Allowed values: any string, null

    dataset_config.index_desc_col_name - string

    This field allows desc (description) to be mapped to another column name if `desc` is not present in the columns. If title and desc (description) are both provided in your config, they will need to be concatenated into a single text field at indexing. This is because title+desc are concatenated into a single text during model training. If only one is provided, then it doesn’t matter which field is used.

    Default: eCommerce=null, general='text'

    Allowed values: any string, null
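
    When both title and desc are configured, the text indexed at query time needs to be built the same way as during training. The following is a minimal sketch of that concatenation. The function name, the title-first order, and the single-space separator are assumptions made for illustration; this section does not state how the training pipeline joins the two fields.

```python
def build_index_text(title, desc, separator=" "):
    """Join title and description into one text field, skipping empty values.

    Assumption: title comes first and the separator is a single space.
    """
    parts = [part.strip() for part in (title, desc) if part and part.strip()]
    return separator.join(parts)


print(build_index_text("Trail running shoe", "Lightweight mesh upper"))
print(build_index_text(None, "Only a description is available"))
```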

    dataset_config.index_body_col_name - string

    This field allows body to be mapped to another column name if `body` is not present in the columns. The body field is used purely for vocabulary creation and custom token embeddings training. If there is a lengthy text field that doesn’t make sense to use for training, it still might be helpful to use it to improve vocabulary coverage and tokenization.

    Default: null

    Allowed values: any string, null

    dataset_config.query_col_name - string

    This field allows query to be mapped to another column name if `query` is not present in the columns. It is required if not the default.

    Default: query

    Allowed values: any string, null

    dataset_config.weight_col_name - string

    This field allows weight to be mapped to another column name if weight is not present in the columns. It is required if not the default.

    Default: eCommerce='aggr_count', general=null

    Allowed values: any string, null

    dataset_config.metrics_config.monitor_metric - string

    This field sets the monitor metric: the main metric at k that is monitored to decide when to stop training early. Possible main metrics are hit, map, mrr, ndcg, recall, and f1, evaluated at k = 1, 3, 5, or 10. When the dataset_config.metrics_config.monitor_metric value does not increase for a particular number of epochs (controlled by the trainer_config.monitor_patience parameter), training stops.

    Match pattern: (?:hit|map|mrr|ndcg|recall|f1)@(?:1|3|5|10)

    Default: eCommerce='ndcg@5', general='mrr@3'

    Allowed values: hit@1, hit@3, hit@5, hit@10, map@1, map@3, map@5, map@10, mrr@1, mrr@3, mrr@5, mrr@10, ndcg@1, ndcg@3, ndcg@5, ndcg@10, recall@1, recall@3, recall@5, recall@10, f1@1, f1@3, f1@5, f1@10
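
    A monitor metric value can be checked against the match pattern above before submitting a configuration. This sketch uses Python's re module and assumes the whole value must match the pattern, which agrees with the allowed values list; is_valid_monitor_metric is a hypothetical helper name.

```python
import re

# Match pattern copied from the parameter reference above.
MONITOR_METRIC_PATTERN = re.compile(r"(?:hit|map|mrr|ndcg|recall|f1)@(?:1|3|5|10)")


def is_valid_monitor_metric(value: str) -> bool:
    """Return True if the entire value matches the documented pattern."""
    return MONITOR_METRIC_PATTERN.fullmatch(value) is not None


for candidate in ("mrr@3", "ndcg@10", "ndcg@7", "accuracy@5"):
    print(candidate, is_valid_monitor_metric(candidate))
```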

    trainer_config - trainer_config

    This field is a parent config that sets defaults for the kind of text processing applied to the data, the encoder architecture, the loss function and its parameters, the optimizer and its parameters, the learning rate scheduler and its parameters, and the metric names and the range at which they are evaluated.

    trainer_config/text_processor_config - string

    This field determines which type of tokenization and embedding is used as the base for the recurrent neural network (RNN) model. This field only displays for custom models with a TRAINING_FAILED status. For more information, see Lucidworks AI Models API text processors. From that topic, select View API specification for detailed API information.

    Default: word_en

    Allowed values: word_en, bpe_en_small, bpe_en_large, bpe_multi, bpe_bg_small, bpe_bg_large, bpe_de_small, bpe_de_large, bpe_es_small, bpe_es_large, bpe_fr_small, bpe_fr_large, bpe_it_small, bpe_it_large, bpe_ja_small, bpe_ja_large, bpe_ko_small, bpe_ko_large, bpe_nl_small, bpe_nl_large, bpe_ro_small, bpe_ro_large, bpe_zh_small, bpe_zh_large, word_custom, bpe_custom
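
    Because a misspelled text processor name is easy to miss, it may help to validate the value before adding the override. The sketch below rebuilds the allowed values listed above and returns the override key with the forward-slash syntax. The names used here are hypothetical helpers, not part of a Lucidworks library.

```python
# Language codes that have both a small and a large BPE variant in the list above.
BPE_LANGUAGES = ["en", "bg", "de", "es", "fr", "it", "ja", "ko", "nl", "ro", "zh"]

ALLOWED_TEXT_PROCESSORS = {"word_en", "bpe_multi", "word_custom", "bpe_custom"} | {
    f"bpe_{lang}_{size}" for lang in BPE_LANGUAGES for size in ("small", "large")
}


def text_processor_override(name: str) -> dict:
    """Return the override entry for a text processor, or raise if it is unknown."""
    if name not in ALLOWED_TEXT_PROCESSORS:
        raise ValueError(f"unknown text processor: {name!r}")
    # Note the forward slash: this key selects a group of parameters by name.
    return {"trainer_config/text_processor_config": name}


config = {"dataset_config": "mlp_ecommerce_rnn", "trainer_config": "mlp_ecommerce_rnn"}
config.update(text_processor_override("bpe_ja_small"))
print(config)
```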

    trainer_config.encoder_config.emb_trainable - boolean

    This field determines if fine-tuning of the token embeddings is enabled. Examples of token embedding are word or byte pair encoding (BPE) token vectors. If set, it can improve the quality of the model if the query contains less natural language that negatively impacts training. Because the embeddings layer is the largest layer in the network, the process to improve the model requires enough training data to prevent overfitting.

    Default: eCommerce=true, general=false

    trainer_config.encoder_config.emb_spdp - float

    This field provides a regularization effect, which helps reduce overfitting. The regularization is applied between the token embeddings layer and the first recurrent neural network (RNN) layer. This parameter rarely requires a change from the default.

    <= 1

    Default: 0.3

    Allowed values: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1

    trainer_config.encoder_config.rnn_names_list - List <string>

    This field determines which bi-directional recurrent neural network (RNN) layers are used. The length of this list must match the length of the trainer_config.encoder_config.rnn_units_list list.

    Default: [ 'gru' ]

    Allowed values: gru, lstm

    trainer_config.encoder_config.rnn_units_list - List <integer>

    The number of units for each recurrent neural network (RNN) layer. Because this is a bi-directional RNN, the encoder’s vector size is two times larger than the number of units in the last layer. For example, if one layer is 128 units, the output vector size is 256.

    Default: [ 128 ]

    Allowed values: 16, 32, 64, 128, 256, 512

    trainer_config.num_epochs - integer

    The number of epochs the training data must complete. An epoch is a full cycle where training data passes through the designated algorithms. During one epoch, the model processes all the training data examples (queries and index documents) at least one time.

    >= 1

    Default: 64

    trainer_config.monitor_patience - integer

    The number of epochs the training passes before it stops if there is no validation metric improvement during the epochs. The best model state based on the monitor validation metric is used as the final model. Monitor patience and monitor metric are interdependent.

    >= 1

    Default: eCommerce=16, general=8

    trainer_config.trn_batch_size - integer

    The batch size used for a single model training update. The default is `null`, which means an appropriate batch size is determined automatically based on the dataset size.

    >= 1

    Default: null