Lucidworks AI custom embedding model training
dataset_config and trainer_config fields:
mlp_general. This is used for the general recurrent neural network (RNN) model type.
mlp_ecommerce. This is used for an ecommerce RNN model type.
The trainer_config fields let you set training and model encoder parameters. The most important parameters are:
"trainer_config/text_processor_config": "word_en"
Determines which type of tokenization and embedding is used as the base for the recurrent neural network (RNN) model, for example, word or byte-pair encoding (BPE). For information about values, see Text processor. This parameter name uses a slash (/) and not a period (.) because it sets a group of parameters using a group name. Other nested parameters use a period (.).
"trainer_config.encoder_config.rnn_names_list": ["gru"]
Determines which bi-directional recurrent neural network (RNN) layers are used. Options include gru and lstm.
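The slash and period naming convention described above can be illustrated with a small sketch. The helper name split_param_key is hypothetical, and the sketch only classifies key names. It assumes nothing about how the training service itself parses them.

```python
from __future__ import annotations


def split_param_key(key: str) -> tuple[str, list[str]]:
    """Classify a flat parameter key (hypothetical helper, not part of any API).

    A key containing "/" names a group of parameters set by a group name.
    Any other key is treated as a period-separated nested path.
    """
    if "/" in key:
        return "group", key.split("/")
    return "nested", key.split(".")


overrides = {
    "trainer_config/text_processor_config": "word_en",
    "trainer_config.encoder_config.rnn_names_list": ["gru"],
}

# Map each key to its kind and its path segments.
kinds = {key: split_param_key(key) for key in overrides}
for key, (kind, path) in kinds.items():
    print(f"{kind:6} {path}")
```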
"trainer_config.encoder_config.rnn_units_list": [128]
The number of units for each recurrent neural network (RNN) layer. Each entry in trainer_config.encoder_config.rnn_units_list applies to the RNN layer at the same position in its similarly-named trainer_config.encoder_config.rnn_names_list. This means rnn_units_list needs to be the same size as rnn_names_list.
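The size requirement for the two lists can be checked before a configuration is submitted. The following is a minimal sketch: the function check_rnn_layers is hypothetical, and the fallback values are the defaults shown in this section (["gru"] and [128]).

```python
RNN_NAMES_KEY = "trainer_config.encoder_config.rnn_names_list"
RNN_UNITS_KEY = "trainer_config.encoder_config.rnn_units_list"
ALLOWED_RNN_NAMES = {"gru", "lstm"}  # options listed in this section


def check_rnn_layers(config: dict) -> list:
    """Pair each RNN layer name with its unit count, or raise ValueError.

    Hypothetical local check. It does not call any Lucidworks API.
    """
    names = config.get(RNN_NAMES_KEY, ["gru"])
    units = config.get(RNN_UNITS_KEY, [128])
    if len(names) != len(units):
        raise ValueError(
            f"rnn_names_list has {len(names)} entries, "
            f"rnn_units_list has {len(units)}"
        )
    unknown = [name for name in names if name not in ALLOWED_RNN_NAMES]
    if unknown:
        raise ValueError(f"unsupported RNN layer names: {unknown}")
    if any(not isinstance(u, int) or u <= 0 for u in units):
        raise ValueError("each unit count must be a positive integer")
    return list(zip(names, units))


layers = check_rnn_layers({
    RNN_NAMES_KEY: ["gru", "gru"],
    RNN_UNITS_KEY: [128, 64],
})
```

Overriding only one of the two lists, for example setting rnn_names_list to two layers while leaving rnn_units_list at its single-entry default, fails this check.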
"trainer_config.trn_batch_size": null
The batch size to be used for a single model training update. By default (null), an appropriate batch size is automatically determined based on the dataset size.
"trainer_config.num_epochs": 64
The number of epochs the training runs for, unless it stops earlier because of monitor_patience. An epoch is a full cycle in which the training data passes through the designated algorithms. During one epoch, the model processes all the training data examples (queries and index documents) at least one time.
"trainer_config.monitor_patience": 8
The number of epochs training continues without validation metric improvement before it stops. The best model state based on the monitored validation metric is used as the final model. The monitored metric and the default value are one of the following:
mrr@3 metric is monitored and the monitor_patience default value is 8.
ndcg@5 metric is monitored and the monitor_patience default value is 16.
"trainer_config.encoder_config.emb_spdp": 0.3
This field provides a regularization effect, which helps reduce overfitting. The regularization is applied between the token embeddings layer and the first recurrent neural network (RNN) layer.
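The interaction between num_epochs and monitor_patience described above can be sketched as a generic early-stopping loop. This is an illustration only: the function train_with_patience and its inputs are hypothetical, and the exact way the training service counts epochs without improvement is an assumption.

```python
def train_with_patience(metric_per_epoch, num_epochs=64, monitor_patience=8):
    """Simulate early stopping over a list of per-epoch validation metrics.

    Returns (best_epoch, last_epoch), both 1-based. Higher metric values are
    treated as better, which holds for ranking metrics such as MRR and NDCG.
    """
    best_metric = float("-inf")
    best_epoch = 0
    last_epoch = 0
    stale_epochs = 0  # consecutive epochs without improvement
    for last_epoch, metric in enumerate(metric_per_epoch[:num_epochs], start=1):
        if metric > best_metric:
            best_metric, best_epoch = metric, last_epoch
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= monitor_patience:
                break  # patience exhausted; keep the best state seen so far
    return best_epoch, last_epoch


# Made-up validation scores: the metric peaks at epoch 3, then stalls.
scores = [0.10, 0.20, 0.30, 0.25, 0.25, 0.25]
result = train_with_patience(scores, monitor_patience=2)
```

In this sketch the state from the best epoch, not the last epoch, is the one kept as the final model.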
"trainer_config.encoder_config.emb_trainable"
This field determines if fine-tuning of the token embeddings is enabled. Examples of token embeddings are word or byte-pair encoding (BPE) token vectors. Enabling it can improve the quality of the model when queries contain little natural language, which would otherwise negatively impact training. Because the embeddings layer is the largest layer in the network, fine-tuning it requires enough training data to prevent overfitting. The default values are:
"trainer_config.encoder_config.emb_trainable": false
For mlp_general models.
"trainer_config.encoder_config.emb_trainable": true
For mlp_ecommerce models.

For dataset_config and trainer_config, only change parameters that need to deviate from the default, to avoid diminishing the quality of the trained model.
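One way to follow the advice of changing only what deviates from the default is to compare a desired configuration against the defaults and keep just the differences. The sketch below assumes the default values listed in this section, using the 8 and false variants for monitor_patience and emb_trainable. The name minimal_overrides is hypothetical.

```python
# Defaults as listed in this section (assumed to form one consistent set).
DEFAULTS = {
    "trainer_config/text_processor_config": "word_en",
    "trainer_config.encoder_config.rnn_names_list": ["gru"],
    "trainer_config.encoder_config.rnn_units_list": [128],
    "trainer_config.trn_batch_size": None,
    "trainer_config.num_epochs": 64,
    "trainer_config.monitor_patience": 8,
    "trainer_config.encoder_config.emb_spdp": 0.3,
    "trainer_config.encoder_config.emb_trainable": False,
}


def minimal_overrides(desired: dict, defaults: dict = DEFAULTS) -> dict:
    """Keep only the parameters whose value differs from the default."""
    return {
        key: value
        for key, value in desired.items()
        if key not in defaults or defaults[key] != value
    }


# Start from the defaults and enable token embeddings fine-tuning.
desired = dict(DEFAULTS)
desired["trainer_config.encoder_config.emb_trainable"] = True
overrides = minimal_overrides(desired)
```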
For dataset_config details for the index and query files, see use case training data.

General configuration
Index file with pkid and text columns
Query file with pkid and query columns

General configuration with multilingual BPE embeddings
"trainer_config/text_processor_config": "bpe_multi"
No other values deviate from the defaults. This configuration is sufficient if all of these apply:
Index file with pkid and text columns
Query file with pkid and query columns

General configuration with token embeddings fine-tuning
"trainer_config.encoder_config.emb_trainable": true
This setting enables embedding training. No other values deviate from the defaults. This configuration is sufficient if all of these apply:
Index file with pkid and text columns
Query file with pkid and query columns

Classification configuration for custom embedding model
dataset_config.pkid_col_name, where label is the default value
dataset_config.index_title_col_name, where label is the default value
dataset_config.index_desc_col_name, where null is the default value
dataset_config.index_body_col_name, where null is the default value
dataset_config.pkid_col_name, where pkid is the default value and is used for class values
dataset_config.query_col_name, where freeform text is the default value
dataset_config.weight_col_name, where null is the default value in this positive numeric field

Ecommerce configuration
Index file with pkid and name columns
Query file with pkid, query, and weight columns

Ecommerce configuration with Japanese small BPE embeddings
"trainer_config/text_processor_config": "bpe_ja_small"
No other values deviate from the defaults. This configuration is sufficient if all of these apply:
Index file with pkid and text columns
Query file with pkid and query columns

Ecommerce configuration with 2 RNN layers and 128 output vector size
"trainer_config.encoder_config.rnn_names_list": ["gru", "gru"]
This setting adds an additional GRU layer. It is combined with "trainer_config.encoder_config.rnn_units_list": [128, 64], which specifies 64 units for the second GRU layer. No other values deviate from the defaults. This configuration is sufficient if all of these apply:
Index file with pkid and text columns
Query file with pkid and query columns

General configuration with all_minilm_l6_rnn transformer
This configuration uses the all_minilm_l6_rnn transformer.

Ecommerce configuration with all_minilm_l6_rnn transformer
This configuration uses the all_minilm_l6_rnn transformer.
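The parameter overrides named in the example configurations above can be collected as plain key-value pairs, sketched here in Python and serialized to JSON. The dictionary labels such as general_multilingual_bpe are hypothetical names for illustration, and how these pairs are embedded in an actual training request is not shown in this text. The helper bidirectional_output_size assumes that a bi-directional layer concatenates its forward and backward outputs, which would explain a 128 output vector size from a final 64-unit layer.

```python
import json

# Overrides taken from the example configurations in this section.
EXAMPLE_OVERRIDES = {
    "general_multilingual_bpe": {
        "trainer_config/text_processor_config": "bpe_multi",
    },
    "general_finetune_embeddings": {
        "trainer_config.encoder_config.emb_trainable": True,
    },
    "ecommerce_japanese_small_bpe": {
        "trainer_config/text_processor_config": "bpe_ja_small",
    },
    "ecommerce_two_rnn_layers": {
        "trainer_config.encoder_config.rnn_names_list": ["gru", "gru"],
        "trainer_config.encoder_config.rnn_units_list": [128, 64],
    },
}


def bidirectional_output_size(units_list):
    """Assumed output size: forward plus backward units of the last layer."""
    return 2 * units_list[-1]


two_layer = EXAMPLE_OVERRIDES["ecommerce_two_rnn_layers"]
output_size = bidirectional_output_size(
    two_layer["trainer_config.encoder_config.rnn_units_list"]
)

# Serialize one example as JSON, for instance to paste into a request body.
payload = json.dumps(two_layer, indent=2)
print(payload)
```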