Custom configuration

Custom configuration is used to train models with advanced parameters in the Custom model training user interface.

Custom configuration scenarios

Custom configuration is typically used in the following scenarios:

Training using the Lucidworks AI Models API, which requires at least the minimal custom configuration JSON.
Training with advanced parameters using either the Custom model training user interface or the Lucidworks AI Models API.

Custom configuration parameters

This section contains the most commonly used and important configuration parameters. You can use these parameters in the Custom model training user interface or the Lucidworks AI Models API.

Model type

To set the dataset and training defaults, enter the appropriate value in the dataset_config and trainer_config fields:

mlp_general_rnn. Used for the general recurrent neural network (RNN) model type.
mlp_ecommerce_rnn. Used for the ecommerce RNN model type.

For example, to set the General RNN or Ecommerce RNN model type, use the corresponding configuration:

General RNN model

{
  "dataset_config": "mlp_general",
  "trainer_config": "mlp_general_rnn"
}

Ecommerce RNN model

{
  "dataset_config": "mlp_ecommerce",
  "trainer_config": "mlp_ecommerce_rnn"
}
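As a minimal sketch, the two model type configurations above can be generated from a lookup table. The helper name minimal_config and the short keys "general" and "ecommerce" are hypothetical and used only for illustration. Only the dataset_config and trainer_config values come from the examples above.

```python
import json

# dataset_config / trainer_config pairs copied from the examples above.
# The short keys "general" and "ecommerce" are illustrative names only.
MODEL_TYPES = {
    "general": {"dataset_config": "mlp_general", "trainer_config": "mlp_general_rnn"},
    "ecommerce": {"dataset_config": "mlp_ecommerce", "trainer_config": "mlp_ecommerce_rnn"},
}


def minimal_config(model_type, overrides=None):
    """Return the minimal configuration for a model type, plus optional overrides."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    config = dict(MODEL_TYPES[model_type])  # copy so the table is never mutated
    config.update(overrides or {})
    return config


# Serialize the configuration as JSON text.
print(json.dumps(minimal_config("general"), indent=2))
```

The resulting JSON text is the same shape as the configurations shown above and can be extended by passing additional flat keys in overrides.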

Model parameters

The parameters nested in trainer_config let you set training and model encoder parameters. The most important parameters are:

"trainer_config/text_processor_config": "word_en". Determines which type of tokenization and embedding is used as the base for the recurrent neural network (RNN) model, for example, word or byte-pair encoding (BPE). For information about values, see Text processor.
Important: The text_processor_config parameter name must use a forward slash (/) instead of a period (.) because it sets a group of parameters using a group name. Other nested parameters use a period (.).
"trainer_config.encoder_config.rnn_names_list": ["gru"]. Determines which bi-directional recurrent neural network (RNN) layers are used. Options include gru and lstm.
"trainer_config.encoder_config.rnn_units_list": [128]. The number of units for each recurrent neural network (RNN) layer.
Because this is a bi-directional RNN, the encoder’s output vector size is twice the number of units in the last layer. For example, if the last layer has 128 units, the output vector size is 256.

The trainer_config.encoder_config.rnn_units_list value must contain the same number of entries as trainer_config.encoder_config.rnn_names_list, with one units value for each RNN layer named in rnn_names_list.
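The two rules above (matching list lengths, and an output vector twice the size of the last layer) can be checked before submitting a configuration. The following is a sketch only. The function name encoder_output_size is hypothetical, and the check that layer names are limited to gru and lstm assumes that the options listed above are the only valid ones.

```python
# Layer names listed as options in this section; treating them as the
# complete set of valid names is an assumption.
ALLOWED_RNN_NAMES = {"gru", "lstm"}


def encoder_output_size(rnn_names_list, rnn_units_list):
    """Validate the two RNN lists and return the encoder's output vector size."""
    if not rnn_names_list:
        raise ValueError("rnn_names_list must contain at least one layer")
    if len(rnn_names_list) != len(rnn_units_list):
        raise ValueError(
            f"rnn_names_list has {len(rnn_names_list)} entries but "
            f"rnn_units_list has {len(rnn_units_list)}"
        )
    unknown = [name for name in rnn_names_list if name not in ALLOWED_RNN_NAMES]
    if unknown:
        raise ValueError(f"unknown RNN layer names: {unknown}")
    # Bi-directional RNN: the output is twice the units of the last layer.
    return 2 * rnn_units_list[-1]


print(encoder_output_size(["gru"], [128]))             # 256
print(encoder_output_size(["gru", "gru"], [128, 64]))  # 128
```

The second call shows that adding a 64-unit layer after a 128-unit layer reduces the output vector size to 128.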

Advanced custom configuration parameters

This section describes the most common advanced custom configuration parameters you can alter. Modifying the values does not typically provide a significant boost in quality. However, setting values incorrectly may cause serious quality degradation.

Advanced model parameters

"trainer_config.trn_batch_size": null. The batch size used for a single model training update. When the field is set to null, which is the default, an appropriate batch size is determined automatically based on the dataset size.
"trainer_config.num_epochs": 64. The number of epochs to train for. An epoch is one full pass of the training data through the model. During one epoch, the model processes all the training data examples (queries and index documents) at least one time.
"trainer_config.monitor_patience": 8. The number of epochs that training continues without improvement in the monitored validation metric before it stops. The best model state based on the monitored validation metric is used as the final model.
- For the general RNN, the mrr@3 metric is monitored and the monitor_patience default value is 8.
- For the ecommerce RNN, the ndcg@5 metric is monitored and the monitor_patience default value is 16.
"trainer_config.encoder_config.emb_spdp": 0.3. This field provides a regularization effect, which helps reduce overfitting. The regularization is applied between the token embeddings layer and the first recurrent neural network (RNN) layer.
"trainer_config.encoder_config.emb_trainable". This field determines whether fine-tuning of the token embeddings is enabled. Examples of token embeddings are word or byte-pair encoding (BPE) token vectors. When enabled, fine-tuning can improve model quality if the queries contain little natural language, which would otherwise negatively impact training. Because the embeddings layer is the largest layer in the network, fine-tuning requires enough training data to prevent overfitting.
The default values are:
- "trainer_config.encoder_config.emb_trainable": false. For mlp_general models.
- "trainer_config.encoder_config.emb_trainable": true. For mlp_ecommerce models.
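The interaction between num_epochs and monitor_patience can be illustrated with a generic early-stopping loop. This is a sketch and not the actual trainer. It assumes that a higher metric value is better (as with mrr@3 and ndcg@5) and that patience counts epochs since the last improvement. The function name simulate_training is hypothetical, and the metric values are made up for illustration.

```python
def simulate_training(metric_per_epoch, num_epochs=64, monitor_patience=8):
    """Return (best_epoch, best_metric, epochs_run) for a list of validation metric values."""
    best_metric = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    epochs_run = 0
    # Never run more than num_epochs epochs.
    for epoch, metric in enumerate(metric_per_epoch[:num_epochs], start=1):
        epochs_run = epoch
        if metric > best_metric:
            # Improvement: remember this state and reset the patience counter.
            best_metric, best_epoch = metric, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= monitor_patience:
                break  # patience exhausted, stop early
    return best_epoch, best_metric, epochs_run


# Illustrative metric values: the best value is reached at epoch 3.
scores = [0.40, 0.45, 0.50, 0.49, 0.48, 0.47, 0.46]
print(simulate_training(scores, num_epochs=64, monitor_patience=3))  # (3, 0.5, 6)
```

In this run, training stops after epoch 6 because three epochs passed without improvement, and the state from epoch 3 would be kept as the final model.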

Custom configuration examples

To create a custom configuration, set a dataset_config and trainer_config. To avoid degrading the quality of the trained model, specify only the parameters whose values differ from the defaults.

For detailed information about dataset_config for the index and query files, see use case training data.
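One way to keep a configuration minimal is to compare it with the defaults and keep only the differences. The following sketch assumes the general RNN defaults are the values listed in the general advanced configuration in this section. The names GENERAL_DEFAULTS and only_overrides are hypothetical.

```python
import json

# Defaults copied from the general advanced configuration in this section.
GENERAL_DEFAULTS = {
    "dataset_config": "mlp_general",
    "dataset_config.pkid_col_name": "pkid",
    "dataset_config.index_title_col_name": None,
    "dataset_config.index_desc_col_name": "text",
    "dataset_config.index_body_col_name": None,
    "dataset_config.query_col_name": "query",
    "dataset_config.weight_col_name": None,
    "dataset_config.metrics_config.monitor_metric": "mrr@3",
    "trainer_config": "mlp_general_rnn",
    "trainer_config/text_processor_config": "word_en",
    "trainer_config.encoder_config.emb_trainable": False,
    "trainer_config.encoder_config.emb_spdp": 0.3,
    "trainer_config.encoder_config.rnn_names_list": ["gru"],
    "trainer_config.encoder_config.rnn_units_list": [128],
    "trainer_config.num_epochs": 64,
    "trainer_config.monitor_patience": 8,
    "trainer_config.trn_batch_size": None,
}

_MISSING = object()  # marks keys that have no known default


def only_overrides(full_config, defaults):
    """Keep dataset_config, trainer_config, and every key that differs from its default."""
    required = ("dataset_config", "trainer_config")
    minimal = {key: full_config[key] for key in required}
    for key, value in full_config.items():
        if key not in required and defaults.get(key, _MISSING) != value:
            minimal[key] = value
    return minimal


# Start from the defaults and change only the text processor.
full = dict(GENERAL_DEFAULTS)
full["trainer_config/text_processor_config"] = "bpe_multi"
print(json.dumps(only_overrides(full, GENERAL_DEFAULTS), indent=2))
```

The output contains only dataset_config, trainer_config, and the changed text processor value.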

General configuration

This configuration uses all of the defaults for general RNN training because it specifies no values that deviate from the defaults. In most cases, this configuration is sufficient if all of these apply:

Index Parquet file contains the pkid and text columns
Query Parquet file contains the pkid and query columns

Basic configuration

{
  "dataset_config": "mlp_general",
  "trainer_config": "mlp_general_rnn"
}

Advanced configuration

{
  "dataset_config": "mlp_general",
  "dataset_config.pkid_col_name": "pkid",
  "dataset_config.index_title_col_name": null,
  "dataset_config.index_desc_col_name": "text",
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": null,
  "dataset_config.metrics_config.monitor_metric": "mrr@3",
  "trainer_config": "mlp_general_rnn",
  "trainer_config/text_processor_config": "word_en",
  "trainer_config.encoder_config.emb_trainable": false,
  "trainer_config.encoder_config.emb_spdp": 0.3,
  "trainer_config.encoder_config.rnn_names_list": ["gru"],
  "trainer_config.encoder_config.rnn_units_list": [128],
  "trainer_config.num_epochs": 64,
  "trainer_config.monitor_patience": 8,
  "trainer_config.trn_batch_size": null
}

General configuration with multilingual BPE embeddings

This configuration uses all of the defaults for general RNN training except "trainer_config/text_processor_config": "bpe_multi". No other values deviate from the defaults. This configuration is sufficient if all of these apply:

Index Parquet file contains the pkid and text columns
Query Parquet file contains the pkid and query columns
Index and query data are composed of multilingual text

Basic configuration

{
  "dataset_config": "mlp_general",
  "trainer_config": "mlp_general_rnn",
  "trainer_config/text_processor_config": "bpe_multi"
}

Advanced configuration

{
  "dataset_config": "mlp_general",
  "dataset_config.pkid_col_name": "pkid",
  "dataset_config.index_title_col_name": null,
  "dataset_config.index_desc_col_name": "text",
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": null,
  "dataset_config.metrics_config.monitor_metric": "mrr@3",
  "trainer_config": "mlp_general_rnn",
  "trainer_config/text_processor_config": "bpe_multi",
  "trainer_config.encoder_config.emb_trainable": false,
  "trainer_config.encoder_config.emb_spdp": 0.3,
  "trainer_config.encoder_config.rnn_names_list": ["gru"],
  "trainer_config.encoder_config.rnn_units_list": [128],
  "trainer_config.num_epochs": 64,
  "trainer_config.monitor_patience": 8,
  "trainer_config.trn_batch_size": null
}

General configuration with token embeddings fine-tuning

This configuration uses all of the defaults for general RNN training except "trainer_config.encoder_config.emb_trainable": true, which enables embedding training. No other values deviate from the defaults. This configuration is sufficient if all of these apply:

Index Parquet file contains the pkid and text columns
Query Parquet file contains the pkid and query columns
Your data contains a significant number of business-specific or misspelled words, such as in ecommerce use cases

Basic configuration

{
  "dataset_config": "mlp_general",
  "trainer_config": "mlp_general_rnn",
  "trainer_config.encoder_config.emb_trainable": true
}

Advanced configuration

{
  "dataset_config": "mlp_general",
  "dataset_config.pkid_col_name": "pkid",
  "dataset_config.index_title_col_name": null,
  "dataset_config.index_desc_col_name": "text",
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": null,
  "dataset_config.metrics_config.monitor_metric": "mrr@3",
  "trainer_config": "mlp_general_rnn",
  "trainer_config/text_processor_config": "word_en",
  "trainer_config.encoder_config.emb_trainable": true,
  "trainer_config.encoder_config.emb_spdp": 0.3,
  "trainer_config.encoder_config.rnn_names_list": ["gru"],
  "trainer_config.encoder_config.rnn_units_list": [128],
  "trainer_config.num_epochs": 64,
  "trainer_config.monitor_patience": 8,
  "trainer_config.trn_batch_size": null
}

Classification configuration for custom embedding model

The custom embedding models for classification use the following parameters:

The Index file contains:
- dataset_config.pkid_col_name where label is the default value
- dataset_config.index_title_col_name where label is the default value
- dataset_config.index_desc_col_name where null is the default value
- dataset_config.index_body_col_name where null is the default value
The Query file contains:
- dataset_config.pkid_col_name where pkid is the default value and is used for class values
- dataset_config.query_col_name where freeform text is the default value
- dataset_config.weight_col_name where null is the default value in this positive numeric field

Ecommerce example

{
  "dataset_config": "mlp_classification",
  "trainer_config": "mlp_ecommerce_rnn"
}

General example

{
  "dataset_config": "mlp_classification",
  "trainer_config": "mlp_general_rnn"
}

Transformer example

The basic transformer example uses the "trainer_config": "mlp_transformer" value.
For example, if the transformer is gte_large_rnn, use:

{
  "dataset_config": "mlp_classification",
  "trainer_config": "gte_large_rnn"
}

For more information about the configuration parameters, see Configuration.

Ecommerce configuration

This configuration uses all of the defaults for ecommerce RNN training because no values deviate from the defaults. In most cases, this configuration is sufficient if all of these apply:

Index Parquet file contains the pkid and name columns
Query Parquet file contains the pkid, query, and weight columns

Basic configuration

{
  "dataset_config": "mlp_ecommerce",
  "trainer_config": "mlp_ecommerce_rnn"
}

Advanced configuration

{
  "dataset_config": "mlp_ecommerce",
  "dataset_config.pkid_col_name": "pkid",
  "dataset_config.index_title_col_name": "name",
  "dataset_config.index_desc_col_name": null,
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": "aggr_count",
  "dataset_config.metrics_config.monitor_metric": "ndcg@5",
  "trainer_config": "mlp_ecommerce_rnn",
  "trainer_config/text_processor_config": "word_en",
  "trainer_config.encoder_config.emb_trainable": true,
  "trainer_config.encoder_config.emb_spdp": 0.3,
  "trainer_config.encoder_config.rnn_names_list": ["gru"],
  "trainer_config.encoder_config.rnn_units_list": [128],
  "trainer_config.num_epochs": 64,
  "trainer_config.monitor_patience": 16,
  "trainer_config.trn_batch_size": null
}

Ecommerce configuration with Japanese small BPE embeddings

This configuration uses all of the defaults for ecommerce RNN training except "trainer_config/text_processor_config": "bpe_ja_small". No other values deviate from the defaults. This configuration is sufficient if all of these apply:

Index Parquet file contains the pkid and text columns
Query Parquet file contains the pkid and query columns
Index and query data are composed of Japanese text

Basic configuration

{
  "dataset_config": "mlp_ecommerce",
  "trainer_config": "mlp_ecommerce_rnn",
  "trainer_config/text_processor_config": "bpe_ja_small"
}

Advanced configuration

{
  "dataset_config": "mlp_ecommerce",
  "dataset_config.pkid_col_name": "pkid",
  "dataset_config.index_title_col_name": "name",
  "dataset_config.index_desc_col_name": null,
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": "aggr_count",
  "dataset_config.metrics_config.monitor_metric": "ndcg@5",
  "trainer_config": "mlp_ecommerce_rnn",
  "trainer_config/text_processor_config": "bpe_ja_small",
  "trainer_config.encoder_config.emb_trainable": true,
  "trainer_config.encoder_config.emb_spdp": 0.3,
  "trainer_config.encoder_config.rnn_names_list": ["gru"],
  "trainer_config.encoder_config.rnn_units_list": [128],
  "trainer_config.num_epochs": 64,
  "trainer_config.monitor_patience": 16,
  "trainer_config.trn_batch_size": null
}

Ecommerce configuration with 2 RNN layers and 128 output vector size

This configuration uses all of the defaults for ecommerce RNN training except "trainer_config.encoder_config.rnn_names_list": ["gru", "gru"], which adds a second GRU layer, and "trainer_config.encoder_config.rnn_units_list": [128, 64], which specifies 64 units for the second GRU layer. No other values deviate from the defaults. This configuration is sufficient if all of these apply:

Index Parquet file contains the pkid and text columns
Query Parquet file contains the pkid and query columns
The model output must be a 128-dimensional vector, which is twice the 64 units of the last bi-directional layer

Basic configuration

{
  "dataset_config": "mlp_ecommerce",
  "trainer_config": "mlp_ecommerce_rnn",
  "trainer_config.encoder_config.rnn_names_list": ["gru", "gru"],
  "trainer_config.encoder_config.rnn_units_list": [128, 64]
}

Advanced configuration

{
  "dataset_config": "mlp_ecommerce",
  "dataset_config.pkid_col_name": "pkid",
  "dataset_config.index_title_col_name": "name",
  "dataset_config.index_desc_col_name": null,
  "dataset_config.index_body_col_name": null,
  "dataset_config.query_col_name": "query",
  "dataset_config.weight_col_name": "aggr_count",
  "dataset_config.metrics_config.monitor_metric": "ndcg@5",
  "trainer_config": "mlp_ecommerce_rnn",
  "trainer_config/text_processor_config": "word_en",
  "trainer_config.encoder_config.emb_trainable": true,
  "trainer_config.encoder_config.emb_spdp": 0.3,
  "trainer_config.encoder_config.rnn_names_list": ["gru", "gru"],
  "trainer_config.encoder_config.rnn_units_list": [128, 64],
  "trainer_config.num_epochs": 64,
  "trainer_config.monitor_patience": 16,
  "trainer_config.trn_batch_size": null
}

General configuration with all_minilm_l6_rnn transformer

This configuration uses the all_minilm_l6_rnn transformer.

{
  "dataset_config": "mlp_general",
  "trainer_config": "all_minilm_l6_rnn"
}

Ecommerce configuration with all_minilm_l6_rnn transformer

This configuration uses the all_minilm_l6_rnn transformer.

{
  "dataset_config": "mlp_ecommerce",
  "trainer_config": "all_minilm_l6_rnn"
}
