Best Practices

Table of Contents

How to begin
- Use cases for chatbots, virtual assistants, QnA, and FAQ search
- Use cases for never null and eCommerce
Obtain training data
- Use cases for chatbots, virtual assistants, QnA, and FAQ search
- Use cases for never null and eCommerce
How to evaluate results
How to choose a training model
- RNN models
- BERT models
Model vector size
Hyperparameters to tune the training job
How to choose a Milvus index
- Default index
- HNSW index

Below is a collection of useful tips and tricks that can help improve results.

How to begin

Use cases for chatbots, virtual assistants, QnA, and FAQ search

In most cases, Lucidworks recommends you begin with coldstart models.

The coldstart models are suitable only for English, and the large coldstart model increases query time. If English-only support and a moderate query time are acceptable, use a coldstart model.
For multilingual support or fast query time, use a multilingual model.

Coldstart models

Pre-trained coldstart models are:

Robust out-of-the-box models.
Easily configured and do not require training data.
Used to establish a solid baseline against which supervised training results can be compared.

Use cases for never null and eCommerce

For these use cases, the preferred method is to build a training set and use a supervised job. It is usually possible to set up a training dataset. However, if that is not possible, set up an initial coldstart pre-trained model.

The quality of the model and its results may vary because the models are designed for natural language questions, not short, loosely defined queries. To enhance the results:

Collect signal data (if it does not exist).
Build a training set.

Obtain training data

There are multiple ways to obtain training data. The supervised training job uses pairs of questions and answers (or queries and product names) to train deep learning encoders.

Use cases for chatbots, virtual assistants, QnA, and FAQ search

For these use cases:

Base training data typically consists of existing FAQ pair information.
After the coldstart solution is deployed, training data can also be derived from sources such as search logs and customer support correspondence. The data can be constructed and expanded manually.
Expanded training data can be obtained using the augmentation techniques in the Data Augmentation job. For natural language scenarios, backtranslation is an effective way to produce paraphrased versions of the original questions and answers, which improves vocabulary and semantic coverage.

Use cases for never null and eCommerce

Training data for these use cases is primarily collected signals data.

There are different ways to construct training data, and the approach you choose determines how the resulting trained model behaves. The right approach depends on the:

Available types of signals.
Quantity of signals.
Needs of your business.

Examples include:

Query and product pair click signals:
- Help increase the click-through rate (CTR).
- Are usually high volume, but low quality.
Add-to-cart and purchase signals:
- Help increase sales.
- Have a very strong relationship between query and products.
- Are high quality, but low volume.
- Produce a trained model that ranks the products most likely to be purchased above other products in the query results.
Null-search signals result from products selected after a zero-result search or an abandoned search.
Different training datasets can be combined to train a single model, or results from separately trained models can be combined at query time.
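As a rough illustration of how different signal types might be combined into one training set, the sketch below merges click and purchase signals into deduplicated query and product pairs. The record layout (query, product, type), the per-type weights, and the min_weight threshold are assumptions made for this example. They are not the Fusion signals schema or a setting of the supervised training job.

```python
from collections import defaultdict

# Hypothetical weights: add-to-cart and purchase signals are rarer but
# carry a stronger query-to-product relationship than clicks.
SIGNAL_WEIGHTS = {"click": 1.0, "cart": 3.0, "purchase": 5.0}


def build_training_pairs(signals, min_weight=2.0):
    """Aggregate raw signals into (query, product, weight) rows.

    Pairs whose summed weight is below min_weight are dropped, which
    filters out one-off clicks while keeping any cart or purchase signal.
    """
    totals = defaultdict(float)
    for s in signals:
        # Normalize the query so "Red Shoes" and "red shoes" are one key.
        key = (s["query"].strip().lower(), s["product"])
        totals[key] += SIGNAL_WEIGHTS.get(s["type"], 0.0)
    rows = [(q, p, w) for (q, p), w in totals.items() if w >= min_weight]
    # Strongest pairs first; ties broken alphabetically for stable output.
    return sorted(rows, key=lambda r: (-r[2], r[0], r[1]))


signals = [
    {"query": "red shoes", "product": "Runner X", "type": "click"},
    {"query": "Red Shoes", "product": "Runner X", "type": "purchase"},
    {"query": "red shoes", "product": "Sandal Y", "type": "click"},
]
pairs = build_training_pairs(signals)
```

Raising min_weight pushes the training set toward the high-quality, low-volume signals. Lowering it admits more click data.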

How to evaluate results

The supervised training job incorporates an evaluation mechanism. A hold-out validation set:

Is constructed automatically from unique questions and queries.
Can be sized in the job configuration.
Provides the ability to:
- Evaluate how well models generalize to new, unseen queries.
- Help prevent overfitting.
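The job builds the validation set automatically, so the following is only a sketch of the underlying idea: hold out whole queries, not individual pairs, so that validation measures behavior on unseen queries. The function name, the default fraction, and the seed are assumptions for illustration. This is not the job's actual implementation.

```python
import random


def split_by_unique_query(pairs, validation_fraction=0.1, seed=42):
    """Split (query, answer) pairs so that no query appears in both sets."""
    queries = sorted({q for q, _ in pairs})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(queries)
    n_val = max(1, int(len(queries) * validation_fraction))
    held_out = set(queries[:n_val])
    train = [(q, a) for q, a in pairs if q not in held_out]
    validation = [(q, a) for q, a in pairs if q in held_out]
    return train, validation


# 20 unique questions, each paired with 3 answers.
pairs = [(f"question {i % 20}", f"answer {i}") for i in range(60)]
train, validation = split_by_unique_query(pairs, validation_fraction=0.2)
```

Splitting on individual pairs instead would let the same query land in both sets, which makes validation scores look better than the model's behavior on new queries.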

You can also Evaluate a Smart Answers Query Pipeline to:

Ensure configuration is set correctly.
Compare different models or pipelines.
Perform an end-to-end evaluation of the entire query pipeline.

How to choose a training model

Select the model that best fits your training needs.

RNN models

RNN models support most use cases, including digital commerce and heavy query traffic domains. The models also provide:

Easily understood training.
Extremely fast query time.

While several models are available, Lucidworks recommends beginning with word_en_300d_2M in most cases. If the use case requires support for other languages, use bpe_multilingual or one of the specific bpe models for CJK languages.

BERT models

BERT models:

Are only recommended if you have GPU access. Transformer-based models can use only a single GPU and do not scale across multiple GPUs, even when several are available.
Are suitable for natural language queries, chatbots, virtual assistants, and low query traffic domains.
Provide better semantic quality results, especially if only a small amount of training data is available.
Are more expensive to use for training.
Run more slowly at query time than RNN models.

Model vector size

Pre-trained models produce vectors with a dimension of 512.

Vector dimensions for supervised job trained models are highlighted at the end of the training logs, or can be derived from the model and parameters:

BERT-based model vectors have a dimension of 768.
RNN-based model vectors have a dimension of two times the size of the last specified layer.
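The rules above can be written as a small helper, which is handy when sizing a vector index before training finishes. The function name and the model type labels are assumptions for this sketch. The dimensions themselves come from the text above.

```python
def vector_dimension(model_type, rnn_units=None):
    """Return the output vector dimension for a given kind of model.

    model_type: "pretrained", "bert", or "rnn" (labels chosen for this sketch).
    rnn_units:  list of layer sizes for RNN models, for example [128] or [128, 64].
    """
    model_type = model_type.lower()
    if model_type == "pretrained":
        return 512
    if model_type == "bert":
        return 768
    if model_type == "rnn":
        if not rnn_units:
            raise ValueError("rnn_units must list at least one layer size")
        # The vector size is two times the size of the last specified layer.
        return 2 * rnn_units[-1]
    raise ValueError(f"unknown model type: {model_type}")


one_layer_dim = vector_dimension("rnn", [128])
two_layer_dim = vector_dimension("rnn", [128, 64])
```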

Hyperparameters to tune the training job

Most of the parameters in the supervised training job have effective default values. Some of the parameters are determined automatically from the dataset statistics.

Specific parameters to review are listed below, but see Advanced Model Training Configuration for Smart Answers for detailed information about tuning the training job.

General encoder parameters

These parameters are common to BERT and RNN models. The most significant parameter in this section is Fine-tune Token Embeddings. The parameter is disabled by default to prevent overfitting. However, enabling it can improve results for queries with multiple variations or jargon, such as queries in the eCommerce domain.

RNN encoder parameters

Dropout Ratio: Tune in the [0.1, 0.5] range to improve regularization and prevent overfitting. Customary values are 0.15 and 0.3.
RNN Function (Units) List: The parameters determine the number of layers to use and the output vector size.

One layer is typically sufficient, and Lucidworks does not recommend using over two layers. Adding a smaller second layer produces a smaller vector size that reduces the vector index size and might reduce query time.

For example, suppose the current configuration is a one-layer network with 128 units (256 vector size) that provides good results. To improve QPS and index a very large collection without losing quality, add a small second layer with 64 units (128 vector size) instead of replacing the existing layer with a single small layer. The two-layer network slightly increases encoding time, but each vector takes half as much memory, which may improve search time.
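To make the memory trade-off concrete, the sketch below estimates the raw storage needed for the vectors alone. It assumes each vector component is stored as a 4-byte float and ignores any overhead added by the index structure. The 10 million document collection is a made-up figure for illustration.

```python
def raw_vector_bytes(num_documents, last_layer_units, bytes_per_value=4):
    """Estimate raw vector storage, excluding index overhead.

    Assumes 4-byte floats by default; the vector size is two times
    the number of units in the last RNN layer.
    """
    vector_size = 2 * last_layer_units
    return num_documents * vector_size * bytes_per_value


num_documents = 10_000_000  # hypothetical collection size
last_layer_128 = raw_vector_bytes(num_documents, 128)  # 256-dimension vectors
last_layer_64 = raw_vector_bytes(num_documents, 64)    # 128-dimension vectors

print(f"128 units: {last_layer_128 / 1e9:.2f} GB")
print(f" 64 units: {last_layer_64 / 1e9:.2f} GB")
```

Under these assumptions, ending the network with a 64-unit layer halves the raw vector storage compared with ending it at 128 units.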

Training and evaluation parameters

To reduce training time, especially when training on a CPU, lower the values of the following parameters:

Number of Epochs
Monitor Patience

To allow longer training, increase the values of these parameters.

The Training Batch Size parameter affects the duration of an epoch and how often validation is performed. Lucidworks recommends setting the following:

At least 100 batches per training epoch
Use bigger batches with larger training datasets
Values of:
- [32, 64] for small training sets
- [128, 256] for medium training sets
- [512, 1024, 2048] for large training sets

Due to the size of BERT-based models, the recommended training batch size values for them are [8, 16, 32].
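One way to apply these recommendations together is to pick the largest listed batch size that still leaves at least 100 batches per epoch. The helper below is a sketch of that heuristic, not part of the training job. The function name and the fallback to the smallest candidate for very small datasets are assumptions.

```python
RNN_BATCH_SIZES = [32, 64, 128, 256, 512, 1024, 2048]
BERT_BATCH_SIZES = [8, 16, 32]


def suggest_batch_size(num_training_pairs, candidates=RNN_BATCH_SIZES,
                       min_batches=100):
    """Return the largest candidate that still yields min_batches per epoch."""
    fitting = [b for b in candidates if num_training_pairs // b >= min_batches]
    # Assumption: fall back to the smallest candidate when the training set
    # is too small to reach min_batches with any listed value.
    return max(fitting) if fitting else min(candidates)


small = suggest_batch_size(5_000)
medium = suggest_batch_size(100_000)
large = suggest_batch_size(1_000_000)
bert = suggest_batch_size(2_000, candidates=BERT_BATCH_SIZES)
```

Treat the result as a starting point and adjust it against the validation metrics and the memory available on the training hardware.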

How to choose a Milvus index

Milvus supports multiple indexes. See the following Milvus documentation:

Vector index for detailed descriptions of the indexes.
Performance FAQ for performance information.

Default index

The default (FLAT) index provides:

The best possible quality (typically 100% of the model quality), but the slowest query time.
A reasonable query time for low traffic use cases under 250 QPS.

HNSW index

The HNSW index:

Supports higher QPS.
Supports collections of more than a few million documents.
Needs to be configured with the parameter values detailed in Create Indexes in Milvus Jobs.
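The guidance in this section can be summarized as a simple decision rule. In the sketch below, the 250 QPS limit comes from the text above, while the 3,000,000 document threshold is only an assumed stand-in for "a few million". The function does not call Milvus and returns just the index name.

```python
def suggest_milvus_index(expected_qps, num_documents,
                         large_collection=3_000_000):
    """Suggest FLAT or HNSW from expected traffic and collection size.

    large_collection is a hypothetical cutoff; tune it for your deployment.
    """
    if expected_qps < 250 and num_documents <= large_collection:
        return "FLAT"  # best quality, acceptable latency at low traffic
    return "HNSW"      # higher QPS and larger collections


faq_index = suggest_milvus_index(expected_qps=40, num_documents=20_000)
catalog_index = suggest_milvus_index(expected_qps=40, num_documents=12_000_000)
busy_index = suggest_milvus_index(expected_qps=800, num_documents=20_000)
```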