Lucidworks AI provides a set of pre-trained embedding models. These models are also publicly available, and most keep their original names, so you can look them up by model name on Hugging Face. To generate a list of models for your organization, see the Lucidworks AI Use Case API. To run requests for individual use cases, see the LWAI Prediction API.

Model families

The pre-trained models provided are in the E5, GTE, BGE, and ARCTIC families. Each model is different, and you may need to index using several different models to determine what is most effective for your dataset and domain.
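One way to compare models on your own data is to index the same documents with each candidate model, run a fixed set of test queries, and score the ranked results against relevance judgments. The following is a minimal sketch of that comparison using recall@k. The query IDs, document IDs, rankings, and judgments are made-up toy values for illustration only; they say nothing about the real quality of any model.

```python
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(
    results: Dict[str, List[str]], judgments: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query that has relevance judgments."""
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in judgments.items()]
    return sum(scores) / len(scores)


# Toy data (hypothetical): ranked document IDs returned per query, per model.
results_by_model = {
    "e5-base-v2": {"q1": ["d1", "d4", "d2"], "q2": ["d7", "d3", "d9"]},
    "gte-base": {"q1": ["d4", "d5", "d1"], "q2": ["d3", "d8", "d7"]},
}
# Toy data (hypothetical): documents judged relevant for each query.
judgments = {"q1": {"d1", "d2"}, "q2": {"d3"}}

scores = {
    model: mean_recall_at_k(results, judgments, k=2)
    for model, results in results_by_model.items()
}
best_model = max(scores, key=scores.get)
```

With real data, use enough queries and judgments for the averages to be meaningful, and weigh the score against vector size and response time.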

E5

Models in the E5 family are state-of-the-art text encoders that are suitable for a wide range of tasks. For details about how E5 models are trained, see e5-base-v2.

E5 Multilingual

The multilingual E5 model family shows very strong quality on our multilingual benchmark. Each document or query in a batch can be in a different language, and a single document or query can contain more than one language.

GTE

The GTE models are a strong family that surpasses the E5 family on public benchmarks. In our benchmarks, they show quality comparable to the E5 family, depending on the particular data and domain.

BGE

The BGE model family shows the best quality among open source models on public benchmarks. In our benchmarks, it shows quality comparable to the E5 and GTE families, again depending on the particular dataset. For more information about the training of BGE models, see the pre-train repository and Hugging Face.

ARCTIC

The Snowflake Arctic text embedding model suite focuses on high-quality retrieval models that are highly competitive on public benchmarks. For more information, see Hugging Face or the Snowflake Labs arctic-embed Git repository.

Model size guide

This table is designed to help you choose the best model size within each model family. Factors to consider regarding model size include:
  • Smaller vector sizes use less space in collections than larger vector sizes.
  • Smaller models are easier to scale and respond more quickly, but their quality may be lower than that of the base and large models.
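| Model size  | Vector size | Quality   | Performance |
|-------------|-------------|-----------|-------------|
| Extra Small | 384         | Medium    | Fast+       |
| Small       | 384         | Medium    | Fast        |
| Base        | 768         | High      | Medium      |
| Large       | 1024        | Very High | Slow        |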
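To make the space trade-off concrete, the following sketch estimates the raw storage needed for the vectors alone at each vector size. It assumes each dimension is stored as a 4-byte float32 value and ignores index structures, metadata, and any compression, so treat the numbers as a rough lower bound rather than a measurement of an actual collection.

```python
# Vector sizes from the model size guide.
VECTOR_SIZES = {"Extra Small": 384, "Small": 384, "Base": 768, "Large": 1024}

# Assumption: each dimension is stored as a 4-byte float32 value.
BYTES_PER_FLOAT32 = 4


def vector_storage_bytes(
    num_docs: int, vector_size: int, bytes_per_value: int = BYTES_PER_FLOAT32
) -> int:
    """Raw bytes needed to store one vector per document."""
    return num_docs * vector_size * bytes_per_value


if __name__ == "__main__":
    num_docs = 1_000_000
    for size_name, dims in VECTOR_SIZES.items():
        mib = vector_storage_bytes(num_docs, dims) / (1024 ** 2)
        print(f"{size_name:<12}{dims:>5} dims  {mib:>10,.1f} MiB")
```

Under these assumptions, one million documents need about 1.5 GB of raw vector data at 384 dimensions and about 4.1 GB at 1024 dimensions.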

Pre-trained embedding models

The following table lists the available pre-trained embedding (vectorization) models that provide semantic vector search (SVS) using L2 normalized vectors with cosine similarity scoring. Click the model name for detailed information.
Model names must be all lowercase.
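| Family | Model                           | Input             | Vector size | Quality   | Performance |
|--------|---------------------------------|-------------------|-------------|-----------|-------------|
|        | all-minilm-l6-v2                | English text      | 384         | Medium    | Fast+       |
|        | text-encoder                    | English text      | 768         | High-     | Medium+     |
| E5     | e5-small-v2                     | English text      | 384         | Medium    | Fast        |
| E5     | e5-base-v2                      | English text      | 768         | High      | Medium      |
| E5     | e5-large-v2                     | English text      | 1024        | Very High | Slow        |
| E5     | multilingual-e5-small           | Multilingual text | 384         | Medium    | Fast        |
| E5     | multilingual-e5-base            | Multilingual text | 768         | High      | Medium      |
| E5     | multilingual-e5-large           | Multilingual text | 1024        | Very High | Slow        |
| GTE    | gte-small                       | English text      | 384         | Medium    | Fast        |
| GTE    | gte-base                        | English text      | 768         | High      | Medium      |
| GTE    | gte-large                       | English text      | 1024        | Very High | Slow        |
| BGE    | bge-small                       | English text      | 384         | Medium    | Fast        |
| BGE    | bge-base                        | English text      | 768         | High      | Medium      |
| BGE    | bge-large                       | English text      | 1024        | Very High | Slow        |
| ARCTIC | snowflake-arctic-embed-xs       | English text      | 384         | Medium    | Fast+       |
| ARCTIC | snowflake-arctic-embed-s        | English text      | 384         | Medium    | Fast        |
| ARCTIC | snowflake-arctic-embed-m-v2.0   | Multilingual text | 768         | High      | Medium-     |
| ARCTIC | snowflake-arctic-embed-l-v2.0   | Multilingual text | 1024        | Very High | Slow        |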
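Because these models return L2 normalized vectors, the cosine similarity of two vectors equals their dot product, so a plain dot product is enough for scoring. The sketch below checks this with small hand-made vectors; the three-dimensional values are illustrative stand-ins, not real model output.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector so that its Euclidean (L2) length is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from scratch, with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Illustrative stand-ins for a query vector and a document vector.
query = l2_normalize([3.0, 4.0, 0.0])
doc = l2_normalize([4.0, 3.0, 0.0])

# For unit-length vectors the two scores agree up to floating-point error.
same_score = math.isclose(dot(query, doc), cosine_similarity(query, doc))
```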

Detailed model information

Each pre-trained model hosted on Lucidworks AI adheres to certain conventions such as input and output keys, batch limits, and normalization. The following conventions are used unless the model information explicitly states differently:
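| Convention         | Type    | Value                                                                              |
|--------------------|---------|------------------------------------------------------------------------------------|
| Model type         | string  | pre-trained                                                                        |
| Input key          | string  | text                                                                               |
| Input type         | string  | Supported language text, truncated in the Prediction API to the first 512 tokens   |
| Output key         | string  | vector                                                                             |
| Output type        | string  | L2 normalized vectors to use with cosine similarity scoring                        |
| Maximum batch size | integer | 32                                                                                 |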
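The conventions above suggest splitting larger workloads into groups of at most 32 inputs, each keyed by text. The following sketch shows one way to do that split on the client side. The outer "batch" wrapper key and the overall payload shape are assumptions made for illustration, and no request is sent; check the LWAI Prediction API documentation for the exact request format.

```python
from typing import Dict, Iterator, List

MAX_BATCH_SIZE = 32  # maximum batch size from the conventions above
INPUT_KEY = "text"   # input key from the conventions above


def build_batches(
    texts: List[str], batch_size: int = MAX_BATCH_SIZE
) -> Iterator[Dict[str, List[Dict[str, str]]]]:
    """Yield request payloads that each hold at most batch_size inputs."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Assumption: inputs are wrapped in a top-level "batch" list.
        yield {"batch": [{INPUT_KEY: text} for text in chunk]}


# Hypothetical input: 70 short documents become batches of 32, 32, and 6.
payloads = list(build_batches([f"document {i}" for i in range(70)]))
```

Remember that each input is truncated to its first 512 tokens, so long documents may need to be split into smaller passages before batching.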