Fusion 5.12

    Pre-trained embedding models

    This feature is currently only available to clients who have contracted with Lucidworks for features related to Neural Hybrid Search and Lucidworks AI.

    Lucidworks AI provides certain pre-trained embedding models. All pre-trained models hosted by Lucidworks are described in this topic. These models are also publicly available, and most model names are unchanged, so you can search for them by name at https://huggingface.co/.

    Model families

    The pre-trained models provided are in the E5, GTE, and BGE families. Each model behaves differently, and you may need to index with several models to determine which is most effective for your dataset and domain.

    E5

    Models in the E5 family are state-of-the-art text encoders that perform well across a wide range of tasks. For details about the training of E5 models, see https://huggingface.co/intfloat/e5-base-v2.

    E5 Multilingual

    The multilingual E5 model family shows very strong quality on our multilingual benchmark. Each document or query in a batch can be in a different language, and different languages can exist within each document or query.

    GTE

    The GTE models are a very strong family of models that surpass the E5 family on public benchmarks. In our benchmarks, the GTE family shows comparable quality to the E5 family, depending on the particular data and domain.

    BGE

    The BGE model family shows the best quality among open-source models on public benchmarks. In our benchmarks, it shows comparable quality to the E5 and GTE families, which again depends on the particular dataset. For more information about the training of BGE models, see the pre-train repo and https://huggingface.co/.

    Model size guide

    This table is designed to help you determine the best model size for each model family. Factors to consider when choosing a model size include:

    • Smaller vector sizes use less space in collections than larger vector sizes.

    • Smaller models are easier to scale and have quicker responses, but quality may be less than the base and large models.
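    To make the space trade-off concrete, the sketch below estimates raw vector storage for each size. It assumes 4-byte float32 components; actual on-disk size depends on the collection's index format and overhead.

    ```python
    # Estimate raw storage for dense vectors of each model size,
    # assuming 4-byte float32 components. Index overhead is not included.
    BYTES_PER_FLOAT = 4

    def vector_storage_mb(vector_size: int, num_docs: int) -> float:
        """Raw megabytes needed to store one vector per document."""
        return vector_size * BYTES_PER_FLOAT * num_docs / 1_000_000

    for size in (384, 768, 1024):
        mb = vector_storage_mb(size, num_docs=1_000_000)
        print(f"{size}d vectors for 1M docs: ~{mb:.0f} MB")
    ```

    For one million documents, 1024d vectors need roughly 2.7 times the raw vector storage of 384d vectors.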

    Model size   Vector size   Quality     Performance
    Small        384           Medium      Fast
    Base         768           High        Medium
    Large        1024          Very High   Slow

    Pre-trained embedding models

    The following table lists the available pre-trained embedding (vectorization) models that provide semantic vector search (SVS) using L2 normalized vectors with cosine similarity scoring. See the Detailed model information section below for more about each model.

    Model names must be all lowercase.
    Family   Model                   Input               Vector size   Quality     Performance
             all-minilm-l6-v2        English text        384           Medium      Fast+
             text-encoder            English text        768           High-       Medium+
    E5       e5-small-v2             English text        384           Medium      Fast
    E5       e5-base-v2              English text        768           High        Medium
    E5       e5-large-v2             English text        1024          Very High   Slow
    E5       multilingual-e5-small   Multilingual text   384           Medium      Fast
    E5       multilingual-e5-base    Multilingual text   768           High        Medium
    E5       multilingual-e5-large   Multilingual text   1024          Very High   Slow
    GTE      gte-small               English text        384           Medium      Fast
    GTE      gte-base                English text        768           High        Medium
    GTE      gte-large               English text        1024          Very High   Slow
    BGE      bge-small               English text        384           Medium      Fast
    BGE      bge-base                English text        768           High        Medium
    BGE      bge-large               English text        1024          Very High   Slow
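    Because these models return L2 normalized vectors, cosine similarity reduces to a plain dot product, which is why the two scoring methods are interchangeable here. A minimal pure-Python illustration (the small vectors below are made up; real model outputs have 384 to 1024 dimensions):

    ```python
    import math

    def l2_normalize(v):
        """Scale a vector to unit length (L2 norm = 1)."""
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    a = l2_normalize([0.5, 1.0, -0.3])
    b = l2_normalize([0.4, 0.9, 0.1])

    # For unit-length vectors the two scores agree to floating-point precision.
    assert abs(cosine(a, b) - dot(a, b)) < 1e-9
    ```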

    Detailed model information

    Each pre-trained model hosted on Lucidworks AI adheres to certain conventions, such as input and output keys, batch limits, and normalization. The following conventions apply unless the model information explicitly states otherwise:

    • Model type: pre-trained

    • Input key: text

    • Input type: Text in a supported language; the Prediction API truncates input text to the first 512 tokens

    • Output key: vector

    • Output type: L2 normalized vectors to use with cosine similarity scoring

    • Maximum batch size: 256
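    The conventions above can be enforced client side before calling the Prediction API. The helper below is a hypothetical sketch (the request shape is illustrative, not the official API schema): it enforces the batch limit, builds entries under the text input key, and verifies that a returned vector is L2 normalized.

    ```python
    import math

    MAX_BATCH_SIZE = 256  # maximum batch size per the conventions above

    def build_payload(texts):
        """Build a batch request using the 'text' input key.

        Raises if the batch exceeds the documented maximum. The Prediction
        API itself truncates each text to the first 512 tokens, so no
        client-side truncation is performed here.
        """
        if len(texts) > MAX_BATCH_SIZE:
            raise ValueError(f"batch of {len(texts)} exceeds max of {MAX_BATCH_SIZE}")
        return [{"text": t} for t in texts]

    def is_l2_normalized(vector, tol=1e-3):
        """True if a returned 'vector' value has unit L2 norm."""
        norm = math.sqrt(sum(x * x for x in vector))
        return abs(norm - 1.0) < tol

    payload = build_payload(["red running shoes", "wireless headphones"])
    ```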

    all-minilm-l6-v2

    The all-minilm-l6-v2 model contains 6 layers and is smaller than any of the other provided models, including the E5, GTE, and BGE small models. It is the fastest model, but it also provides the lowest quality.

    • Intended use: Use this model for semantic textual search when higher scalability is required.

    • Underlying model: all-MiniLM-L6-v2

    • Vector size: 384

    • Supported language: English

    text-encoder

    The multi-qa-distilbert-cos-v1 model is an efficient general text encoder. Its quality and performance fall between those of the small and base model sizes of the E5, GTE, and BGE families.

    • Intended use: Use this model as a general text encoder for semantic vector search if higher quality is needed, but a slower encoding time is not an issue.

    • Underlying model: multi-qa-distilbert-cos-v1

    • Vector size: 768

    • Supported language: English

    e5-small-v2

    The e5-small-v2 model is a smaller version of the E5 encoder and is an effective small-text encoder that provides medium quality and performance.

    • Intended use: Use this model as a starting point in most cases for semantic textual search; its 384d vector size lowers resource consumption and decreases vector search times.

    • Underlying model: e5-small-v2

    • Vector size: 384

    • Supported language: English

    e5-base-v2

    The e5-base-v2 model is the base E5 encoder that provides all-around quality and performance.

    • Intended use: Use this medium-sized model as a starting point for semantic textual search. This model provides higher quality results than e5-small-v2 with slower performance.

    • Underlying model: e5-base-v2

    • Vector size: 768

    • Supported language: English

    e5-large-v2

    The e5-large-v2 model should not be used in a heavy-load environment because its response time is slow.

    • Intended use: Use this large model for semantic vector search where high quality is needed, and slow performance is not an issue. In most cases, it meets or exceeds the quality of OpenAI embeddings such as text-embedding-ada-002.

    • Underlying model: e5-large-v2

    • Vector size: 1024

    • Supported language: English

    multilingual-e5-small

    The multilingual-e5-small model is a smaller version of the E5 Multilingual encoder and is an effective small-text encoder that provides medium quality and performance.

    • Intended use: Use this model as a starting point for multilingual and cross-lingual semantic textual search in over 50 different languages with 384d vector size to lower resource consumption and decrease vector search times.

    • Underlying model: multilingual-e5-small

    • Vector size: 384

    • Supported languages: Afrikaans | Albanian | Amharic | Arabic | Armenian | Assamese | Azerbaijani | Basque | Belarusian | Bengali | Bengali Romanize | Bosnian | Breton | Bulgarian | Burmese | Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian | Czech | Danish | Dutch | English | Esperanto | Estonian | Filipino | Finnish | French | Galician | Georgian | German | Greek | Gujarati | Hausa | Hebrew | Hindi | Hindi Romanize | Hungarian | Icelandic | Indonesian | Irish | Italian | Japanese | Javanese | Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji) | Kyrgyz | Lao | Latin | Latvian | Lithuanian | Macedonian | Malagasy | Malay | Malayalam | Marathi | Mongolian | Nepali | Norwegian | Oriya | Oromo | Pashto | Persian | Polish | Portuguese | Punjabi | Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian | Sindhi | Sinhala | Slovak | Slovenian | Somali | Spanish | Sundanese | Swahili | Swedish | Tamil | Tamil Romanize | Telugu | Telugu Romanize | Thai | Turkish | Ukrainian | Urdu | Urdu Romanize | Uyghur | Uzbek | Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish

    multilingual-e5-base

    The multilingual-e5-base model is the base E5 Multilingual encoder that provides all-around quality and performance.

    • Intended use: Use this model for multilingual and cross-lingual semantic textual search in over 50 different languages.

    • Underlying model: multilingual-e5-base

    • Vector size: 768

    • Supported languages: Afrikaans | Albanian | Amharic | Arabic | Armenian | Assamese | Azerbaijani | Basque | Belarusian | Bengali | Bengali Romanize | Bosnian | Breton | Bulgarian | Burmese | Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian | Czech | Danish | Dutch | English | Esperanto | Estonian | Filipino | Finnish | French | Galician | Georgian | German | Greek | Gujarati | Hausa | Hebrew | Hindi | Hindi Romanize | Hungarian | Icelandic | Indonesian | Irish | Italian | Japanese | Javanese | Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji) | Kyrgyz | Lao | Latin | Latvian | Lithuanian | Macedonian | Malagasy | Malay | Malayalam | Marathi | Mongolian | Nepali | Norwegian | Oriya | Oromo | Pashto | Persian | Polish | Portuguese | Punjabi | Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian | Sindhi | Sinhala | Slovak | Slovenian | Somali | Spanish | Sundanese | Swahili | Swedish | Tamil | Tamil Romanize | Telugu | Telugu Romanize | Thai | Turkish | Ukrainian | Urdu | Urdu Romanize | Uyghur | Uzbek | Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish

    multilingual-e5-large

    The multilingual-e5-large model should not be used in a heavy-load environment because its response time is slow.

    • Intended use: Use this model for multilingual and cross-lingual semantic textual search in over 50 different languages.

    • Underlying model: multilingual-e5-large

    • Vector size: 1024

    • Supported languages: Afrikaans | Albanian | Amharic | Arabic | Armenian | Assamese | Azerbaijani | Basque | Belarusian | Bengali | Bengali Romanize | Bosnian | Breton | Bulgarian | Burmese | Burmese zawgyi font | Catalan | Chinese (Simplified) | Chinese (Traditional) | Croatian | Czech | Danish | Dutch | English | Esperanto | Estonian | Filipino | Finnish | French | Galician | Georgian | German | Greek | Gujarati | Hausa | Hebrew | Hindi | Hindi Romanize | Hungarian | Icelandic | Indonesian | Irish | Italian | Japanese | Javanese | Kannada | Kazakh | Khmer | Korean | Kurdish (Kurmanji) | Kyrgyz | Lao | Latin | Latvian | Lithuanian | Macedonian | Malagasy | Malay | Malayalam | Marathi | Mongolian | Nepali | Norwegian | Oriya | Oromo | Pashto | Persian | Polish | Portuguese | Punjabi | Romanian | Russian | Sanskrit | Scottish Gaelic | Serbian | Sindhi | Sinhala | Slovak | Slovenian | Somali | Spanish | Sundanese | Swahili | Swedish | Tamil | Tamil Romanize | Telugu | Telugu Romanize | Thai | Turkish | Ukrainian | Urdu | Urdu Romanize | Uyghur | Uzbek | Vietnamese | Welsh | Western Frisian | Xhosa | Yiddish

    gte-small

    The gte-small model is a smaller version of the GTE encoder and is an effective small-text encoder that provides medium quality and performance.

    • Intended use: Use this model for semantic textual search with 384d vector size to lower resource consumption and decrease vector search times.

    • Underlying model: gte-small

    • Vector size: 384

    • Supported language: English

    gte-base

    The gte-base model is the base GTE encoder that provides all-around quality and performance.

    • Intended use: Use this medium-sized model as a starting point for semantic textual search. This model provides higher quality results than gte-small with slower performance.

    • Underlying model: gte-base

    • Vector size: 768

    • Supported language: English

    gte-large

    The gte-large model is the largest GTE encoder and should not be used in a heavy-load environment because its response time is slow.

    • Intended use: Use this large model for semantic vector search where high quality is needed, and slow performance is not an issue.

    • Underlying model: gte-large

    • Vector size: 1024

    • Supported language: English

    bge-small

    The bge-small-en-v1.5 model is a smaller version of the BGE encoder and is an effective small-text encoder that provides medium quality and performance.

    • Intended use: Use this model for semantic textual search with 384d vector size to lower resource consumption and decrease vector search times.

    • Underlying model: bge-small-en-v1.5

    • Vector size: 384

    • Supported language: English

    bge-base

    The bge-base-en-v1.5 model is the base BGE encoder that provides all-around quality and performance.

    • Intended use: Use this medium-sized model as a starting point for semantic textual search. This model provides higher quality results than bge-small-en-v1.5 with slower performance.

    • Underlying model: bge-base-en-v1.5

    • Vector size: 768

    • Supported language: English

    bge-large

    The bge-large-en-v1.5 model is the largest version of the BGE encoder. It should not be used in a heavy-load environment because its response time is slow.

    • Intended use: Use this large model for semantic vector search where high quality is needed, and slow performance is not an issue.

    • Underlying model: bge-large-en-v1.5

    • Vector size: 1024

    • Supported language: English