Smart Answers
Train a Smart Answers Supervised Model
The training job can run on either GPU or CPU.
If your training data is in a Fusion collection, such as model_training_input, you can specify that collection. Otherwise you can use it directly from the cloud storage.
To train custom embeddings, set the Model Base to word_custom or bpe_custom. This trains Word2Vec on the provided data and specified fields. It can be useful when your content includes unusual or domain-specific vocabulary.

If you have content in addition to the query/response pairs that can be used to train the model, specify it in the Texts Data Path.
When you use the pre-trained embeddings, the log shows the percentage of vocabulary words covered by those embeddings. If this value is low, try using custom embeddings.
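As an illustration of what such a coverage percentage means, here is a minimal sketch; the simple set-based counting and the sample data are invented for illustration and are not Fusion's actual bookkeeping:

```python
# Sketch: estimate what fraction of a corpus vocabulary is covered by a
# pre-trained embedding vocabulary. Data below is illustrative only.

def vocabulary_coverage(corpus_tokens, embedding_vocab):
    """Return the percentage of unique corpus tokens found in the embedding vocabulary."""
    vocab = set(corpus_tokens)
    covered = sum(1 for token in vocab if token in embedding_vocab)
    return 100.0 * covered / len(vocab)

corpus = ["the", "vpn", "token", "expired", "the", "mfa", "push"]
pretrained = {"the", "token", "expired", "push"}  # stand-in for a real embedding vocab

print(round(vocabulary_coverage(corpus, pretrained), 1))  # 4 of 6 unique tokens -> 66.7
```

If this percentage is low, many domain-specific terms receive no pre-trained vector, which is exactly the situation where custom embeddings help.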
The job trains a few (configurable) RNN layers on top of word embeddings, or fine-tunes a BERT model on the provided training data. The resulting model uses an attention mechanism to average word embeddings into a final single dense vector for the content.
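The attention-based averaging can be sketched as follows; the fixed attention scores here stand in for what the trained layers would actually produce:

```python
import math

# Sketch: attention-style pooling of word vectors into one dense vector.
# The attention scores are faked with fixed numbers; in the real model
# they come from trained layers.

def attention_pool(word_vectors, scores):
    """Softmax the per-word scores, then take the weighted average of the vectors."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # attention weights, sum to 1
    dim = len(word_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, word_vectors))
            for i in range(dim)]

# Three 4-dimensional "word embeddings" and their (hypothetical) attention scores.
vectors = [[1.0, 0.0, 0.0, 2.0],
           [0.0, 1.0, 0.0, 2.0],
           [0.0, 0.0, 1.0, 2.0]]
pooled = attention_pool(vectors, scores=[0.1, 2.0, 0.3])
print(len(pooled))  # one dense vector with the same dimensionality: 4
```

However many words the content has, the pooled output always has the encoder's fixed dimensionality, which is why the dimension must match the Milvus collection.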
The encoder output dimension is printed in the training job logs on the Encoder output dim size: line. You might need this information when creating collections in Milvus.

The collection must have the random_* dynamic field defined in its managed-schema.xml. This field is required for sampling the data. If it is not present, add the following entries to the managed-schema.xml: <dynamicField name="random_*" type="random"/> alongside the other dynamic fields, and <fieldType class="solr.RandomSortField" indexed="true" name="random"/> alongside the other field types.

Before using the {app_name}-smart-answers pipelines, you need to create collections in Milvus. Refer to the Milvus documentation page.
Use the {app_name}-smart-answers index pipeline to generate a dense vector representation of answers.

Use the {app_name}-smart-answers query pipeline to conduct run-time neural search. This pipeline transforms the incoming query into a dense vector using the trained model, then compares it with the indexed answer vectors by computing the cosine distance between them. You can also use a query stage to combine Solr and document vector similarity scores at query time.

Configure the Smart Answers pipelines
To configure Smart Answers, you need:

- A Milvus collection.
- The smart-answers index pipeline.
- The smart-answers query pipeline.

The Smart Answers Pre-trained Coldstart models output vectors of 512 dimensions. The dimensionality of encoders trained by the Smart Answers Supervised Training job depends on the provided parameters and is printed in the training job logs.

The Create Collections in Milvus job can be used to create multiple collections at once. In the example shown, the first collection is used in the indexing and query steps. The other two collections are used in the question-and-answer example.
In the index pipeline:

- Set Field to Encode to the document field name to be processed and encoded into dense vectors.
- Verify that Encoder Output Vector matches the output vector from the chosen model.
- Verify that Milvus Collection Name matches the collection name created via the Create Milvus Collection job.
Select Fail on Error in the Encode into Milvus stage and apply the changes. This will cause an error message to display if any settings need to be changed.

In the query pipeline:

- Verify that Encoder Output Vector matches the output vector from the chosen model.
- Verify that Milvus Collection Name matches the collection name created via the Create Milvus Collection job.
- The Milvus Results Context Key can be changed as needed. It will be used in the Milvus Ensemble Query stage to calculate the query score.
- Update the Ensemble math expression as needed, based on your model and the name used in the prior stage for storing the Milvus results.
- You can also set the Threshold so that the Milvus Ensemble Query stage will only return items with a score greater than or equal to the configured value.
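The Threshold behaves as a simple score filter; a minimal sketch with invented document IDs and scores:

```python
# Sketch: keep only documents whose ensemble score is at or above a threshold,
# as the Milvus Ensemble Query stage does. The data is illustrative.

def apply_threshold(docs, threshold):
    """docs is a list of (doc_id, ensemble_score) pairs; keep score >= threshold."""
    return [(doc_id, score) for doc_id, score in docs if score >= threshold]

results = [("doc1", 0.92), ("doc2", 0.55), ("doc3", 0.71)]
print(apply_threshold(results, threshold=0.7))  # doc1 and doc3 survive
```

Note that the comparison is inclusive: a document scoring exactly at the threshold is still returned.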
This example uses the smart-answers index and query pipelines with a few additional changes.

Prior to configuring the Smart Answers pipelines, use the Create Milvus Collection job to create two collections, question_collection and answer_collection, to store the encoded questions and the encoded answers, respectively.

In the Encode Question stage, set Field to Encode to title_t and change the Milvus Collection Name to match the new Milvus collection, question_collection.

In the Encode Answer stage, set Field to Encode to description_t and change the Milvus Collection Name to match the new Milvus collection, answer_collection.
The Milvus Results Context Key needs to be different in each of these two stages:

- In the Query Questions (Milvus Query) stage, set the Milvus Results Context Key to milvus_questions and the Milvus Collection Name to question_collection.
- In the Query Answers (Milvus Query) stage, set the Milvus Results Context Key to milvus_answers and the Milvus Collection Name to answer_collection.

In the Milvus Ensemble Query stage, specify an Ensemble math expression combining the results from the two query stages. If we want the question scores and answer scores weighted equally, we would use: 0.5 * milvus_questions + 0.5 * milvus_answers.

This approach is recommended especially when you have a limited FAQ dataset and want to utilize both question and answer information.
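With the equal-weights expression 0.5 * milvus_questions + 0.5 * milvus_answers, the score for each document works out as a simple weighted sum; a sketch with invented scores:

```python
# Sketch: equal-weight ensemble of the question-side and answer-side Milvus
# similarity scores for one document. The scores below are invented.

def ensemble_score(milvus_questions, milvus_answers, w_q=0.5, w_a=0.5):
    """Combine the two similarity scores, mirroring the ensemble math expression."""
    return w_q * milvus_questions + w_a * milvus_answers

# A document whose title matched the query well but whose description matched less well:
score = ensemble_score(milvus_questions=0.9, milvus_answers=0.6)
print(score)  # the weighted average of 0.9 and 0.6
```

Adjusting the two weights lets you favor question similarity over answer similarity (or vice versa) without changing either Milvus Query stage.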
The Create Milvus Collection job has the following parameters:

- Job ID. A unique identifier for the job.
- Collection Name. A name for the Milvus collection you are creating. This name is used in both the Smart Answers index and query pipelines.
- Dimension. The dimension size of the vectors to store in this Milvus collection. The Dimension should match the size of the vectors returned by the encoder model. For example, if the model was created with either the Smart Answers Coldstart Training job or the Smart Answers Supervised Training job with the Model Base word_en_300d_2M, then the dimension would be 300.
- Index file size. Files with more documents than this will cause Milvus to build an index on this collection.
- Metric. The type of metric used to calculate vector similarity scores. Inner Product is recommended. It produces values between 0 and 1, where a higher value means higher similarity.

The Encode into Milvus index stage encodes the field specified in Field to Encode and stores it in Milvus in the given Milvus collection.
There are several required parameters:

- Model ID. The ID of the model.
- Encoder Output Vector. The name of the field that stores the compressed dense vectors output from the model. Default value: vector.
- Field to Encode. The text field to encode into a dense vector, such as answer_t or body_t.
- Milvus Collection Name. The name of the collection you created via the Create Milvus Collection job, which will store the dense vectors. When creating the collection, you specify the type of Metric to use to calculate vector similarity.
This stage can be used multiple times to encode additional fields, each into a different Milvus collection. See how to index and retrieve the question and answer together.

The Milvus Query stage has the following parameters:

- Model ID. The ID of the model used when configuring the model training job.
- Encoder Output Vector. The name of the output vector from the specified model, which will contain the query encoded as a vector. Defaults to vector.
- Milvus Collection Name. The name of the collection that you used in the Encode into Milvus index stage to store the encoded vectors.
- Milvus Results Context Key. The name of the variable used to store the vector distances. It can be changed as needed. It will be used in the Milvus Ensemble Query stage to calculate the query score for the document.
- Number of Results. The number of highest-scoring results returned from Milvus.

This stage would typically be used the same number of times as the Encode into Milvus index stage, each time with a different Milvus collection and a different Milvus Results Context Key.
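Conceptually, each Milvus Query stage performs a top-N nearest-neighbor lookup, where Number of Results is the N; a minimal sketch using inner-product similarity and invented vectors:

```python
# Sketch: return the N highest inner-product matches for a query vector,
# as a Milvus collection search conceptually does. Vectors are illustrative.

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_n(query, indexed, n):
    """indexed maps doc_id -> vector; return the n best (doc_id, score) pairs."""
    scored = [(doc_id, inner_product(query, vec)) for doc_id, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

answers = {
    "a1": [0.1, 0.9],
    "a2": [0.7, 0.7],
    "a3": [0.9, 0.1],
}
print(top_n([1.0, 0.0], answers, n=2))  # best match a3, then a2
```

Milvus does this lookup with approximate-nearest-neighbor indexes rather than the brute-force scan shown here, but the inputs and outputs are the same shape: a query vector in, the N highest-scoring document IDs and scores out.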
The Milvus Ensemble Query stage computes the ensemble score, which is used to return the best matches. It has the following parameters:

- Ensemble math expression. The mathematical expression used to calculate the ensemble score. It should reference the variable name(s) specified in the Milvus Results Context Key parameter of each Milvus Query stage.
- Result field name. The name of the field used to store the ensemble score. It defaults to ensemble_score.
- Threshold. A parameter that filters the stage results to remove items that fall below the configured score. Items with a score at or above the threshold will be returned.

The stage inserts the final score, ensemble_score, into each of the returned documents, which is particularly useful when there is more than one Milvus Query stage. This stage needs to come after the Solr Query stage.

Evaluate a Smart Answers Query Pipeline
To evaluate the query pipeline:

- Create an input collection, such as sa_test_input, and index the test data into that collection.
- Create an output collection, such as sa_test_output.
- Create an evaluation job, such as sa-pipeline-evaluator.
- Specify the input collection (sa_test_input) in the Input Evaluation Collection field.
- Specify the output collection (sa_test_output) in the Output Evaluation Collection field.
- Optionally configure the score scaling function (max, log10, or pow0.5) and the ranking metric, such as mrr@3.
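A metric such as mrr@3 is the mean reciprocal rank truncated at the top three results; a small sketch with invented ranked result lists:

```python
# Sketch: mean reciprocal rank at cutoff k (e.g. mrr@3). For each query we take
# 1/rank of the first correct answer within the top k, else 0, then average.

def mrr_at_k(ranked_ids, correct_id, k=3):
    """Reciprocal rank of correct_id within the first k results of one query."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == correct_id:
            return 1.0 / rank
    return 0.0

queries = [
    (["a", "b", "c", "d"], "a"),  # hit at rank 1 -> 1.0
    (["b", "a", "c", "d"], "a"),  # hit at rank 2 -> 0.5
    (["b", "c", "d", "a"], "a"),  # correct answer outside top 3 -> 0.0
]
mrr = sum(mrr_at_k(ranked, correct) for ranked, correct in queries) / len(queries)
print(mrr)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

A higher mrr@3 means the pipeline places the correct answer nearer the top of the first three results, on average.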
Extract short answers from longer documents
To deploy the answer extractor model, create a model deployment job with the following settings:

- Job ID: deploy-answer-extractor
- Model name: answer-extractor
- Docker repository: lucidworks
- Image name: answer-extractor:v1.1
- Output columns: [answer,score,start,end]

Short answer extraction works with the APP_NAME-smart-answers pipelines. Starting with one of those pipelines, add a new Machine Learning stage to the end of the pipeline and configure it as described below.

How to configure short answer extraction in the query pipeline

- question (Required). The name of the field containing the questions.
- context (Required). A string or list of contexts; by default this is the first num_docs_to_extract documents in the output of the previous stage in the pipeline.
If only one question is present with multiple contexts, that question will be applied to every context, and likewise a single context will be applied to multiple questions. If a list of questions and contexts is passed, a 1:1 mapping of questions and contexts will be created in the order in which they're passed.
- topk. The number of answers to return, chosen by order of likelihood. Default: 1
- handle_impossible_answer. Whether or not to deal with a question that has no answer in the context. If true, an empty string is returned when no answer is found. If false, the most probable topk answer(s) are returned regardless of how low the probability score is. Default: True
- batch_size. How many samples to process at a time. Reducing this number will reduce memory usage but increase execution time, while increasing it will increase memory usage and decrease execution time, to a certain extent. Default: 8
- max_context_len. If set to greater than 0, truncate contexts to this length in characters. Default: 5000
- max_answer_len. The maximum length of predicted answers; only answers with a shorter length are considered. Default: 15
- max_question_len. The maximum length of the question after tokenization. It will be truncated if needed. Default: 64
- doc_stride. If the context is too long to fit with the question for the model, it is split into several chunks with some overlap. This argument controls the size of that overlap. Default: 128
- max_seq_len. The maximum length of the total sentence (context + question) after tokenization. The context will be split into several chunks (using doc_stride) if needed. Default: 384

The stage adds the following fields to the output:

- answer. The short answer extracted from the context. This may be blank if handle_impossible_answer=True and topk=1.
- score. The score for the extracted answer.
- start. The start index of the extracted answer in the provided context.
- end. The end index of the extracted answer in the provided context.
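The doc_stride splitting for long contexts can be sketched as overlapping windows; this simplification slides over characters, whereas the real stage slides over tokens:

```python
# Sketch: split a long context into overlapping chunks, in the spirit of
# doc_stride. Real models split token sequences; this version splits characters.

def split_with_stride(context, max_len, stride):
    """Yield windows of max_len characters, each overlapping the previous by stride."""
    step = max_len - stride          # how far each window advances
    chunks = []
    for start in range(0, len(context), step):
        chunks.append(context[start:start + max_len])
        if start + max_len >= len(context):
            break                    # the last window already reaches the end
    return chunks

chunks = split_with_stride("x" * 1000, max_len=384, stride=128)
print(len(chunks))  # windows of 384 chars advancing 256 chars at a time: 4 chunks
```

Each window overlaps the previous one by exactly the stride, so an answer that straddles a chunk boundary is still fully contained in at least one window.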