Best Practices
Below is a collection of useful tips and tricks that can help improve results.
Stopwords usage
Try running with the stop word filtering stage both enabled and disabled, since some stop words may carry special meaning in a particular dataset. Also, although this stage is usually applied after the Deep Encoding stage and before the Solr Query stage, it sometimes helps to apply it before the Deep Encoding stage, especially in the cold-start scenario.
Solr field types
We noticed a reasonable improvement when the text_en_splitting field type is used for indexed questions or answers. This field type applies stemming and stop word removal and splits compound words, so the initial candidate retrieval from Solr is less restrictive and leads to better results.
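As a rough illustration, the sketch below builds a Solr Schema API `add-field` command that assigns the text_en_splitting type to a field. The field name `question_txt_split` is hypothetical, and the sketch assumes that the field type is already defined in the collection's schema and that you manage the schema through the Schema API rather than through a UI.

```python
import json


def add_field_command(name, field_type="text_en_splitting"):
    """Build a Schema API 'add-field' command for an indexed, stored field."""
    return {
        "add-field": {
            "name": name,           # hypothetical field name, adjust to your schema
            "type": field_type,     # must already exist in the collection's schema
            "indexed": True,
            "stored": True,
        }
    }


# JSON body that could be POSTed to the collection's /schema endpoint.
payload = json.dumps(add_field_command("question_txt_split"))
print(payload)
```

The payload is only constructed here, not sent, so it can be reviewed before it is applied to a real collection.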
Field boosting
If several fields are encoded and used for final score ensembling, some fields might be more important than others, which you can see from their weight coefficients. In that case, boosting the corresponding text fields might provide better results. For example, boost question_t to 3 but leave answer_t at 1.
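One way to express such boosts is through the `qf` parameter of Solr's eDisMax query parser, which accepts `field^boost` pairs. The sketch below is a minimal example under that assumption; the helper name `build_boosted_params` and the sample query are illustrative, and your pipeline may set boosts through stage configuration instead.

```python
from urllib.parse import urlencode


def build_boosted_params(query, boosts):
    """Build Solr query parameters with per-field boosts in 'qf'."""
    # Each entry becomes "field^boost", for example "question_t^3".
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    return {"q": query, "defType": "edismax", "qf": qf}


params = build_boosted_params(
    "how do I reset my password",          # sample query, for illustration only
    {"question_t": 3, "answer_t": 1},
)
query_string = urlencode(params)  # URL-encoded form, ready to append to a select URL
print(params["qf"])
```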
Using two models
In some cases, it might be better to train two different models, one for QA matching and one for QQ matching, and ensemble them. For example, this applies when there is an initial clean QA dataset with FAQs from a website and an additional QQ dataset with real user queries mapped to the existing pool of questions.
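The idea of ensembling two models can be sketched as a weighted sum of their per-document scores. This is only an illustration: the function name, the weights, and the assumption that both models produce comparable scores keyed by document ID are all hypothetical, and the actual ensembling is configured in the pipeline.

```python
def ensemble_scores(qa_scores, qq_scores, qa_weight=0.5, qq_weight=0.5):
    """Combine QA and QQ model scores per document with a weighted sum.

    A document missing from one model's results contributes 0.0 for that model.
    Returns (doc_id, score) pairs sorted from best to worst.
    """
    doc_ids = set(qa_scores) | set(qq_scores)
    combined = {
        doc_id: qa_weight * qa_scores.get(doc_id, 0.0)
        + qq_weight * qq_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Illustrative scores only; real values come from the two trained models.
qa = {"faq-1": 0.9, "faq-2": 0.2}
qq = {"faq-1": 0.1, "faq-2": 0.8, "faq-3": 0.5}
ranked = ensemble_scores(qa, qq, qa_weight=0.3, qq_weight=0.7)
print([doc_id for doc_id, _ in ranked])
```

Giving the QQ model a higher weight, as in this usage example, favors candidates that match real user phrasing over those that match only the FAQ text.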
Using spelling and synonym lists for the cold start solution
When there is very limited content data for the cold start solution to learn the vocabulary, but the rules engine has spelling and synonym lists that can be used to improve results, we suggest adding a Text Tagger stage to the query pipeline (after the Query Encoding stage and before the Solr Query stage).
Set the Original Term Boost for Synonyms parameter to -1.