Below is a collection of useful tips and tricks that can help improve results.
Try the pipeline with the Filter Stop Words stage both enabled and disabled, since some stop words may carry special meaning in a particular dataset. Also, although this stage is usually applied after the Deep Encoding stage and before the Solr Query stage, it can sometimes help to apply it before the Deep Encoding stage, especially in the cold-start scenario.
We noticed a reasonable improvement when the text_en_splitting field type is used for indexed questions or answers. This field type applies stemming and stop word removal internally and splits compound words, so the initial candidate retrieval from Solr is less restrictive and leads to better results.
If several fields are encoded and used for final score ensembling, some fields might be more important than others, which you can see from their weight coefficients. In that case, boosting the corresponding text fields might provide better results. For example, boost question_t to 3 but leave answer_t at 1.
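As a rough illustration of the boosting tip, the sketch below builds Solr request parameters with per-field boosts. It assumes the query is handled by Solr's edismax parser, where boosts are written as field^boost in the qf parameter; the helper name build_boosted_params and the sample query are hypothetical, and in Fusion the same values would normally be set in the query stage configuration rather than in code.

```python
from urllib.parse import urlencode


def build_boosted_params(query, field_boosts):
    """Build Solr edismax parameters with per-field boosts in `qf`.

    Hypothetical helper for illustration; assumes the edismax parser is used.
    """
    # edismax expects space-separated "field^boost" entries in qf
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"q": query, "defType": "edismax", "qf": qf}


# Boost question_t to 3 and leave answer_t at 1, as in the tip above
params = build_boosted_params(
    "how do I reset my password",
    {"question_t": 3, "answer_t": 1},
)
query_string = urlencode(params)  # URL-encoded form, ready to append to a select URL
```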
In some cases, it might be better to train two separate models for QA and QQ matching and ensemble them, for example, when there is an initial clean QA dataset of FAQs from a website and an additional QQ dataset of real user queries mapped to the existing pool of questions.
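One simple way to picture ensembling two models is a weighted sum of their per-document scores. The sketch below is only an assumption about how such an ensemble could look, not the method the product uses: the function name ensemble_scores, the weights, and the sample scores are all hypothetical, and it presumes both models produce scores on a comparable scale.

```python
def ensemble_scores(qa_scores, qq_scores, qa_weight=0.5, qq_weight=0.5):
    """Combine per-document scores from two models with a weighted sum.

    Hypothetical sketch: a document missing from one model contributes 0.0
    for that model. Returns (doc_id, score) pairs, highest score first.
    """
    doc_ids = set(qa_scores) | set(qq_scores)
    combined = {
        doc_id: qa_weight * qa_scores.get(doc_id, 0.0)
        + qq_weight * qq_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up similarity scores from a QA model and a QQ model
qa_scores = {"faq1": 0.9, "faq2": 0.4}
qq_scores = {"faq2": 0.8, "faq3": 0.6}
ranked = ensemble_scores(qa_scores, qq_scores)
```

The weights could be tuned on a validation set if one model turns out to be more reliable than the other.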
When there is very limited content data for the cold-start solution to learn the vocabulary, but the rules engine contains spelling and synonym lists that can be used to improve results, we suggest adding a Text Tagger stage to the query pipeline (after the Query Encoding stage and before the Solr Query stage).
Set the Original Term Boost for Synonyms parameter to -1.