This job analyzes how your existing documents or signals are categorized and produces a classification model that can be used to predict the categories of new documents or queries.
This job takes raw text and an associated single class as input. Although it trains on single classes, there is an option to predict the top several classes with their scores.
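To illustrate what "top several classes with their scores" means, here is a minimal sketch. It assumes the model's output for one text can be represented as a mapping from class label to score; the function name `top_k_classes` and the sample scores are hypothetical and are not part of the job's actual output format.

```python
import heapq

def top_k_classes(class_scores, k=3):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return heapq.nlargest(k, class_scores.items(), key=lambda item: item[1])

# Hypothetical per-class scores for a single query.
scores = {"tv": 0.71, "jacket": 0.04, "laptop": 0.19, "phone": 0.06}
top = top_k_classes(scores, k=2)
print(top)  # [('tv', 0.71), ('laptop', 0.19)]
```

Training still uses exactly one class per text; only the prediction step returns several candidates.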
At a minimum, you must configure these:
An ID for this job
A Method; Logistic Regression is the default
A Model Deployment Name
The Training Collection
The Training collection content field, the document field containing the raw text
The raw text field that you choose depends on your use case and the types of queries that your users commonly make.
For example, you could choose the description field if users tend to make descriptive queries like "4k TV" or "soft waterproof jacket". But if users are more likely to search for specific brands or products, such as "LG TV" or "North Face jacket", then the product name field might be more suitable.
The Training collection class field containing the classes, labels, or other category data for the text
The first part of the job is vectorization, which is the same for all available classification algorithms. It supports two main types of featurization:
Character-based - for queries or short texts, like document titles, sentences, and so on.
Word-based - for long texts like paragraphs, documents, and so on.
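The difference between the two featurization types can be sketched in plain Python. This is only an illustration of what character and word ngrams are, not the job's internal tokenizer; the function names and the lowercasing and whitespace-splitting choices are assumptions.

```python
def char_ngrams(text, min_n=1, max_n=3):
    """All character ngrams of length min_n..max_n, shortest first."""
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

def word_ngrams(text, min_n=1, max_n=2):
    """All word ngrams of length min_n..max_n over whitespace tokens."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]

print(char_ngrams("4k tv", 3, 3))
# ['4k ', 'k t', ' tv']
print(word_ngrams("soft waterproof jacket"))
# ['soft', 'waterproof', 'jacket', 'soft waterproof', 'waterproof jacket']
```

Character ngrams give even a two-word query several features to work with, which is why they suit short texts. On long documents the same approach produces a very large number of features.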
The second part is classification, using one of these algorithms:
Logistic Regression - A classical algorithm with a good trade-off between training speed and result quality. It provides a robust baseline out of the box. Consider using it as a first choice.
StarSpace - A deep learning algorithm that is trained to maximize the similarity between a text and its correct class while minimizing the similarity between the text and incorrect classes. It usually requires more tuning and training time, but can produce more accurate results. Consider trying it, and then tuning it, if you need better results than Logistic Regression provides.
The third part of the job deploys the new classification model to Fusion using Seldon Core.
These tips describe how to tune the options under Vectorization Parameters for best results with different use cases.
Query intent / short texts
If you want to train a model to predict query intents or to do short text classification, then enable Use Characters.
Another vectorization parameter that can improve model quality is Max Ngram size. Reasonable values for character ngrams are between 3 and 5.
The more character ngrams are used, the bigger the vocabulary becomes, so it is worthwhile to tune the Maximum Vocab Size parameter, which controls how many unique tokens are used. Lower values make training faster and help prevent overfitting, but might also reduce model quality. It's important to find a good balance.
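The effect of a vocabulary cap can be sketched as follows. The sketch assumes the cap keeps the most frequent ngrams across the training texts, which is a common convention but is not confirmed here for this job; `build_vocab` and the sample queries are hypothetical.

```python
from collections import Counter

def build_vocab(texts, max_vocab_size, n=3):
    """Keep only the max_vocab_size most frequent character n-grams."""
    counts = Counter()
    for text in texts:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [ngram for ngram, _ in counts.most_common(max_vocab_size)]

queries = ["lg tv", "4k tv", "smart tv"]
full = build_vocab(queries, max_vocab_size=1000)
capped = build_vocab(queries, max_vocab_size=2)
print(len(full), len(capped))  # 10 2
print(repr(capped[0]))         # ' tv'
```

With a small cap, only ngrams shared by many texts survive. Rare ngrams, which are the ones most likely to cause overfitting, are dropped first.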
Activating the advanced Sublinear TF option usually helps if characters are used.
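As a rough illustration of sublinear term-frequency scaling, the sketch below replaces each raw count tf with 1 + ln(tf). This is the usual definition of the technique; that the Sublinear TF option uses exactly this formula is an assumption, and `term_frequencies` is a hypothetical helper.

```python
import math
from collections import Counter

def term_frequencies(tokens, sublinear=False):
    """Raw counts, or 1 + ln(count) when sublinear scaling is on."""
    counts = Counter(tokens)
    if not sublinear:
        return dict(counts)
    return {tok: 1.0 + math.log(c) for tok, c in counts.items()}

tokens = ["tv", "tv", "tv", "tv", "lg"]
raw = term_frequencies(tokens)
damped = term_frequencies(tokens, sublinear=True)
print(raw)                       # {'tv': 4, 'lg': 1}
print(round(damped["tv"], 3))    # 2.386
print(damped["lg"])              # 1.0
```

Character ngrams tend to repeat many times within a text, so damping their counts keeps a few very frequent ngrams from dominating the feature vector.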
Documents / long texts
If you want to train a model to predict classes for documents or long texts like one or more paragraphs, then uncheck Use Characters.
Reasonable values for the word-based Max Ngram size are 2 to 3. Be sure to tune the Maximum Vocab Size parameter too. It's usually better to leave the advanced Sublinear TF option deactivated.
If the text is very long and Use Characters is checked, the job may take a lot of memory and possibly fail if the amount of memory requested by the job is not available. This may result in pods being evicted or failing with OOM errors. If you see this happening, try the following:
Uncheck Use Characters.
Reduce the vocabulary size and ngram range of the documents.
Allocate more memory to the pod.
If you train a model with the Logistic Regression algorithm, dimensionality reduction usually doesn't help, so it makes sense to leave Reduce Dimensionality unchecked. Scaling, however, tends to improve results, so activating Scale Features is recommended.
For models trained with the StarSpace algorithm, the opposite applies. Dimensionality reduction usually gives better results as well as much faster model training, while scaling usually doesn't help and might make results slightly worse.