This job analyzes how your existing documents are categorized and produces a classification model that can be used to predict the categories of new documents at index time. For detailed configuration instructions and examples, see Automatically classify new documents at index time or Automatically classify new queries.
You can predict the categories of new documents at index time by using the Classification job to analyze previously classified documents in your index and produce a training model, then referencing that model in the Machine Learning index stage.

Classification job dataflow (documents)

How to configure new document classification

  1. Sign in to Managed Fusion and click your application.
  2. Click Collections > Jobs > Add+ > Classification to create a new Classification job.
  3. In the Model Deployment Name field, enter an ID for the new classification model.
  4. In the Training Data Path field, enter the collection name or cloud storage path where your main content is stored.
  5. In the Training Data Format field, leave the default solr value if the Training Data Path is a collection. Otherwise, specify the format of your data in cloud storage.
  6. In the Training collection content field, enter the name of the field that contains the content to analyze. The content field you choose depends on your use case and the typical user query types. For example, you could choose the description field if users tend to make descriptive queries like “4k TV” or “soft waterproof jacket”. But if users are more likely to search for specific brands or products, such as “LG TV” or “North Face jacket”, then the product name field might be more suitable.
  7. In the Training collection class field, enter the name of the field that contains the category data.
    For additional configuration details, see Best practices below.
  8. Save the job.
  9. Specify the model’s name in the Machine Learning stage of your index pipeline.
  10. In the Model input transformation script field, enter the following:
/*
Name of the document field to feed into the model.
*/
var documentFeatureField = "body_t"

/*
Model input construction.
*/
var modelInput = new java.util.HashMap()
modelInput.put("text", doc.getFirstFieldValue(documentFeatureField))
modelInput
  11. In the Model output transformation script field, enter the following. For the raw response shape this script expects, see the note after this procedure.
// If top_k predictions are needed
var top1ClassField = "top_1_class_s"
var top1ScoreField = "top_1_score_d"
var topKClassesField = "top_k_classes_ss"
var topKScoresField = "top_k_scores_ds"

var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
var parsedOutput = {};
for (var i=0; i<jsonOutput["names"].length;i++){
  parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
}

doc.addField(top1ClassField, parsedOutput["top_1_class"][0])
doc.addField(top1ScoreField, parsedOutput["top_1_score"][0])
if ("top_k_classes" in parsedOutput) {
    doc.addField(topKClassesField, new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
    doc.addField(topKScoresField, new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
}
  12. Click Apply.
  13. Save the index pipeline.
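
The scripts above parse the model’s _rawJsonResponse. The exact payload depends on the deployed model, but based on the parsing loop it is assumed to look roughly like this (the values are illustrative only):

{
  "names":   ["top_1_class", "top_1_score", "top_k_classes", "top_k_scores"],
  "ndarray": [["electronics"], [0.87], [["electronics", "appliances"]], [[0.87, 0.09]]]
}

Each entry in names labels the array at the same position in ndarray, which is why the script builds parsedOutput by index and then reads element [0] of each array.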

Custom output transformation script example

var top1ClassField = "top_1_class_s"
var top1ScoreField = "top_1_score_d"

doc.addField(top1ClassField, modelOutput.get("top_1_class")[0])
doc.addField(top1ScoreField, modelOutput.get("top_1_score")[0])
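
The field names in these examples follow common Solr dynamic field conventions, which the default Fusion schema typically provides:

// Suffix conventions assumed by the examples above:
//   _s  -> single string        _d  -> single double
//   _ss -> multivalued strings  _ds -> multivalued doubles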

Best practices for configuring the Classification job

This job analyzes how your existing documents are categorized and produces a classification model that can be used to predict the categories of new documents at index time. The job takes raw text and an associated single class as input. Although it trains on single classes, there is an option to predict the top several classes with their scores. At a minimum, you must configure these:
  • An ID for this job
  • A Method; Logistic Regression is the default
  • A Model Deployment Name
  • The Training Collection
  • The Training collection content field, the document field containing the raw text
  • The Training collection class field containing the classes, labels, or other category data for the text
You can predict the categories most likely to satisfy a new query using this workflow:
  1. Use the Build Training Data job to join your signals data with your catalog data and produce training data in the form of query/class pairs.
  2. Use the Classification job to train a classification model using the output collection of the Build Training Data job as the training collection.
Query-time classification workflow

See the detailed steps below.

To predict the categories of new queries

  1. Navigate to Collections > Jobs > Add+ > Build Training Data to create a new Build Training Data job.
  2. Configure the job as follows:
    1. In the Catalog Path field, enter the collection name or cloud storage path where your main content is stored.
    2. In the Catalog Format field, enter solr if you are analyzing a Solr collection, or another format if your content is in the cloud.
    3. In the Signals Path field, enter the collection name or cloud storage path where your signals data is stored.
    4. In the Output Path field, enter the collection name or cloud storage path where you want to store the training data.
    5. In the Category Field in Catalog field, enter the field name for the category data in your main content.
    6. In the Item ID Field in Catalog field, enter the field name for the item IDs in your main content.
    7. Check that the values of Item ID Field in Signals and Count Field in Signals match the field names in your signals data.
  3. Save the job.
  4. Click Run > Start to run the job.
  5. Navigate to Collections > Jobs > Add+ > Classification to create a new Classification job.
  6. Configure the job as follows:
    1. In the Model Deployment Name field, enter an ID for the new classification model.
    2. In the Training Data Path field, enter the collection name or cloud storage path from the Build Training Data job’s Output Path field.
    3. In the Training Data Format field, leave the default solr value if the Training Data Path is a collection or if you used the default format in your Build Training Data job configuration. If you configured the Build Training Data job to output a different format, enter it here.
    4. In the Training collection content field, enter query_s, the default content field name in the Build Training Data job’s output.
    5. In the Training collection class field, enter category_s, the default category field name in the Build Training Data job’s output.
      For additional configuration details, see Best practices below.
  7. Save the job.
  8. Verify that the Build Training Data job has finished successfully.
  9. Click Run > Start to run the job.
  10. Navigate to Querying > Query Workbench > Load and select your query pipeline.
  11. Configure the query pipeline as follows:
    1. Add a new Machine Learning stage.
    2. In the Model ID field, enter the name from the Classification job’s Model Deployment Name field.
    3. In the Model input transformation script field, enter the following:
    var modelInput = new java.util.HashMap()
    modelInput.put("text", request.getFirstParam("q"))
    modelInput
    
    4. In the Model output transformation script field, enter the following:
    // If top_k predictions are needed
    // To put into response documents (can be done only after the Solr Query stage)
    var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
    var parsedOutput = {};
    for (var i=0; i<jsonOutput["names"].length;i++){
      parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
    }
    
    var docs = response.get().getInnerResponse().getDocuments();
    var ndocs = new java.util.ArrayList();
    for (var i=0; i<docs.length;i++){
      var doc = docs[i];
      doc.putField("top_1_class", parsedOutput["top_1_class"][0])
      doc.putField("top_1_score", parsedOutput["top_1_score"][0])
      if ("top_k_classes" in parsedOutput) {
        doc.putField("top_k_classes", new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
        doc.putField("top_k_scores", new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
      }
      ndocs.add(doc);
    }
    response.get().getInnerResponse().updateDocuments(ndocs);
    
    5. Click Apply.
  12. Save the query pipeline.

Custom output transformation script examples

// To put into request
request.putSingleParam("class", modelOutput.get("top_1_class")[0])
request.putSingleParam("score", modelOutput.get("top_1_score")[0])

// Or, for example, to apply a filter query
request.putSingleParam("fq", "class:" + modelOutput.get("top_1_class")[0])

// To put into query context
context.put("class", modelOutput.get("top_1_class")[0])
context.put("score", modelOutput.get("top_1_score")[0])

// To put into response documents (can be done only after the Solr Query stage)
var docs = response.get().getInnerResponse().getDocuments();
var ndocs = new java.util.ArrayList();

for (var i=0; i<docs.length;i++){
  var doc = docs[i];
  doc.putField("query_class", modelOutput.get("top_1_class")[0])
  doc.putField("query_score", modelOutput.get("top_1_score")[0])
  ndocs.add(doc);
}

response.get().getInnerResponse().updateDocuments(ndocs);
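
One caveat on the filter-query example above: if class values can contain spaces or other Solr special characters, quote the value before using it in fq. A defensive sketch:

// Quote the class value so multi-word categories filter correctly
request.putSingleParam("fq", "class:\"" + modelOutput.get("top_1_class")[0] + "\"")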

Best practices for configuring the Classification job

This job analyzes how your existing documents are categorized and produces a classification model that can be used to predict the categories of new documents at index time. In addition to the information in this topic, see Classify new documents at index time for configuration and examples. The job takes raw text and an associated single class as input. Although it trains on single classes, there is an option to predict the top several classes with their scores. At a minimum, you must configure these:
  • An ID for this job
  • A Method; Logistic Regression is the default
  • A Model Deployment Name
  • The Training Collection
  • The Training collection content field, the document field containing the raw text
  • The Training collection class field containing the classes, labels, or other category data for the text
LucidAcademy

Lucidworks offers free training to help you get started. The Course for Classification focuses on understanding the different classifier models in Fusion. Visit the LucidAcademy to see the full training catalog.

Classification at index time

Used at index time, a classification model can be applied to predict the categories of new, incoming documents. To train a model for this use case, use your main content collection as the training collection. The model requires at least 100 examples in the training data for each category predicted.
Classification job dataflow (documents)

Once you have run the job, you can specify the model name in the Machine Learning stage of your index pipeline.

Job flow

The first part of the job is vectorization, which is the same for all available classification algorithms. It supports two types of featurization, sketched below:
  • Character-based - for queries or short texts, like document titles, sentences, and so on.
  • Word-based - for long texts like paragraphs, documents, and so on.
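To make the distinction concrete, here is an illustrative sketch of the two featurization styles. This code is not part of the job; it only shows what character and word ngrams look like:

// Illustrative only: character ngrams vs. word ngrams.
function charNgrams(text, n) {
  var grams = [];
  for (var i = 0; i + n <= text.length; i++) {
    grams.push(text.substring(i, i + n));
  }
  return grams;
}

function wordNgrams(text, n) {
  var words = text.split(/\s+/);
  var grams = [];
  for (var i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(" "));
  }
  return grams;
}

charNgrams("4k tv", 3)                   // ["4k ", "k t", " tv"]
wordNgrams("soft waterproof jacket", 2)  // ["soft waterproof", "waterproof jacket"]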
The second part is classification algorithms:
  • Logistic Regression. A classical algorithm with a good trade-off between training speed and result quality. It provides a robust baseline out of the box; consider using it as a first choice.
  • StarSpace. A deep learning algorithm that jointly trains to maximize similarity between text and its correct class and minimize similarity between text and incorrect classes. It usually requires more tuning and training time, but can produce more accurate results. Consider trying it, and then tuning it, if better results are needed.
The third part of the job deploys the new classification model to Managed Fusion using Seldon Core.

Best practices

These tips describe how to tune the options under Vectorization Parameters for best results with different use cases.

Query intent / short texts

If you want to train a model to predict query intents or to do short text classification, enable Use Characters. Another vectorization parameter that can improve model quality is Max Ngram size, with reasonable defaults between 3 and 5. The more character ngrams are used, the bigger the vocabulary becomes, so it is worthwhile to tune the Maximum Vocab Size parameter, which controls how many unique tokens are used. Lower values make training faster and prevent overfitting, but might reduce quality, so it is important to find a good balance. Activating the advanced Sublinear TF option usually helps when characters are used.
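
For reference, sublinear TF replaces a term’s raw count with a dampened value, so very frequent character ngrams do not dominate. A common formulation (as in scikit-learn’s TfidfVectorizer; assumed here) is:

// Common sublinear TF scaling (assumed formulation):
// tf_sublinear = 1 + ln(tf), for tf > 0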

Documents / long texts

If you want to train a model to predict classes for documents or long texts, such as one or more paragraphs, uncheck Use Characters. Reasonable values for the word-based Max Ngram size are 2-3. Be sure to tune the Maximum Vocab Size parameter too. It is usually better to leave the advanced Sublinear TF option deactivated.

Performance tuning

If the text is very long and Use Characters is checked, the job may take a lot of memory and possibly fail if the amount of memory requested by the job is not available. This may result in pods being evicted or failing with OOM errors. If you see this happening, try the following:
  • Uncheck Use Characters.
  • Reduce the vocabulary size and ngram range of the documents.
  • Allocate more memory to the pod.

Algorithm-specific

If you train a model with the Logistic Regression algorithm, dimensionality reduction usually does not help, so it makes sense to leave Reduce Dimensionality unchecked. Scaling, however, tends to improve results, so it is suggested to activate Scale Features. For models trained with the StarSpace algorithm, the opposite applies: dimensionality reduction usually produces better results and much faster model training, but scaling usually does not help and might make results slightly worse.

Index pipeline configuration

Model input transformation script
/*
Name of the document field to feed into the model.
*/
var documentFeatureField = "body_t"

/*
Model input construction.
*/
var modelInput = new java.util.HashMap()
modelInput.put("text", doc.getFirstFieldValue(documentFeatureField))
modelInput
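
If the text you want to classify spans more than one field, one option is to concatenate fields before sending them to the model. This is a sketch; the field names (title_t, body_t) are hypothetical:

/*
Hypothetical variant: concatenate several document fields
into a single model input. Substitute your own field names.
*/
var modelInput = new java.util.HashMap()
var title = doc.getFirstFieldValue("title_t")
var body = doc.getFirstFieldValue("body_t")
modelInput.put("text", (title ? title + " " : "") + (body ? body : ""))
modelInput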
Model output transformation script
var top1ClassField = "top_1_class_s"
var top1ScoreField = "top_1_score_d"

doc.addField(top1ClassField, modelOutput.get("top_1_class")[0])
doc.addField(top1ScoreField, modelOutput.get("top_1_score")[0])

// If top_k predictions are needed
var top1ClassField = "top_1_class_s"
var top1ScoreField = "top_1_score_d"
var topKClassesField = "top_k_classes_ss"
var topKScoresField = "top_k_scores_ds"

var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
var parsedOutput = {};
for (var i=0; i<jsonOutput["names"].length;i++){
  parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
}

doc.addField(top1ClassField, parsedOutput["top_1_class"][0])
doc.addField(top1ScoreField, parsedOutput["top_1_score"][0])
if ("top_k_classes" in parsedOutput) {
    doc.addField(topKClassesField, new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
    doc.addField(topKScoresField, new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
}

Query pipeline configuration

Model input transformation script
var modelInput = new java.util.HashMap()
modelInput.put("text", request.getFirstParam("q"))
modelInput
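
If the q parameter can be absent (for example, on browse requests), a defensive variant of the input script falls back to an empty string. This is a sketch, assuming getFirstParam returns null for a missing parameter:

// Defensive sketch: avoid passing null to the model when q is missing
var modelInput = new java.util.HashMap()
var q = request.getFirstParam("q")
modelInput.put("text", q != null ? q : "")
modelInput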
Model output transformation script
// To put into request
request.putSingleParam("class", modelOutput.get("top_1_class")[0])
request.putSingleParam("score", modelOutput.get("top_1_score")[0])

// Or, for example, to apply a filter query
request.putSingleParam("fq", "class:" + modelOutput.get("top_1_class")[0])
// To put into query context
context.put("class", modelOutput.get("top_1_class")[0])
context.put("score", modelOutput.get("top_1_score")[0])
// To put into response documents (can be done only after the Solr Query stage)
var docs = response.get().getInnerResponse().getDocuments();
var ndocs = new java.util.ArrayList();

for (var i=0; i<docs.length;i++){
  var doc = docs[i];
  doc.putField("query_class", modelOutput.get("top_1_class")[0])
  doc.putField("query_score", modelOutput.get("top_1_score")[0])
  ndocs.add(doc);
}

response.get().getInnerResponse().updateDocuments(ndocs);

// If top_k predictions are needed
// To put into response documents (can be done only after the Solr Query stage)
var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
var parsedOutput = {};
for (var i=0; i<jsonOutput["names"].length;i++){
  parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
}

var docs = response.get().getInnerResponse().getDocuments();
var ndocs = new java.util.ArrayList();
for (var i=0; i<docs.length;i++){
  var doc = docs[i];
  doc.putField("top_1_class", parsedOutput["top_1_class"][0])
  doc.putField("top_1_score", parsedOutput["top_1_score"][0])
  if ("top_k_classes" in parsedOutput) {
    doc.putField("top_k_classes", new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
    doc.putField("top_k_scores", new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
  }
  ndocs.add(doc);
}
response.get().getInnerResponse().updateDocuments(ndocs);
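
After this stage runs, each response document carries the prediction fields. A sketch of the result, with illustrative values only:

// Example of fields added to a response document (values illustrative):
//   top_1_class:   "electronics"
//   top_1_score:   0.87
//   top_k_classes: ["electronics", "appliances"]
//   top_k_scores:  [0.87, 0.09]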

Configuration properties
