Build Training Data Jobs

Use this job to build training data for query classification by joining signals data with catalog data. The output of this job can be used as input for the Classification job, which analyzes how documents are categorized and generates a model. That model can then be used to predict categories of new documents when they are indexed. The Build Training Data job can be configured in Collections > Jobs in your Fusion UI instance. Enter the following information:

Spark Job ID used by the API to reference the job.
Location where your signals are stored, the Spark-compatible format for the signals, and filters for the query.
Location where your content catalog is stored and the Spark-compatible format of the catalog data.
Location and Spark-compatible format of the job output.
Field names for the query string, the category and item from the catalog, the signals item ID, and the signals count field.
Style of the text analyzer you want to use for the job.
Proportion of the top category relative to all categories, and the minimum count required for a query/category pair.

For detailed configuration steps, see Automatically Classify New Queries. That process lets you predict the categories that are most likely to satisfy a given query.

Automatically Classify New Queries

You can predict the categories most likely to satisfy a new query using this workflow:

Use the Build Training Data job to join your signals data with your catalog data and produce training data in the form of query/class pairs.
Use the Classification job to train a classification model using the output collection of the Build Training Data job as the training collection.
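The join in the first step can be pictured with a small in-memory sketch. It is an illustration only, not the Spark job's implementation: the sample records, the `buildTrainingPairs` helper, the `count` output field, and the inner-join and minimum-count behavior are all assumptions. Only the `query_s` and `category_s` output field names come from this page.

```javascript
// Hypothetical sample data; the input field names are illustrative, not Fusion defaults.
const signals = [
  { query: "4k tv", doc_id: "p1", count: 12 },
  { query: "4k tv", doc_id: "p2", count: 3 },
  { query: "rain jacket", doc_id: "p3", count: 7 },
  { query: "rain jacket", doc_id: "p9", count: 2 }, // no matching catalog item
];
const catalog = [
  { id: "p1", category: "Televisions" },
  { id: "p2", category: "Televisions" },
  { id: "p3", category: "Outerwear" },
];

// Join signals to the catalog on item ID, then sum counts per query/category pair.
// minPairCount is an assumed threshold that drops rarely seen pairs.
function buildTrainingPairs(signalRows, catalogRows, minPairCount) {
  const categoryById = new Map(catalogRows.map((item) => [item.id, item.category]));
  const totals = new Map();
  for (const signal of signalRows) {
    const category = categoryById.get(signal.doc_id);
    if (category === undefined) continue; // assumed inner join: skip unmatched signals
    const key = JSON.stringify([signal.query, category]);
    totals.set(key, (totals.get(key) || 0) + signal.count);
  }
  const pairs = [];
  for (const [key, count] of totals) {
    if (count < minPairCount) continue;
    const [query, category] = JSON.parse(key);
    pairs.push({ query_s: query, category_s: category, count: count });
  }
  return pairs;
}

const trainingPairs = buildTrainingPairs(signals, catalog, 5);
console.log(trainingPairs);
```

Each resulting row pairs a query string with one category, which is the shape the Classification job expects in its content and class fields.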

Query-time classification workflow

See the detailed steps below.

To predict the categories of new queries

Navigate to Collections > Jobs > Add+ > Build Training Data to create a new Build Training Data job.
Configure the job as follows:
1. In the Catalog Path field, enter the collection name or cloud storage path where your main content is stored.
2. In the Catalog Format field, enter solr if you are analyzing a Solr collection, or another format if your content is in the cloud.
3. In the Signals Path field, enter the collection name or cloud storage path where your signals data is stored.
4. In the Output Path field, enter the collection name or cloud storage path where you want to store the training data.
5. In the Category Field in Catalog field, enter the field name for the category data in your main content.
6. In the Item ID Field in Catalog field, enter the field name for the item IDs in your main content.
7. Check that the values of Item ID Field in Signals and Count Field in Signals match the field names in your signals data.
Save the job.
Click Run > Start to run the job.
Navigate to Collections > Jobs > Add+ > Classification to create a new Classification job.
Configure the job as follows:
1. In the Model Deployment Name field, enter an ID for the new classification model.
2. In the Training Data Path field, enter the collection name or cloud storage path from the Build Training Data job’s Output Path field.
3. In the Training Data Format field, leave the default solr value if the Training Data Path is a collection or if you used the default format in your Build Training Data job configuration. If you configured the Build Training Data job to output a different format, enter it here.
4. In the Training collection content field, enter query_s, the default content field name in the Build Training Data job’s output.
5. In the Training collection class field, enter category_s, the default category field name in the Build Training Data job’s output.
  For additional configuration details, see Best practices below.
Save the job.
Verify that the Build Training Data job has finished successfully.
Click Run > Start to run the job.
Navigate to Querying > Query Workbench > Load and select your query pipeline.

Configure the query pipeline as follows:

Add a new Machine Learning stage.
In the Model ID field, enter the name from the Classification job’s Model Deployment Name field.
In the Model input transformation script field, enter the following:

var modelInput = new java.util.HashMap()
modelInput.put("text", request.getFirstParam("q"))
modelInput

In the Model output transformation script field, enter the following:

// Use this script when top_k predictions are also needed.
// It writes predictions into the response documents, so the stage must run after the Solr Query stage.
var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
var parsedOutput = {};
for (var i=0; i<jsonOutput["names"].length;i++){
  parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
}

var docs = response.get().getInnerResponse().getDocuments();
var ndocs = new java.util.ArrayList();
for (var i=0; i<docs.length;i++){
  var doc = docs[i];
  doc.putField("top_1_class", parsedOutput["top_1_class"][0])
  doc.putField("top_1_score", parsedOutput["top_1_score"][0])
  if ("top_k_classes" in parsedOutput) {
    doc.putField("top_k_classes", new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
    doc.putField("top_k_scores", new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
  }
  ndocs.add(doc);
}
response.get().getInnerResponse().updateDocuments(ndocs);

Click Apply.

Save the query pipeline.
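The parsing loop in the output transformation script can be tried outside Fusion. The sketch below runs the same names-to-values mapping against a made-up payload. The payload shape is an assumption inferred from how the script indexes `names` and `ndarray`; the sample values and the `parseModelOutput` helper name are hypothetical.

```javascript
// Hypothetical raw response; real values depend on the deployed model.
const rawJsonResponse = JSON.stringify({
  names: ["top_1_class", "top_1_score", "top_k_classes", "top_k_scores"],
  ndarray: [
    ["Televisions"],
    [0.91],
    [["Televisions", "Monitors"]],
    [[0.91, 0.06]],
  ],
});

// Pair each entry in "names" with the entry at the same index in "ndarray".
function parseModelOutput(raw) {
  const jsonOutput = JSON.parse(raw);
  const parsed = {};
  jsonOutput.names.forEach(function (name, i) {
    parsed[name] = jsonOutput.ndarray[i];
  });
  return parsed;
}

const parsedSample = parseModelOutput(rawJsonResponse);
console.log(parsedSample.top_1_class[0]);   // the single best class
console.log(parsedSample.top_k_classes[0]); // the list of top classes
```

Under this assumed shape, index `[0]` selects the prediction for the one query that was sent to the model, which is why the stage script reads `parsedOutput["top_1_class"][0]`.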

Custom output transformation script examples

// Put the prediction into the request
request.putSingleParam("class", modelOutput.get("top_1_class")[0])
request.putSingleParam("score", modelOutput.get("top_1_score")[0])

// Or, for example, apply it as a filter query
request.putSingleParam("fq", "class:" + modelOutput.get("top_1_class")[0])

// Put the prediction into the query context
context.put("class", modelOutput.get("top_1_class")[0])
context.put("score", modelOutput.get("top_1_score")[0])

// Put the prediction into the response documents (only possible after the Solr Query stage)
var docs = response.get().getInnerResponse().getDocuments();
var ndocs = new java.util.ArrayList();

for (var i=0; i<docs.length;i++){
  var doc = docs[i];
  doc.putField("query_class", modelOutput.get("top_1_class")[0])
  doc.putField("query_score", modelOutput.get("top_1_score")[0])
  ndocs.add(doc);
}

response.get().getInnerResponse().updateDocuments(ndocs);
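When the predicted class is used in a filter query, a class name that contains spaces or quotes may not parse as intended if it is concatenated directly. One possible approach is to wrap the value in a quoted phrase. The `buildClassFilter` helper below is hypothetical and not part of Fusion, and the `class` field name simply mirrors the example above.

```javascript
// Build a filter query that matches the predicted class as an exact phrase.
// Written with "var" so the function body can be pasted into a stage script.
function buildClassFilter(field, value) {
  // Escape backslashes first, then double quotes, so the quoted phrase stays intact.
  var escaped = String(value).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return field + ':"' + escaped + '"';
}

console.log(buildClassFilter("class", "Home Audio"));   // class:"Home Audio"
console.log(buildClassFilter("class", '27" Monitors')); // class:"27\" Monitors"
```

In a stage script, the result would replace the concatenated string, for example `request.putSingleParam("fq", buildClassFilter("class", modelOutput.get("top_1_class")[0]))`.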

Best practices for configuring the Classification job

Automatically Classify New Documents at Index Time

You can predict the categories of new documents at index time by using the Classification job to analyze previously classified documents in your index and produce a training model, then referencing the model in the Machine Learning index stage.

Classification job dataflow (documents)

How to configure new document classification

Navigate to Collections > Jobs > Add+ > Classification to create a new Classification job.
Configure the job as follows:
1. In the Model Deployment Name field, enter an ID for the new classification model.
2. In the Training Data Path field, enter the collection name or cloud storage path where your main content is stored.
3. In the Training Data Format field, leave the default solr value if the Training Data Path is a collection. Otherwise, specify the format of your data in cloud storage.
4. In the Training collection content field, enter the name of the field that contains the content to analyze.
  The content field that you choose depends on your use case and the types of queries that your users commonly make.
  For example, you could choose the description field if users tend to make descriptive queries like “4k TV” or “soft waterproof jacket”.
  But if users are more likely to search for specific brands or products, such as “LG TV” or “North Face jacket”, then the product name field might be more suitable.
5. In the Training collection class field, enter the name of the field that contains the category data.
  For additional configuration details, see Best practices below.
Save the job.
Specify the model’s name in the Machine Learning stage of your index pipeline.

In the Model input transformation script field, enter the following:

/*
Name of the document field to feed into the model.
*/
var documentFeatureField = "body_t"

/*
Model input construction.
*/
var modelInput = new java.util.HashMap()
modelInput.put("text", doc.getFirstFieldValue(documentFeatureField))
modelInput

In the Model output transformation script field, enter the following:

// Use this script when top_k predictions are also needed.
var top1ClassField = "top_1_class_s"
var top1ScoreField = "top_1_score_d"
var topKClassesField = "top_k_classes_ss"
var topKScoresField = "top_k_scores_ds"

var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
var parsedOutput = {};
for (var i=0; i<jsonOutput["names"].length;i++){
  parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
}

doc.addField(top1ClassField, parsedOutput["top_1_class"][0])
doc.addField(top1ScoreField, parsedOutput["top_1_score"][0])
if ("top_k_classes" in parsedOutput) {
    doc.addField(topKClassesField, new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
    doc.addField(topKScoresField, new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
}

Click Apply.

Save the index pipeline.
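To see which fields the index-time output script adds, the sketch below replays its logic against a stand-in document. The stub object models only `addField` and is not Fusion's document class; the sample predictions and the `makeStubDoc` and `applyPrediction` names are hypothetical. The field suffixes follow the script above, where `_s` and `_d` hold single values and `_ss` and `_ds` hold lists.

```javascript
// Stand-in for the pipeline document; only addField is modeled here.
function makeStubDoc() {
  const fields = {};
  return {
    fields: fields,
    addField: function (name, value) {
      fields[name] = value;
    },
  };
}

// Mirror of the output script: always add the top class and score,
// and add the top_k lists only when the model returned them.
function applyPrediction(doc, output) {
  doc.addField("top_1_class_s", output.top_1_class[0]);
  doc.addField("top_1_score_d", output.top_1_score[0]);
  if ("top_k_classes" in output) {
    doc.addField("top_k_classes_ss", output.top_k_classes[0]);
    doc.addField("top_k_scores_ds", output.top_k_scores[0]);
  }
  return doc;
}

// Hypothetical parsed outputs, with and without top_k predictions.
const withTopK = applyPrediction(makeStubDoc(), {
  top_1_class: ["Outerwear"],
  top_1_score: [0.83],
  top_k_classes: [["Outerwear", "Footwear"]],
  top_k_scores: [[0.83, 0.11]],
});
const withoutTopK = applyPrediction(makeStubDoc(), {
  top_1_class: ["Outerwear"],
  top_1_score: [0.83],
});

console.log(Object.keys(withTopK.fields));
console.log(Object.keys(withoutTopK.fields));
```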

Custom output transformation script example

var top1ClassField = "top_1_class_s"
var top1ScoreField = "top_1_score_d"

doc.addField(top1ClassField, modelOutput.get("top_1_class")[0])
doc.addField(top1ScoreField, modelOutput.get("top_1_score")[0])

Best practices for configuring the Classification job

This job analyzes how your existing documents are categorized and produces a classification model that can be used to predict the categories of new documents at index time.

In addition to the information in this topic, see Automatically Classify New Queries for configuration information and examples.

This job takes raw text and an associated single class as input. Although it trains on single classes, there is an option to predict the top several classes with their scores.

At a minimum, you must configure these:

An ID for this job
A Method; Logistic Regression is the default
A Model Deployment Name
The Training Collection
The Training collection content field, the document field containing the raw text
The Training collection class field containing the classes, labels, or other category data for the text
