Fusion 5.12

    Build Training Data Jobs

    Use this job to build training data for query classification by joining signals data with catalog data.

    The output of this job can be used as input for the Classification job, which analyzes how documents are categorized and generates a model. That model can then be used to predict categories of new documents when they are indexed.

    The Build Training Data job can be configured in Collections > Jobs in your Fusion UI instance.

    Enter the following information:

    • Spark Job ID used by the API to reference the job.

    • Location where your signals are stored, the Spark-compatible format for the signals, and filters for the query.

    • Location where your content catalog is stored and the Spark-compatible format of the catalog data.

    • Location and Spark-compatible format of the job output.

    • Field names for the query string, the category and item from the catalog, the signals item ID, and the signals count field.

    • Style of the text analyzer you want to use for the job.

    • Top category proportion relative to all categories, and the minimum count required for a query/category pair.

    For detailed configuration steps, see Classify New Queries. That process lets you predict the categories that are most likely to be returned successfully in a query.
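
    For reference, the sketch below shows roughly what a complete job configuration might look like when submitted through the API. Every collection and field value is a placeholder chosen for illustration; only the property names themselves come from the reference that follows.

        # A minimal sketch of a Build Training Data job configuration.
        # All collection names and field values here are placeholders.
        job_config = {
            "type": "build-training",
            "id": "build-training-data",
            "signalsPath": "my_signals_aggr",    # where signals are stored
            "dataFormat": "solr",                # Spark-compatible signals format
            "trainingDataFilterQuery": "*:*",    # filter for the signals query
            "catalogPath": "my_catalog",         # where the content catalog is stored
            "catalogFormat": "solr",
            "outputPath": "my_training_data",    # job output location
            "dataOutputFormat": "solr",
            "fieldToVectorize": "query_s",       # query string field
            "categoryField": "category_s",       # category field in the catalog
            "catalogIdField": "item_id_s",       # item ID field in the catalog
            "itemIdField": "doc_id_s",           # item ID field in the signals
            "countField": "aggr_count_i",        # signals count field
            "topCategoryProportion": 0.5,
            "topCategoryThreshold": 1,
        }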

    Use this job to build training data for query classification by joining signals with catalog data. The following properties configure the job:

    id - string (required)

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
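
    As a convenience, the same constraints can be checked locally before submitting a job. This is not part of Fusion, just a quick validation sketch in Python using the pattern and length limit above.

        import re

        # Pattern and maximum length from the `id` property above.
        ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")

        def is_valid_job_id(job_id: str) -> bool:
            return len(job_id) <= 63 and ID_PATTERN.fullmatch(job_id) is not None

        assert is_valid_job_id("build-training-data")
        assert not is_valid_job_id("1-starts-with-digit")  # must start with a letter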

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes:
      key (required):
        display name: Parameter Name
        type: string
      value:
        display name: Parameter Value
        type: string
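
    For example, sparkConfig entries might look like the following; the parameter names are standard Spark settings, chosen only for illustration.

        # Hypothetical sparkConfig entries, one key/value pair per setting.
        spark_config = [
            {"key": "spark.executor.memory", "value": "4g"},
            {"key": "spark.sql.shuffle.partitions", "value": "200"},
        ]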

    fieldToVectorize - string (required)

    Field containing query strings.

    >= 1 character

    Default: query_s

    dataFormat - string

    Spark-compatible format that contains training data (such as 'solr', 'parquet', or 'orc').

    >= 1 character

    Default: solr

    trainingDataFrameConfigOptions - object

    Additional Spark dataframe loading configuration options.

    trainingDataFilterQuery - string

    Solr query to additionally filter signals. For non-Solr data sources, use SPARK SQL FILTER QUERY under Advanced to filter results.

    Default: *:*

    sparkSQL - string

    Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input.

    Default: SELECT * from spark_input
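
    For instance, a hypothetical filter that keeps only click signals could look like this; 'type_s' is an assumed column name, not a job default.

        # Filter the registered spark_input view down to click signals.
        spark_sql = "SELECT * FROM spark_input WHERE type_s = 'click'"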

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    For any deterministic pseudorandom number generation

    Default: 1234
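
    Together, these two settings behave like Spark's own deterministic sampling. A minimal sketch, assuming PySpark is available:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # With fraction 0.5 and a fixed seed, roughly half the rows are
        # drawn, and the same rows are drawn on every run.
        sample = spark.range(100).sample(fraction=0.5, seed=1234)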

    dataOutputFormat - string

    Spark-compatible output format (such as 'solr' or 'parquet').

    >= 1 character

    Default: solr

    partitionCols - string

    If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes:
      key (required):
        display name: Parameter Name
        type: string
      value:
        display name: Parameter Value
        type: string

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:
      key (required):
        display name: Parameter Name
        type: string
      value:
        display name: Parameter Value
        type: string
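
    As an illustration, read and write options use the same key/value shape as sparkConfig. The entries below are standard Spark Parquet options and are only examples, not defaults.

        # Hypothetical option entries for a Parquet source and sink.
        read_options = [{"key": "mergeSchema", "value": "true"}]
        write_options = [{"key": "compression", "value": "snappy"}]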

    catalogPath - string (required)

    Catalog collection or cloud storage path which contains item categories.

    catalogFormat - string (required)

    Spark-compatible format that contains catalog data (such as 'solr', 'parquet', or 'orc').

    signalsPath - string (required)

    Signals collection or cloud storage path which contains the signals data.

    outputPath - string (required)

    Output collection or cloud storage path where the training data is written.

    categoryField - string (required)

    Item category field in catalog.

    catalogIdField - string (required)

    Item ID field in catalog, which will be used to join with signals.

    itemIdField - string (required)

    Item ID field in signals, which will be used to join with catalog.

    Default: doc_id_s

    countField - string (required)

    Count field in raw or aggregated signals.

    Default: aggr_count_i
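
    Conceptually, the four fields above drive a join and aggregation along the lines of the toy PySpark sketch below. Column names follow the job defaults where they exist (query_s, doc_id_s, aggr_count_i); 'item_id' and 'category' are stand-ins for your catalogIdField and categoryField. This approximates the behavior and is not the job's actual implementation.

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Toy stand-ins for the signals and catalog inputs.
        signals = spark.createDataFrame(
            [("red shoes", "item-1", 12), ("red shoes", "item-2", 3)],
            ["query_s", "doc_id_s", "aggr_count_i"],
        )
        catalog = spark.createDataFrame(
            [("item-1", "Footwear"), ("item-2", "Apparel")],
            ["item_id", "category"],
        )

        # Join signals to catalog (itemIdField == catalogIdField), then
        # aggregate counts per (query, category) pair.
        training = (
            signals.join(catalog, signals["doc_id_s"] == catalog["item_id"])
            .groupBy("query_s", "category")
            .agg(F.sum("aggr_count_i").alias("count"))
        )
        training.show()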

    topCategoryProportion - number

    Proportion that the top category must have among all categories for a query.

    Default: 0.5

    topCategoryThreshold - integer

    Minimum count for a query/category pair to be included.

    >= 1

    exclusiveMinimum: false

    Default: 1
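
    A rough sketch of how these two thresholds might prune (query, category) pairs, reusing the shape of the 'training' dataframe from the join sketch above; again an approximation, not the job's actual code.

        from pyspark.sql import SparkSession, Window, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Toy (query, category, count) rows, e.g. output of the join above.
        training = spark.createDataFrame(
            [("red shoes", "Footwear", 12), ("red shoes", "Apparel", 3)],
            ["query_s", "category", "count"],
        )

        w = Window.partitionBy("query_s")
        filtered = (
            training
            .withColumn("total", F.sum("count").over(w))
            .withColumn("top", F.max("count").over(w))
            .where(
                (F.col("count") == F.col("top"))            # keep each query's top category
                & (F.col("count") / F.col("total") >= 0.5)  # topCategoryProportion
                & (F.col("count") >= 1)                     # topCategoryThreshold
            )
        )
        filtered.show()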

    analyzerConfig - string (required)

    The style of text analyzer you would like to use.

    Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}

    type - string (required)

    Default: build-training

    Allowed values: build-training