Use this job to build training data for query classification by joining signals with the catalog.
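For orientation, here is a minimal sketch of a job configuration using the fields documented below. The collection names (my_catalog, my_signals, query_classifier_training) and the category field name are hypothetical placeholders, and fields with documented defaults (such as analyzerConfig) are omitted here so they fall back to those defaults:

{
  "id": "build-query-classifier-training",
  "type": "build-training",
  "fieldToVectorize": "query_s",
  "catalogPath": "my_catalog",
  "catalogFormat": "solr",
  "signalsPath": "my_signals",
  "outputPath": "query_classifier_training",
  "categoryField": "category_s",
  "catalogIdField": "id",
  "itemIdField": "doc_id_s",
  "countField": "aggr_count_i"
}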
id - string required
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
sparkConfig - array[object]
Spark configuration settings.
object attributes:
key - string required
Display name: Parameter Name
value - string
Display name: Parameter Value
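As an illustration, each sparkConfig entry is a key/value pair holding a standard Spark property; the tuning values below are arbitrary examples, not recommendations:

"sparkConfig": [
  { "key": "spark.executor.memory", "value": "4g" },
  { "key": "spark.sql.shuffle.partitions", "value": "200" }
]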
fieldToVectorize - string required
Field containing query strings.
>= 1 characters
Default: query_s
dataFormat - string
Spark-compatible format of the training data (e.g. 'solr', 'parquet', 'orc').
>= 1 characters
Default: solr
trainingDataFrameConfigOptions - object
Additional Spark DataFrame loading configuration options.
trainingDataFilterQuery - string
Solr query to further filter the signals. For non-Solr data sources, use the Spark SQL filter query (under Advanced) to filter results.
Default: *:*
sparkSQL - string
Use this field to write a Spark SQL query that filters the input data. The input data is registered as the table spark_input.
Default: SELECT * from spark_input
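For example, either field can narrow the signals before training; the type_s field and the 'click' value below are hypothetical:

"trainingDataFilterQuery": "type_s:click",
"sparkSQL": "SELECT * FROM spark_input WHERE type_s = 'click'"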
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
randomSeed - integer
Seed used for deterministic pseudorandom number generation.
Default: 1234
dataOutputFormat - string
Spark-compatible output format (e.g. 'solr', 'parquet').
>= 1 characters
Default: solr
partitionCols - string
When writing to non-Solr sources, a comma-delimited list of column names used to partition the DataFrame before writing to the external output.
writeOptions - array[object]
Options used when writing output to Solr or other sources
object attributes:
key - string required
Display name: Parameter Name
value - string
Display name: Parameter Value
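A sketch of writeOptions, assuming Parquet output; 'compression' is a standard Spark Parquet write option:

"dataOutputFormat": "parquet",
"writeOptions": [
  { "key": "compression", "value": "snappy" }
]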
readOptions - array[object]
Options used when reading input from Solr or other sources.
object attributes:
key - string required
Display name: Parameter Name
value - string
Display name: Parameter Value
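Similarly, a sketch of readOptions assuming signals are read from Solr; 'splits' and 'rows' are spark-solr connector options and should be verified against your connector version:

"readOptions": [
  { "key": "splits", "value": "true" },
  { "key": "rows", "value": "10000" }
]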
catalogPath - string required
Catalog collection or cloud storage path containing item categories.
catalogFormat - string required
Spark-compatible format of the catalog data (e.g. 'solr', 'parquet', 'orc').
signalsPath - string required
Signals collection or cloud storage path containing the signals to join with the catalog.
outputPath - string required
Output collection or cloud storage path where the generated training data is written.
categoryField - string required
Item category field in catalog.
catalogIdField - string required
Item ID field in the catalog, used to join with signals.
itemIdField - string required
Item ID field in the signals, used to join with the catalog.
Default: doc_id_s
countField - string required
Count field in raw or aggregated signals.
Default: aggr_count_i
topCategoryProportion - number
Minimum proportion that the top category must represent among all categories for a query.
Default: 0.5
topCategoryThreshold - integer
Minimum count for a (query, category) pair to be kept.
>= 1
exclusiveMinimum: false
Default: 1
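As a worked example with hypothetical counts: if the query 'laptop' joins to Electronics 60 times and Office 40 times, the top category (Electronics) has proportion 60 / (60 + 40) = 0.6 and a pair count of 60. Under the settings below it would be kept, since 0.6 >= 0.5 and 60 >= 5:

"topCategoryProportion": 0.5,
"topCategoryThreshold": 5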
analyzerConfig - string required
The style of text analyzer to use.
Default:
{
  "analyzers": [
    {
      "name": "StdTokLowerStop",
      "charFilters": [{ "type": "htmlstrip" }],
      "tokenizer": { "type": "standard" },
      "filters": [{ "type": "lowercase" }]
    }
  ],
  "fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" }]
}
type - string required
Default: build-training
Allowed values: build-training