Fusion 5.12

    Build Training Data
    Job configuration specifications

    Use this job to build training data for query classification by joining signals data with catalog data.

    The output of this job can be used as input for the Classification job, which analyzes how documents are categorized and generates a model. That model can then be used to predict categories of new documents when they are indexed.

    The Build Training Data job can be configured in Collections > Jobs in your Managed Fusion UI instance.

    Enter the following information:

    • Spark Job ID used by the API to reference the job.

    • Location where your signals are stored, the Spark-compatible format for the signals, and filters for the query.

    • Location where your content catalog is stored and the Spark-compatible format of the catalog data.

    • Location and Spark-compatible format of the job output.

    • Field names for the query string, the category and item from the catalog, the signals item ID, and the signals count field.

    • Style of the text analyzer you want to use for the job.

    • Minimum proportion that the top category must represent among all categories, and the minimum count required for each query/category pair.

    For detailed configuration steps, see Classify New Queries. That process lets you predict the categories that are most likely to be returned successfully in a query.
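    As a sketch, a complete job configuration might look like the JSON below, using only the properties documented in the reference that follows. The collection names, paths, and field names (such as my_app_catalog and category_s) are placeholders for illustration, not values from a real environment:

        {
          "id": "build-training-data",
          "type": "build-training",
          "fieldToVectorize": "query_s",
          "dataFormat": "solr",
          "signalsPath": "my_app_signals_aggr",
          "catalogPath": "my_app_catalog",
          "outputPath": "my_app_training_data",
          "categoryField": "category_s",
          "catalogIdField": "id",
          "itemIdField": "doc_id_s",
          "countField": "aggr_count_i",
          "topCategoryProportion": 0.5,
          "topCategoryThreshold": 1
        }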

    Use this job to build training data for query classification by joining signals with catalog data.

    id - string (required)

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_); the ID must start with a letter. Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes:

    • key (required) - string. Display name: Parameter Name.

    • value - string. Display name: Parameter Value.
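    Each sparkConfig entry is a key/value pair whose key is a standard Spark property name. A minimal sketch of the resulting fragment (the property values here are illustrative, not recommendations):

        "sparkConfig": [
          { "key": "spark.executor.memory", "value": "4g" },
          { "key": "spark.sql.shuffle.partitions", "value": "8" }
        ]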

    fieldToVectorize - string (required)

    Field containing query strings.

    >= 1 character

    Default: query_s

    dataFormat - string

    Spark-compatible format of the training data (such as 'solr', 'parquet', or 'orc').

    >= 1 character

    Default: solr

    trainingDataFrameConfigOptions - object

    Additional Spark dataframe loading configuration options.

    trainingDataFilterQuery - string

    Solr query for additional filtering of the signals. For non-Solr data sources, use the SPARK SQL FILTER QUERY field under Advanced to filter results.

    Default: *:*

    sparkSQL - string

    Use this field to create a Spark SQL query for filtering your input data. The input data is registered as the table spark_input.

    Default: SELECT * from spark_input
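    Two hedged sketches of these filters follow: the first keeps only one signal type with a Solr query, and the second applies a Spark SQL filter to the registered spark_input table. The fields query_s, doc_id_s, and aggr_count_i are the documented defaults from this reference, but type_s:click is an assumption for illustration; verify it against your own signal schema:

        {
          "trainingDataFilterQuery": "type_s:click",
          "sparkSQL": "SELECT query_s, doc_id_s, aggr_count_i FROM spark_input WHERE aggr_count_i > 5"
        }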

    trainingDataSamplingFraction - number

    Fraction of the training data to use.

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    Seed used for deterministic pseudorandom number generation.

    Default: 1234

    dataOutputFormat - string

    Spark-compatible output format (such as 'solr' or 'parquet').

    >= 1 character

    Default: solr

    partitionCols - string

    If writing to non-Solr sources, this field accepts a comma-delimited list of column names for partitioning the dataframe before writing to the external output.

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes:

    • key (required) - string. Display name: Parameter Name.

    • value - string. Display name: Parameter Value.

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:

    • key (required) - string. Display name: Parameter Name.

    • value - string. Display name: Parameter Value.
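    As a sketch of how the output options fit together, the fragment below reads signals from Solr and writes partitioned Parquet output. The keys zkhost and collection are standard spark-solr read options and compression is a standard Spark Parquet write option, but the values and the partition column names are placeholders:

        {
          "dataOutputFormat": "parquet",
          "partitionCols": "category,country",
          "readOptions": [
            { "key": "zkhost", "value": "zookeeper-1:2181/lwfusion" },
            { "key": "collection", "value": "my_app_signals_aggr" }
          ],
          "writeOptions": [
            { "key": "compression", "value": "snappy" }
          ]
        }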

    catalogPath - string (required)

    Catalog collection or cloud storage path that contains item categories.

    catalogFormat - string (required)

    Spark-compatible format of the catalog data (such as 'solr', 'parquet', or 'orc').

    signalsPath - string (required)

    Signals collection or cloud storage path that contains the signals data.

    outputPath - string (required)

    Output collection or cloud storage path where the generated training data is written.

    categoryField - string (required)

    Item category field in catalog.

    catalogIdField - string (required)

    Item ID field in the catalog, which is used to join with signals.

    itemIdField - string (required)

    Item ID field in signals, which is used to join with the catalog.

    Default: doc_id_s

    countField - string (required)

    Count field in raw or aggregated signals.

    Default: aggr_count_i

    topCategoryProportion - number

    Minimum proportion that the top category must represent among all categories for a query.

    Default: 0.5

    topCategoryThreshold - integer

    Minimum count required for a query/category pair.

    >= 1

    exclusiveMinimum: false

    Default: 1
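    As a hedged illustration of how these two thresholds are commonly interpreted (verify against your own job output): suppose the signals for the query "laptop" aggregate to 80 counts in "Electronics" and 20 counts in "Accessories". The top category proportion is 80 / (80 + 20) = 0.8 and the pair count is 80, so with the settings below the pair ("laptop", "Electronics") is kept as a training example, while ("laptop", "Accessories") falls under both thresholds and is dropped:

        {
          "topCategoryProportion": 0.6,
          "topCategoryThreshold": 25
        }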

    analyzerConfig - string (required)

    The style of text analyzer you would like to use.

    Default:

        {
          "analyzers": [
            {
              "name": "StdTokLowerStop",
              "charFilters": [ { "type": "htmlstrip" } ],
              "tokenizer": { "type": "standard" },
              "filters": [ { "type": "lowercase" } ]
            }
          ],
          "fields": [
            { "regex": ".+", "analyzer": "StdTokLowerStop" }
          ]
        }

    type - string (required)

    Default: build-training

    Allowed values: build-training