Fusion 5.12

    Cluster Labeling Jobs

    Cluster labeling jobs are run against your data collections, and are used:

    • When clusters or well-defined document categories already exist

    • When you want to discover and attach keywords to see representative words within existing clusters

    To create a cluster labeling job, sign in to Fusion and click Collections > Jobs. Then click Add+ and in the Clustering and Outlier Analysis Jobs section, select Cluster Labeling. You can enter basic and advanced parameters to configure the job. If the field has a default value, it is populated when you click to add the job.

    Basic parameters

    To enter advanced parameters in the UI, click Advanced. Those parameters are described in the advanced parameters section.
    • Spark job ID. The unique ID for the Spark job that references this job in the API. This is the id field in the configuration file. Required field.

    • Input/Output Parameters. This section includes these parameters:

      • Training collection. The Solr collection that contains documents associated with defined categories or clusters. The job will be run against this information. This is the trainingCollection field in the configuration file. Required field.

      • Output collection. The Solr collection where the job output is stored. The job will write the output to this collection. This is the outputCollection field in the configuration file. Required field.

      • Data format. The Spark-compatible format of the training data. Options include solr, parquet, and orc. This is the dataFormat field in the configuration file. Required field.

    • Field Parameters. This section includes these parameters:

      • Field to detect keywords from. The field that contains the data that the job will use to discover keywords for the cluster. This is the fieldToVectorize field in the configuration file. Required field.

      • Existing document category field. The field that contains existing cluster IDs or document categories. This is the clusterIdField field in the configuration file. Required field.

      • Top frequent terms field name. The field where the job output stores top frequent terms in each cluster. Terms may overlap with other clusters. This is the freqTermField field in the configuration file. Optional field.

      • Top unique terms field name. The field where the job output stores the top frequent terms that, for the most part, are unique in each cluster. This is the clusterLabelField field in the configuration file. Optional field.

    • Model Tuning Parameters. This section includes these parameters:

      • Max doc support. The maximum number of documents that can contain the term. Values that are <1.0 indicate a percentage, 1.0 is 100 percent, and >1.0 indicates the exact number. This is the maxDF field in the configuration file. Optional field.

      • Min doc support. The minimum number of documents that must contain the term. Values that are <1.0 indicate a percentage, 1.0 is 100 percent, and >1.0 indicates the exact number. This is the minDF field in the configuration file. Optional field.

      • Number of keywords for each cluster. The number of keywords required to label each cluster. This is the numKeywordsPerLabel field in the configuration file. Optional field.

    • Featurization Parameters. This section includes this parameter:

      • Lucene analyzer schema. This is the JSON-encoded Lucene text analyzer schema used for tokenization. This is the analyzerConfig field in the configuration file. Optional field.
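
    As an illustration of how the basic parameters map to configuration fields, the following minimal sketch assembles a job configuration as a Python dictionary and serializes it to JSON. The field names and the type value cluster_labeling come from this page. The helper name build_job_config and the collection and field values (support_docs, body_t, and so on) are hypothetical placeholders, and the sketch does not show how the configuration is submitted to Fusion.

```python
import json

# Required configuration fields, as listed on this page.
REQUIRED_FIELDS = (
    "id", "type", "trainingCollection", "outputCollection",
    "dataFormat", "fieldToVectorize", "clusterIdField",
)


def build_job_config(job_id, training_collection, output_collection,
                     text_field, cluster_id_field, **optional):
    """Assemble a cluster labeling job configuration (hypothetical helper)."""
    config = {
        "id": job_id,
        "type": "cluster_labeling",
        "trainingCollection": training_collection,
        "outputCollection": output_collection,
        "dataFormat": "solr",  # documented default
        "fieldToVectorize": text_field,
        "clusterIdField": cluster_id_field,
    }
    # Optional fields such as minDF, maxDF, or numKeywordsPerLabel.
    config.update(optional)
    missing = [name for name in REQUIRED_FIELDS if not config.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return config


# All values below are placeholders, not real collections or fields.
config = build_job_config(
    "label-support-clusters", "support_docs", "support_docs_labeled",
    "body_t", "cluster_id",
    numKeywordsPerLabel=5, minDF=5, maxDF=0.75,
)
print(json.dumps(config, indent=2))
```

    Checking for missing required fields before serializing catches an empty value locally, rather than relying on the job being rejected later.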

    Advanced parameters

    If you click the Advanced toggle, the following optional fields are displayed in the UI.

    • Spark Settings. The Spark configuration settings include:

      • Spark SQL filter query. This field contains the Spark SQL query that filters your input data. The input data is registered as spark_input, so the default query, SELECT * from spark_input, selects all of the input data. This is the sparkSQL field in the configuration file.

      • Data output format. The format for the job output. The format must be compatible with Spark and options include solr and parquet. This is the dataOutputFormat field in the configuration file.

      • Partition fields. If the job output is written to non-Solr sources, this field contains a comma-delimited list of column names that partition the dataframe before the external output is written. This is the partitionCols field in the configuration file.

    • Read Options. This section lets you enter parameter name:parameter value options to use when reading input from Solr or other sources. This is the readOptions field in the configuration file.

    • Write Options. This section lets you enter parameter name:parameter value options to use when writing output to Solr or other sources. This is the writeOptions field in the configuration file.

    • Dataframe config options. This section includes these parameters:

      • Property name:property value. Each entry defines an additional Spark dataframe loading configuration option. This is the trainingDataFrameConfigOptions field in the configuration file.

      • Training data sampling fraction. This is the fractional amount of the training data the job will use. This is the trainingDataSamplingFraction field in the configuration file.

      • Random seed. The seed used for any deterministic pseudorandom number generation in the job. This is the randomSeed field in the configuration file.

    • Field Parameters. The advanced option adds this parameter:

      • Fields to load. This field contains a comma-delimited list of Solr fields to load. If blank, the job selects the required fields to load at runtime. This is the sourceFields field in the configuration file.

    • Miscellaneous Parameters. This section includes this parameter:

      • Model ID. The unique identifier for the model to be trained. If no value is entered, the Spark Job ID is used. This is the modelId field in the configuration file.

    To create new clusters, use the Document Clustering job.

    Use this job when you already have clusters or well-defined document categories, and you want to discover and attach keywords to see representative words within those existing clusters. (If you want to create new clusters, use the Document Clustering job.)

    id - string, required

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_). The ID must start with a letter. Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
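
    To check an ID before saving a job, the length limit and match pattern above can be applied directly. This is a minimal sketch; the function name is_valid_job_id is hypothetical, and it assumes the pattern must match the entire ID.

```python
import re

# Pattern and length limit copied from the id field description.
JOB_ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_JOB_ID_LENGTH = 63


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id fits the length limit and the full pattern."""
    if len(job_id) > MAX_JOB_ID_LENGTH:
        return False
    # Assumption: the pattern applies to the whole string, not a substring.
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


print(is_valid_job_id("cluster_labeling-01"))  # starts with a letter
print(is_valid_job_id("1st-job"))              # starts with a digit
```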

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }
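
    The sparkConfig, readOptions, and writeOptions fields all use this array of key/value objects. A minimal sketch of producing that shape from an ordinary dictionary follows. The helper name to_key_value_list is hypothetical, and the two Spark properties are only example entries.

```python
def to_key_value_list(options: dict) -> list:
    """Convert {name: value} into the [{"key": ..., "value": ...}] shape."""
    # Both attributes are typed as strings in the schema, so values are cast.
    return [{"key": str(name), "value": str(value)}
            for name, value in options.items()]


# Example entries only; choose properties that suit your cluster.
spark_config = to_key_value_list({
    "spark.executor.memory": "4g",
    "spark.executor.cores": 2,
})
print(spark_config)
```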

    trainingCollection - string, required

    Solr Collection containing documents with defined categories or clusters

    >= 1 characters

    fieldToVectorize - string, required

    Field containing data from which to discover keywords for the cluster

    >= 1 characters

    dataFormat - string, required

    Spark-compatible format that contains training data (like 'solr', 'parquet', 'orc' etc)

    >= 1 characters

    Default: solr

    trainingDataFrameConfigOptions - object

    Additional spark dataframe loading configuration options

    trainingDataFilterQuery - string

    Solr query to use when loading training data if using Solr

    Default: *:*

    sparkSQL - string

    Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input

    Default: SELECT * from spark_input

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    For any deterministic pseudorandom number generation

    Default: 1234
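
    The interaction between trainingDataSamplingFraction and randomSeed can be pictured with a small sketch: a fixed seed makes the same subset come back on every run. This is an illustration in plain Python, not the sampling that Spark performs inside the job; the function name sample_documents and the per-document coin flip are assumptions.

```python
import random


def sample_documents(docs, fraction=1.0, seed=1234):
    """Keep each document with probability `fraction`, reproducibly."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # same seed, same sequence of draws
    # rng.random() is in [0.0, 1.0), so fraction=1.0 keeps every document.
    return [doc for doc in docs if rng.random() < fraction]


docs = [f"doc-{i}" for i in range(100)]
first = sample_documents(docs, fraction=0.5)
second = sample_documents(docs, fraction=0.5)
print(first == second)  # the seed is fixed, so both runs agree
```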

    outputCollection - string, required

    Solr Collection to store output data to

    >= 1 characters

    dataOutputFormat - string

    Spark-compatible output format (like 'solr', 'parquet', etc)

    >= 1 characters

    Default: solr

    sourceFields - string

    Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

    partitionCols - string

    If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }

    modelId - string

    Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

    >= 1 characters

    clusterIdField - string, required

    Field that contains your existing cluster IDs or document categories.

    >= 1 characters

    analyzerConfig - string

    LuceneTextAnalyzer schema for tokenization (JSON-encoded)

    >= 1 characters

    Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
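
    Because analyzerConfig is a string that itself contains JSON, the schema ends up encoded twice when the job configuration is serialized. The sketch below rebuilds the default schema shown above as a Python dictionary and encodes it. The variable names are illustrative only.

```python
import json

# The default analyzer schema from this page, written as a Python dict.
analyzer_schema = {
    "analyzers": [{
        "name": "StdTokLowerStop",
        "charFilters": [{"type": "htmlstrip"}],
        "tokenizer": {"type": "standard"},
        "filters": [
            {"type": "lowercase"},
            {"type": "KStem"},
            {"type": "length", "min": "2", "max": "32767"},
            {"type": "fusionstop", "ignoreCase": "true", "format": "snowball",
             "words": "org/apache/lucene/analysis/snowball/english_stop.txt"},
        ],
    }],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],
}

# First encoding: the schema becomes the string value of analyzerConfig.
analyzer_config = json.dumps(analyzer_schema)

# Second encoding: the enclosing job configuration escapes that string.
job_fragment = json.dumps({"analyzerConfig": analyzer_config})
print(job_fragment)
```

    Editing the schema as a dictionary and encoding it at the end avoids hand-escaping quotation marks inside the string.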

    clusterLabelField - string

    Output field name for top frequent terms that are (mostly) unique for each cluster.

    Default: cluster_label

    freqTermField - string

    Output field name for top frequent terms in each cluster. These may overlap with other clusters.

    Default: freq_terms
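
    The difference between the two output fields can be illustrated with a toy example. The sketch below is not Fusion's labeling algorithm: it uses whitespace tokenization and raw counts, and it treats a term as unique only if no other cluster contains it, all of which are simplifying assumptions. The function name label_clusters and the sample documents are hypothetical.

```python
from collections import Counter


def label_clusters(docs, k=2):
    """docs is a list of (cluster_id, text) pairs; returns labels per cluster."""
    counts = {}
    for cluster_id, text in docs:
        counts.setdefault(cluster_id, Counter()).update(text.lower().split())

    result = {}
    for cluster_id, counter in counts.items():
        # Every term that occurs in any other cluster.
        others = set()
        for other_id, other_counter in counts.items():
            if other_id != cluster_id:
                others.update(other_counter)
        # Rank by descending count, then alphabetically for stable output.
        ranked = sorted(counter, key=lambda term: (-counter[term], term))
        result[cluster_id] = {
            "freq_terms": ranked[:k],  # may overlap with other clusters
            "cluster_label": [t for t in ranked if t not in others][:k],
        }
    return result


docs = [
    ("sports", "football match score"),
    ("sports", "football team score"),
    ("tech", "laptop battery score"),
    ("tech", "laptop screen"),
]
labels = label_clusters(docs)
print(labels["sports"])
```

    In this toy data, "score" is frequent in the sports cluster but also occurs in the tech cluster, so it appears among the frequent terms and not among the unique ones.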

    minDF - number

    Minimum number of documents in which the term must appear. A value <1.0 denotes a percentage, a value of 1.0 denotes 100%, and a value >1.0 denotes the exact number.

    Default: 5

    maxDF - number

    Maximum number of documents in which the term can appear. A value <1.0 denotes a percentage, a value of 1.0 denotes 100%, and a value >1.0 denotes the exact number.

    Default: 0.75
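
    The rule shared by minDF and maxDF can be worked through with the default values. The sketch below assumes that both bounds are inclusive and that no rounding is applied; neither detail is stated on this page, and the function names are hypothetical.

```python
def doc_support_threshold(value: float, total_docs: int) -> float:
    """Translate a minDF or maxDF setting into a document count."""
    if value <= 0:
        raise ValueError("value must be positive")
    if value <= 1.0:
        # Values up to 1.0 are a share of all documents (1.0 is 100%).
        return value * total_docs
    # Values above 1.0 are an exact number of documents.
    return value


def term_is_kept(doc_freq, total_docs, min_df=5, max_df=0.75):
    """Assumption: a term is kept when min <= doc_freq <= max."""
    low = doc_support_threshold(min_df, total_docs)
    high = doc_support_threshold(max_df, total_docs)
    return low <= doc_freq <= high


# With 1,000 documents, the defaults keep terms found in 5 to 750 documents.
print(doc_support_threshold(5, 1000), doc_support_threshold(0.75, 1000))
print(term_is_kept(3, 1000), term_is_kept(400, 1000), term_is_kept(900, 1000))
```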

    norm - integer

    p-norm to normalize vectors with (choose -1 to turn normalization off)

    Default: 2

    Allowed values: -1, 0, 1, 2
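
    For readers unfamiliar with p-norm normalization, the sketch below shows the standard calculation for p = 1 and p = 2, with -1 returning the vector unchanged. It is a generic illustration, not the job's internal code. How the job treats a value of 0 is not described on this page, so the sketch does not cover it.

```python
def normalize(vector, p=2):
    """Scale a vector so that its p-norm is 1 (p = 1 or 2); -1 disables."""
    if p == -1:
        return list(vector)  # normalization turned off
    if p not in (1, 2):
        raise ValueError("this sketch covers only p = 1, p = 2, and -1")
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # an all-zero vector cannot be scaled
    return [x / norm for x in vector]


print(normalize([3.0, 4.0], p=2))   # Euclidean length becomes 1
print(normalize([3.0, 4.0], p=1))   # absolute values sum to 1
print(normalize([3.0, 4.0], p=-1))  # unchanged
```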

    numKeywordsPerLabel - integer

    Number of Keywords needed for labeling each cluster.

    Default: 5

    type - string, required

    Default: cluster_labeling

    Allowed values: cluster_labeling