Product Selector

Fusion 5.9
    Fusion 5.9

    Document Clustering Jobs

    Cluster a set of documents and attach cluster labels.

    Use this job when you want to cluster a set of documents and attach cluster labels based on topics.

    id - stringrequired

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_)

    <= 128 characters

    Match pattern: ^[A-Za-z0-9_\-]+$

    trainingCollection - stringrequired

    Solr Collection containing documents to be clustered

    >= 1 characters

    fieldToVectorize - stringrequired

    Solr field containing text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2 etc.

    >= 1 characters

    dataFormat - string

    Spark-compatible format which training data comes in (like 'solr', 'hdfs', 'file', 'parquet' etc)

    Default: solr

    Allowed values: solrhdfsfileparquet

    trainingDataFrameConfigOptions - object

    Additional spark dataframe loading configuration options

    trainingDataFilterQuery - string

    Solr query to use when loading training data

    >= 3 characters

    Default: *:*

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    For any deterministic pseudorandom number generation

    Default: 1234

    outputCollection - stringrequired

    Solr Collection to store model-labeled data to

    >= 1 characters

    sourceFields - string

    Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

    uidField - stringrequired

    Field containing the unique ID for each document.

    >= 1 characters

    Default: id

    clusterIdField - string

    Output field name for unique cluster id.

    Default: cluster_id

    clusterLabelField - string

    Output field name for top frequent terms that are (mostly) unique for each cluster.

    Default: cluster_label

    freqTermField - string

    Output field name for top frequent terms in each cluster. These may overlap with other clusters.

    Default: freq_terms

    distToCenterField - string

    Output field name for doc distance to its corresponding cluster center (measure how representative the doc is).

    Default: dist_to_center

    minDF - number

    Min number of documents the term has to show up. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

    Default: 5

    maxDF - number

    Max number of documents the term can show up. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

    Default: 0.5

    kExact - integer

    Exact number of clusters.

    Default: 0

    kMax - integer

    Max possible number of clusters.

    Default: 20

    kMin - integer

    Min possible number of clusters.

    Default: 2

    docLenTrim - boolean

    Whether to separate out docs with extreme lengths.

    Default: true

    outlierTrim - boolean

    Whether to perform outlier detection.

    Default: true

    shortLen - number

    Length threshold to define short document. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

    Default: 5

    longLen - number

    Length threshold to define long document. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

    Default: 0.99

    numKeywordsPerLabel - integer

    Number of Keywords needed for labeling each cluster.

    Default: 5

    modelId - string

    Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

    >= 1 characters

    w2vDimension - integer

    Word-vector dimensionality to represent text (choose > 0 to use, suggested dimension ranges: 100~150)

    exclusiveMinimum: false

    Default: 0

    w2vWindowSize - integer

    The window size (context words from [-window, window]) for word2vec

    >= 3

    exclusiveMinimum: false

    Default: 8

    norm - integer

    p-norm to normalize vectors with (choose -1 to turn normalization off)

    Default: 2

    Allowed values: -1012

    analyzerConfig - string

    LuceneTextAnalyzer schema for tokenization (JSON-encoded)

    >= 1 characters

    Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "patternreplace", "pattern": "^[\\d.]+$", "replacement": " ", "replace": "all" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}

    clusteringMethod - string

    Choose between hierarchical vs kmeans clustering.

    Default: hierarchical

    outlierK - integer

    Number of clusters to help find outliers.

    Default: 10

    outlierThreshold - number

    Identify as outlier group if less than this percent of total documents. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

    Default: 0.01

    minDivisibleSize - number

    Clusters must have at least this many documents to be split further. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

    Default: 0

    kDiscount - number

    Applies a discount to help favor large/small K (number of clusters). A smaller value pushes K to assume a higher value within the [min, max] K range.

    Default: 0.7

    type - stringrequired

    Default: doc_clustering

    Allowed values: doc_clustering