Job configuration specifications
Input
Output
| Field | Description |
| --- | --- |
| cluster_id | The IDs associated with each cluster so that we can easily identify the clusters by number. A negative number means the document is an outlier or extremely long. |
| cluster_label, cluster_label_txt | Unique keywords assigned to each cluster so that there are no overlapping keywords between the clusters. The cluster label long_doc is applied to very lengthy documents. The cluster label short_doc is applied to very short documents. Outliers are grouped with cluster labels like outlier_group0, outlier_group1, and so on. |
| dist_to_center | The document's distance from its corresponding cluster center. The shorter the distance, the closer the document is to the center of the cluster. |
| freq_terms, freq_terms_txt | The most frequent words in the cluster. |
| clustering_model_id | The ID of the Document Clustering job that attached these fields. |
The dist_to_center field can be used to sort documents by similarity to their cluster_label.
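One way to apply this sort on the client side is sketched below in Python. Only the field names cluster_label and dist_to_center come from the job output; the sample documents and the helper name rank_within_clusters are hypothetical.

```python
from collections import defaultdict


def rank_within_clusters(docs):
    """Group documents by cluster_label and sort each group by dist_to_center."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["cluster_label"]].append(doc)
    # A smaller dist_to_center means the document is closer to the cluster center.
    return {
        label: sorted(members, key=lambda d: d["dist_to_center"])
        for label, members in groups.items()
    }


# Hypothetical sample documents, for illustration only.
docs = [
    {"id": "a", "cluster_label": "solar panel", "dist_to_center": 0.42},
    {"id": "b", "cluster_label": "solar panel", "dist_to_center": 0.10},
    {"id": "c", "cluster_label": "wind turbine", "dist_to_center": 0.30},
    {"id": "d", "cluster_label": "solar panel", "dist_to_center": 0.25},
]

ranked = rank_within_clusters(docs)
for label, members in ranked.items():
    print(label, [d["id"] for d in members])
```

The first document in each list is the one closest to its cluster center, so it is usually the most representative of that cluster.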
Frequent terms are the terms that appear most frequently in documents in a given cluster. Different clusters may have overlapping frequent terms. Some of the frequent terms may also appear in the cluster label.
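To illustrate how frequent terms can overlap between clusters, the following Python sketch counts the most common tokens per cluster over made-up data. It only illustrates the idea: the whitespace tokenization, the sample documents, the text field, and the helper name top_terms_per_cluster are assumptions, not the job's actual implementation.

```python
from collections import Counter, defaultdict


def top_terms_per_cluster(docs, n=5):
    """Return the n most frequent tokens for each cluster_id."""
    counts = defaultdict(Counter)
    for doc in docs:
        # Naive lowercase/whitespace tokenization, assumed for this sketch.
        counts[doc["cluster_id"]].update(doc["text"].lower().split())
    return {
        cluster_id: [term for term, _ in counter.most_common(n)]
        for cluster_id, counter in counts.items()
    }


# Hypothetical sample documents, for illustration only.
docs = [
    {"cluster_id": 0, "text": "battery charge battery life"},
    {"cluster_id": 0, "text": "battery charger cable"},
    {"cluster_id": 1, "text": "screen battery screen glass"},
    {"cluster_id": 1, "text": "screen protector"},
]

terms = top_terms_per_cluster(docs, n=2)
# Terms that are frequent in both clusters.
shared = set(terms[0]) & set(terms[1])
print(terms, shared)
```

In this sample, "battery" is frequent in both clusters, which is the kind of overlap that frequent terms allow and cluster labels do not.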
If kMin=2 and kMax=100, then the job searches through 2, 7, 12, …, 100 with a step size of 5. A large kMax can increase the running time. The algorithm incurs a penalty if k is unnecessarily large. You can use the kDiscount parameter to reduce this penalty and allow a larger k to be chosen. However, if kMax is small (for example, 10 or less), then not using a discount (kDiscount=1) is recommended.

Using the analyzerConfig parameter, you can specify stopword deletion, stemming, short token treatment, and regular expressions. The analyzerConfig is used in the featurization step.

Outliers are grouped into outlier_group0, outlier_group1 … outlier_groupN, where N is the value of Number of Outlier Groups.

The cluster_label field contains the five terms that best define the center of each cluster.

The freq_terms field contains the five terms that appear most often in each cluster.

The dist_to_center field can be used to sort documents by similarity to their cluster labels.

Outliers appear in the long_doc cluster and any outlier_group<n> clusters. If no outliers are detected, consider increasing outlierK or outlierThreshold.
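A quick way to check whether any outliers were detected is to count documents by label. The Python sketch below assumes labels follow the long_doc and outlier_group<n> naming described above; the sample labels and the helper name count_outliers are hypothetical.

```python
import re
from collections import Counter

# Matches outlier_group0, outlier_group1, and so on.
OUTLIER_LABEL = re.compile(r"^outlier_group\d+$")


def count_outliers(labels):
    """Count documents whose label is long_doc or outlier_group<n>."""
    counts = Counter()
    for label in labels:
        if label == "long_doc" or OUTLIER_LABEL.match(label):
            counts[label] += 1
    return counts


# Hypothetical cluster_label values, one per document.
labels = ["solar panel", "long_doc", "outlier_group0", "outlier_group0", "wind turbine"]

counts = count_outliers(labels)
if counts:
    print(dict(counts))
else:
    print("No outliers detected; consider increasing outlierK or outlierThreshold.")
```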