Fusion 5.12
    Classification Jobs

    This job analyzes how your existing documents are categorized and produces a classification model that can be used to predict the categories of new documents at index time.

    For detailed configuration instructions and examples, see Classify New Documents at Index Time or Classify New Queries.

    This job takes raw text and an associated single class as input. Although it trains on single classes, there is an option to predict the top several classes with their scores.

    At a minimum, you must configure these:

    • An ID for this job

    • A Method; Logistic Regression is the default

    • A Model Deployment Name

    • The Training Collection

    • The Training collection content field, the document field containing the raw text

    • The Training collection class field containing the classes, labels, or other category data for the text
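
    As a rough illustration of the minimum settings listed above, the sketch below builds a job configuration object. The property names come from the parameter reference on this page; every value (job ID, model name, collection, and field names) is a hypothetical placeholder, not a recommended setting.

```javascript
// Minimal job configuration sketch. Property names are taken from the
// parameter reference on this page; all values are hypothetical placeholders.
const jobConfig = {
  id: "classify-products",               // hypothetical job ID
  type: "argo-classification",
  workflowType: "Logistic Regression",   // the default method
  deployModelName: "product-classifier", // hypothetical model deployment name
  trainingCollection: "products",        // hypothetical training collection
  trainingFormat: "solr",
  textField: "body_t",                   // field containing the raw text
  labelField: "category_s"               // hypothetical class field
};

console.log(JSON.stringify(jobConfig, null, 2));
```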

    Lucidworks offers free training to help you get started with Fusion. Check out the Classification course, which focuses on understanding the different classifier models in Fusion:

    Classification

    Visit the LucidAcademy to see the full training catalog.

    Classification at index time

    Used at index time, a classification model can be applied to predict the categories of new, incoming documents. To train a model for this use case, use your main content collection as the training collection. The model requires at least 100 examples in the training data for each category predicted.

    Figure: Document classification job dataflow

    Once you have run the job, you can specify the model name in the Machine Learning Index Stage.

    Job flow

    The first part of the job is vectorization, which is the same for all available classification algorithms. It supports two main types of featurization:

    • Character-based - for queries or short texts, like document titles, sentences, and so on.

    • Word-based - for long texts like paragraphs, documents, and so on.
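
    To make the difference between the two featurization types concrete, here is a minimal sketch of character and word n-gram extraction. The function names are hypothetical, and the tokenizer inside the job may differ (for example, in how it treats whitespace and punctuation).

```javascript
// Collect all n-grams of sizes minN..maxN from a list of tokens.
function ngrams(tokens, minN, maxN, joiner) {
  const out = [];
  for (let n = minN; n <= maxN; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      out.push(tokens.slice(i, i + n).join(joiner));
    }
  }
  return out;
}

// Character-based: every character is a token.
function charNgrams(text, minN, maxN) {
  return ngrams(Array.from(text), minN, maxN, "");
}

// Word-based: split on whitespace (an assumption made for illustration).
function wordNgrams(text, minN, maxN) {
  const words = text.trim().split(/\s+/).filter(w => w.length > 0);
  return ngrams(words, minN, maxN, " ");
}

console.log(charNgrams("shoe", 2, 3));
// [ 'sh', 'ho', 'oe', 'sho', 'hoe' ]
console.log(wordNgrams("red running shoes", 1, 2));
// [ 'red', 'running', 'shoes', 'red running', 'running shoes' ]
```

    Even a four-character string yields five character n-grams here, which is why character-based featurization on long texts grows the vocabulary quickly.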

    The second part is classification algorithms:

    • Logistic Regression. A classical algorithm with a good trade-off between training speed and results quality. It provides a robust baseline out of the box. Consider using it as a first choice.

    • StarSpace. A deep learning algorithm that is trained to jointly maximize the similarity between a text and its correct class and minimize the similarity between the text and incorrect classes. It usually requires more tuning and training time than Logistic Regression, but can produce more accurate results. Consider using it, and then tuning it, if better results are needed.

    The third part of the job deploys the new classification model to Fusion using Seldon Core.

    Best practices

    These tips describe how to tune the options under Vectorization Parameters for best results with different use cases.

    Query intent / short texts

    If you want to train a model to predict query intents or to do short text classification, then enable Use Characters.

    Another vectorization parameter that can improve model quality is Max Ngram size, with reasonable defaults between 3 and 5.

    The more character ngrams are used, the bigger the vocabulary becomes, so it is worthwhile to tune the Maximum Vocab Size parameter, which controls how many unique tokens are used. Lower values make training faster and help prevent overfitting, but might also lower model quality. It’s important to find a good balance.

    Activating the advanced Sublinear TF option usually helps if characters are used.

    Documents / long texts

    If you want to train a model to predict classes for documents or long texts like one or more paragraphs, then uncheck Use Characters.

    Reasonable values for the word-based Max Ngram size are 2 to 3. Be sure to tune the Maximum Vocab Size parameter too. It is usually better to leave the advanced Sublinear TF option deactivated.

    Performance tuning

    If the text is very long and Use Characters is checked, the job may take a lot of memory and possibly fail if the amount of memory requested by the job is not available. This may result in pods being evicted or failing with OOM errors. If you see this happening, try the following:

    • Uncheck Use Characters.

    • Reduce the vocabulary size and ngram range of the documents.

    • Allocate more memory to the pod.

    Algorithm-specific

    If you train a model with the Logistic Regression algorithm, dimensionality reduction usually does not help, so leave Reduce Dimensionality unchecked. Scaling tends to improve results, so activating Scale Features is suggested.

    For models trained with the StarSpace algorithm, the opposite holds. Dimensionality reduction usually gives better results and much faster model training, while scaling usually does not help and might make results slightly worse.

    Index pipeline configuration

    Model input transformation script

    /*
    Name of the document field to feed into the model.
    */
    var documentFeatureField = "body_t"
    
    /*
    Model input construction.
    */
    var modelInput = new java.util.HashMap()
    modelInput.put("text", doc.getFirstFieldValue(documentFeatureField))
    modelInput

    Model output transformation script

    var top1ClassField = "top_1_class_s"
    var top1ScoreField = "top_1_score_d"
    
    doc.addField(top1ClassField, modelOutput.get("top_1_class")[0])
    doc.addField(top1ScoreField, modelOutput.get("top_1_score")[0])
    
    // Alternative script, if top_k predictions are needed
    var top1ClassField = "top_1_class_s"
    var top1ScoreField = "top_1_score_d"
    var topKClassesField = "top_k_classes_ss"
    var topKScoresField = "top_k_scores_ds"
    
    var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
    var parsedOutput = {};
    for (var i=0; i<jsonOutput["names"].length;i++){
      parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
    }
    
    doc.addField(top1ClassField, parsedOutput["top_1_class"][0])
    doc.addField(top1ScoreField, parsedOutput["top_1_score"][0])
    if ("top_k_classes" in parsedOutput) {
        doc.addField(topKClassesField, new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
        doc.addField(topKScoresField, new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
    }
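
    The top_k variant above pairs each entry of "names" with the entry of "ndarray" at the same position. The standalone sketch below replays that step outside of Fusion. The sample payload is hypothetical: its shape is only inferred from how the script above indexes the parsed output, and a real response may contain other fields.

```javascript
// Hypothetical raw response, shaped like the "names"/"ndarray" payload
// that the script above reads from modelOutput.get("_rawJsonResponse").
const rawJsonResponse = JSON.stringify({
  names: ["top_1_class", "top_1_score", "top_k_classes", "top_k_scores"],
  ndarray: [["shoes"], [0.91], [["shoes", "boots"]], [[0.91, 0.06]]]
});

// Pair each name with the array at the same position.
function parseModelOutput(raw) {
  const json = JSON.parse(raw);
  const parsed = {};
  json.names.forEach((name, i) => {
    parsed[name] = json.ndarray[i];
  });
  return parsed;
}

const parsedOutput = parseModelOutput(rawJsonResponse);
console.log(parsedOutput.top_1_class[0]);   // shoes
console.log(parsedOutput.top_k_classes[0]); // [ 'shoes', 'boots' ]
```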

    Query pipeline configuration

    Model input transformation script

    var modelInput = new java.util.HashMap()
    modelInput.put("text", request.getFirstParam("q"))
    modelInput

    Model output transformation script

    // To put into request
    request.putSingleParam("class", modelOutput.get("top_1_class")[0])
    request.putSingleParam("score", modelOutput.get("top_1_score")[0])
    
    // Or, for example, to apply a filter query
    request.putSingleParam("fq", "class:" + modelOutput.get("top_1_class")[0])
    
    // To put into the query context
    context.put("class", modelOutput.get("top_1_class")[0])
    context.put("score", modelOutput.get("top_1_score")[0])
    
    // To put into response documents (possible only after the Solr Query stage)
    var docs = response.get().getInnerResponse().getDocuments();
    var ndocs = new java.util.ArrayList();
    
    for (var i=0; i<docs.length;i++){
      var doc = docs[i];
      doc.putField("query_class", modelOutput.get("top_1_class")[0])
      doc.putField("query_score", modelOutput.get("top_1_score")[0])
      ndocs.add(doc);
    }
    
    response.get().getInnerResponse().updateDocuments(ndocs);
    
    // Alternative, if top_k predictions are needed:
    // to put them into response documents (possible only after the Solr Query stage)
    var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
    var parsedOutput = {};
    for (var i=0; i<jsonOutput["names"].length;i++){
      parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
    }
    
    var docs = response.get().getInnerResponse().getDocuments();
    var ndocs = new java.util.ArrayList();
    for (var i=0; i<docs.length;i++){
      var doc = docs[i];
      doc.putField("top_1_class", parsedOutput["top_1_class"][0])
      doc.putField("top_1_score", parsedOutput["top_1_score"][0])
      if ("top_k_classes" in parsedOutput) {
        doc.putField("top_k_classes", new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
        doc.putField("top_k_scores", new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
      }
      ndocs.add(doc);
    }
    response.get().getInnerResponse().updateDocuments(ndocs);

    Trains a classification model to classify text documents by assigning a label to them.

    id - string, required

    The ID for this job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-) and underscore (_). The ID must start with a letter.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
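
    A quick way to check a candidate job ID against the two constraints above is sketched below. The function name is hypothetical, and the sketch assumes the pattern is applied to the whole string.

```javascript
// Pattern from the reference above, anchored to the whole string (assumption).
const JOB_ID_PATTERN = /^[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?$/;

// Returns true if the ID is at most 63 characters and matches the pattern.
function isValidJobId(id) {
  return id.length <= 63 && JOB_ID_PATTERN.test(id);
}

console.log(isValidJobId("classify-products_v2")); // true
console.log(isValidJobId("2fast"));                // false: starts with a digit
console.log(isValidJobId("has space"));            // false: space is not allowed
```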

    sparkConfig - array[object]

    Provide additional key/value pairs to be injected into the training JSON map at runtime. Values will be inserted as-is, so use " to surround string values

    object attributes:

    • key (required): display name "Parameter Name", type string

    • value: display name "Parameter Value", type string

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes:

    • key (required): display name "Parameter Name", type string

    • value: display name "Parameter Value", type string

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:

    • key (required): display name "Parameter Name", type string

    • value: display name "Parameter Value", type string

    stopwordsBlobName - string

    Name of the stopwords blob resource. This is a .txt file with one stopword per line. The default file is stopwords/stopwords_en.txt, but a custom file can also be used. Check the documentation for details on the format and on uploading to the blob store.

    Default: stopwords/stopwords_en.txt

    trainingCollection - string, required

    Solr collection or cloud storage path where training data is present.

    >= 1 characters

    trainingFormat - string, required

    The format of the training data, such as solr or parquet.

    >= 1 characters

    Default: solr

    secretName - string

    Name of the secret used to access cloud storage as defined in the K8s namespace

    >= 1 characters

    textField - string, required

    Solr field name containing the text to be classified

    >= 1 characters

    labelField - string, required

    Solr field name containing the classes/labels for the text

    >= 1 characters

    trainingDataFilterQuery - string

    Solr or SQL query to filter training data. Use a Solr query when a Solr collection is specified in Training Path. Use a SQL query when a cloud storage location is specified. The table name for SQL is `spark_input`.

    randomSeed - integer

    Seed for the pseudorandom number generator. Keep this seed constant to get deterministic results across runs.

    Default: 12345

    trainingSampleFraction - number

    Choose a fraction of the data for training.

    <= 1

    exclusiveMaximum: false

    Default: 1

    deployModelName - string, required

    Name of the model to be used for deployment (must be a valid lowercased DNS subdomain with no underscores).

    <= 30 characters

    Match pattern: ^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$
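
    The following sketch checks a candidate model deployment name against the length limit and match pattern above. The function name is hypothetical; the pattern is copied from this reference.

```javascript
// Pattern from the reference above: lowercase DNS subdomain, no underscores.
const MODEL_NAME_PATTERN =
  /^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$/;

// Returns true if the name is at most 30 characters and matches the pattern.
function isValidModelName(name) {
  return name.length <= 30 && MODEL_NAME_PATTERN.test(name);
}

console.log(isValidModelName("product-classifier")); // true
console.log(isValidModelName("Product_Classifier")); // false: uppercase and underscore
console.log(isValidModelName("classifier-"));        // false: ends with a dash
```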

    workflowType - string, required

    Method to be used for classification.

    Default: Logistic Regression

    Allowed values: Logistic Regression, Starspace

    minCharLen - integer

    Minimum length, in characters, for the text to be included into training.

    >= 1

    exclusiveMinimum: false

    Default: 2

    maxCharLen - integer

    Maximum length, in characters, of the training text. Texts longer than this value will be truncated.

    >= 1

    exclusiveMinimum: false

    Default: 100000
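
    The combined effect of minCharLen and maxCharLen can be pictured with the sketch below. It is an illustration only: the function name is hypothetical, and it assumes the minimum is inclusive and that truncation keeps the start of the text.

```javascript
// Drop texts shorter than minCharLen, then truncate texts longer than maxCharLen.
function applyLengthLimits(texts, minCharLen, maxCharLen) {
  return texts
    .filter(t => t.length >= minCharLen)  // assumption: minimum is inclusive
    .map(t => t.slice(0, maxCharLen));    // assumption: keep the leading characters
}

console.log(applyLengthLimits(["a", "ok", "abcdefgh"], 2, 5));
// [ 'ok', 'abcde' ]
```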

    lowercaseTexts - boolean

    Select if you want the text to be lowercased

    Default: true

    unidecodeTexts - boolean

    Select if you want the text to be unidecoded, that is, transliterated from Unicode to its closest ASCII representation

    Default: true

    minClassSize - integer

    Minimum number of samples a class must have to be included in training. Classes with fewer samples are dropped along with all their samples.

    >= 2

    exclusiveMinimum: false

    Default: 5
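
    A minimal sketch of the filtering that minClassSize describes follows. The function name and the sample record shape ({ text, label }) are hypothetical.

```javascript
// Keep only samples whose class has at least minClassSize samples.
function dropSmallClasses(samples, minClassSize) {
  const counts = new Map();
  for (const s of samples) {
    counts.set(s.label, (counts.get(s.label) || 0) + 1);
  }
  return samples.filter(s => counts.get(s.label) >= minClassSize);
}

const samples = [
  { text: "red shoes", label: "shoes" },
  { text: "blue shoes", label: "shoes" },
  { text: "wool hat", label: "hats" }
];
console.log(dropSmallClasses(samples, 2).map(s => s.label));
// [ 'shoes', 'shoes' ]
```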

    valSize - number

    Size of the validation dataset. Provide a float in (0, 1) to sample a fraction of the records, or an integer >= 1 to sample an exact number of records.

    Default: 0.1
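
    The two ways of reading valSize can be sketched as follows. The function name is hypothetical, and the rounding and capping behavior are assumptions made for illustration.

```javascript
// Turn a valSize setting into a number of validation records.
function validationCount(valSize, totalRecords) {
  if (valSize > 0 && valSize < 1) {
    // Fraction of the data (rounding is an assumption).
    return Math.round(valSize * totalRecords);
  }
  if (Number.isInteger(valSize) && valSize >= 1) {
    // Exact number of records, capped at what is available (assumption).
    return Math.min(valSize, totalRecords);
  }
  throw new RangeError("valSize must be a float in (0, 1) or an integer >= 1");
}

console.log(validationCount(0.1, 2000)); // 200
console.log(validationCount(500, 2000)); // 500
```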

    topK - integer

    Number of most probable output classes to assign to each sample along with their scores.

    >= 1

    exclusiveMinimum: false

    Default: 1
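
    Conceptually, topK selects the highest-scoring classes for each sample. The sketch below shows that selection on a hypothetical score map; it is not the job's own code.

```javascript
// Return the k highest-scoring classes with their scores, best first.
function topK(scoresByClass, k) {
  return Object.entries(scoresByClass)
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([label, score]) => ({ label, score }));
}

console.log(topK({ hats: 0.1, shoes: 0.7, boots: 0.2 }, 2));
// [ { label: 'shoes', score: 0.7 }, { label: 'boots', score: 0.2 } ]
```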

    featurizerType - string

    The type of featurizer to use. TFIDF will compute both term-frequency and inverse document-frequency, whereas Count will use only term-frequency

    Default: tfidf

    Allowed values: tfidf, count

    useCharacters - boolean

    Whether to use the characters or word analyzer. Use words if the text is long. Using characters on long text can significantly increase vectorization time and memory requirements.

    Default: true

    minDf - number

    Minimum document frequency (DF) for a token to be considered. Provide a float in (0, 1) to specify a fraction of documents, or an integer >= 1 to specify the exact number of documents in which a token must occur.

    Default: 1

    maxDf - number

    Maximum document frequency (DF) for a token to be considered. Provide a float in (0, 1) to specify a fraction of documents, or an integer >= 1 to specify the exact number of documents in which a token may occur.

    Default: 0.8
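
    The sketch below shows how minDf and maxDf could prune a vocabulary by document frequency. The function names and the sample counts are hypothetical, and inclusive bounds are an assumption.

```javascript
// Convert a minDf/maxDf setting into an absolute document count.
function toDocCount(df, numDocs) {
  return df > 0 && df < 1 ? df * numDocs : df;
}

// Keep tokens whose document frequency lies within [minDf, maxDf].
function filterByDocFrequency(docFreqs, numDocs, minDf, maxDf) {
  const lo = toDocCount(minDf, numDocs);
  const hi = toDocCount(maxDf, numDocs);
  return Object.keys(docFreqs).filter(t => docFreqs[t] >= lo && docFreqs[t] <= hi);
}

// Hypothetical counts: number of documents (out of 100) containing each token.
const docFreqs = { the: 95, shoe: 40, zyx: 1 };
console.log(filterByDocFrequency(docFreqs, 100, 1, 0.8)); // [ 'shoe', 'zyx' ]
console.log(filterByDocFrequency(docFreqs, 100, 2, 0.8)); // [ 'shoe' ]
```

    With the defaults (minDf 1, maxDf 0.8), only tokens that occur in more than 80% of documents are removed.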

    minNgram - integer

    Minimum word or character ngram size to be used.

    >= 1

    exclusiveMinimum: false

    maxNgram - integer

    Maximum word or character ngram size to be used.

    >= 1

    exclusiveMinimum: false

    maxFeatures - integer

    Maximum number of tokens (including word or character ngrams) to consider for the vocabulary. Less frequent tokens will be omitted.

    >= 1

    exclusiveMinimum: false

    Default: 250000

    norm - string

    Select the norm method to use.

    Default: None

    Allowed values: None, L1, L2

    smoothIdf - boolean

    Smooth IDF weights by adding one to document frequencies. Prevents zero divisions.

    Default: true

    sublinearTf - boolean

    Whether to apply sublinear scaling to TF, i.e. replace tf with 1 + log(tf). It usually helps when characters are used.

    Default: true
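
    The sublinear scaling named above is small enough to show directly. The sketch assumes the natural logarithm and leaves a zero count at zero; the function name is hypothetical.

```javascript
// Replace a raw term frequency with 1 + log(tf); zero counts stay zero.
function sublinearTf(tf) {
  return tf > 0 ? 1 + Math.log(tf) : 0;
}

for (const tf of [1, 10, 100]) {
  console.log(tf, sublinearTf(tf).toFixed(3));
}
```

    A token that occurs 100 times is weighted only a few times more than a token that occurs once, which dampens very frequent character ngrams.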

    scaling - boolean

    Whether to apply standard scaling, (X - mean(X)) / std(X), to the features. If the feature vector is sparse (no dimensionality reduction is used), only division by the standard deviation is applied.

    Default: true
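
    The two scaling modes described above can be compared on a single feature column. This is a sketch under assumptions: the function names are hypothetical, and the population standard deviation is used.

```javascript
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation (assumption).
function std(xs) {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map(x => (x - m) * (x - m))));
}

// center = true:  (x - mean) / std, for dense features.
// center = false: x / std only, so zeros in sparse features stay zero.
function scaleColumn(xs, center) {
  const m = mean(xs);
  const s = std(xs) || 1; // avoid dividing by zero for constant columns
  return xs.map(x => (center ? x - m : x) / s);
}

console.log(scaleColumn([1, 2, 3], true));
console.log(scaleColumn([0, 0, 3], false));
```

    Skipping the mean subtraction keeps zero entries at zero, which is why it suits sparse vectors.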

    dimReduction - boolean

    Whether to perform dimensionality reduction or not. Truncated SVD is used to reduce dimensionality. Reduces overfitting and training time. Note that sparse vectors will become dense.

    Default: false

    dimReductionSize - integer

    The target dimension size of the features after dimensionality reduction.

    >= 1

    exclusiveMinimum: false

    Default: 256

    penalty - string

    Specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only the l2 penalty. ‘elasticnet’ is supported only by the ‘saga’ solver. Select none if you don't want to regularize (this is not supported by the `liblinear` solver).

    Default: l2

    Allowed values: l1, l2, elsaticnet, none

    l1Ratio - number

    Only used with the `elasticnet` penalty. If its value is 0, the l2 penalty is used. If its value is 1, the l1 penalty is used. A value in between uses the appropriate ratio of l1 and l2 penalties.

    <= 1

    exclusiveMaximum: false

    Default: 0.5
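
    One common way to write an elastic net penalty is as a weighted mix of the l1 and l2 terms. The sketch below uses that formulation for illustration; the exact expression and scaling used by the job's solver are assumptions, and the function name is hypothetical.

```javascript
// Mix of l1 and l2 penalties controlled by l1Ratio (common formulation, assumed here).
function elasticNetPenalty(weights, l1Ratio) {
  const l1 = weights.reduce((sum, w) => sum + Math.abs(w), 0);
  const l2 = weights.reduce((sum, w) => sum + w * w, 0);
  return l1Ratio * l1 + (1 - l1Ratio) * 0.5 * l2;
}

const weights = [1, -2];
console.log(elasticNetPenalty(weights, 1));   // 3    (pure l1)
console.log(elasticNetPenalty(weights, 0));   // 2.5  (pure l2)
console.log(elasticNetPenalty(weights, 0.5)); // 2.75 (equal mix)
```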

    tol - number

    Tolerance for stopping criteria.

    Default: 0.0001

    reg - number

    This is the inverse of regularization strength. Smaller values result in stronger regularization.

    Default: 1

    useClassWeights - boolean

    If true, a weight is applied to each class inversely proportional to its frequency.

    Default: false
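
    As an illustration of weights that are inversely proportional to class frequency, the sketch below uses the widely used "balanced" heuristic, total / (number of classes * class count). Whether the job uses exactly this formula is an assumption, and the function name is hypothetical.

```javascript
// Weight each class by total / (numClasses * classCount).
function classWeights(labels) {
  const counts = {};
  for (const label of labels) {
    counts[label] = (counts[label] || 0) + 1;
  }
  const numClasses = Object.keys(counts).length;
  const weights = {};
  for (const label of Object.keys(counts)) {
    weights[label] = labels.length / (numClasses * counts[label]);
  }
  return weights;
}

// Nine "shoes" samples and one "hats" sample.
const labels = Array(9).fill("shoes").concat(["hats"]);
console.log(classWeights(labels));
```

    The rare class receives the larger weight, so its few samples contribute more to the loss.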

    solver - string

    The optimization algorithm to use to fit to the data. LBFGS and SAGA are good initial choices.

    Default: lbfgs

    Allowed values: lbfgs, newton-cg, liblinear, sag, saga

    multiClass - string

    Whether to train a binary classifier for each class or use a multinomial loss. ‘auto’ selects ‘ovr’ if the data is binary or if the solver is ‘liblinear’, and otherwise selects ‘multinomial’.

    Default: auto

    Allowed values: auto, ovr, multinomial

    maxIter - integer

    Maximum number of iterations taken for the optimization algorithm to converge.

    >= 1

    exclusiveMinimum: false

    Default: 200

    textLayersSizes - string

    Sizes of hidden layers before the embedding layer for text. Specify as a bracketed list of numbers, for example [256, 128] for two layers or [256] for one layer. Leave blank if no hidden layers are required.

    Match pattern: ^(\[(((\d)*,\s*)*(\d+)+)?\])?$

    Default: [256, 128]

    labelLayersSizes - string

    Sizes of hidden layers before the embedding layer for classes. Specify as a bracketed list of numbers, for example [256, 128] for two layers or [256] for one layer. Leave blank if no hidden layers are required.

    Match pattern: ^(\[(((\d)*,\s*)*(\d+)+)?\])?$

    Default: []

    embeddingsSize - integer

    Dimension size of final embedding vectors for text and class.

    >= 1

    exclusiveMinimum: false

    Default: 100

    regTerm - number

    Scale of L2 regularization

    Default: 0.002

    dropout - number

    Probability for applying dropout regularization.

    Default: 0.2

    embeddingReg - number

    How strongly the algorithm penalizes the maximum similarity between embeddings of different classes

    Default: 0.8

    minBatchSize - integer

    The smallest batch size with which to start training. Batch size is increased linearly every epoch, up to the maximum batch size specified.

    >= 1

    exclusiveMinimum: false

    Default: 64

    maxBatchSize - integer

    The largest batch size to use during training. Batch size is increased linearly every epoch, up to the maximum batch size specified.

    >= 1

    exclusiveMinimum: false

    Default: 128
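
    A linear batch-size schedule between minBatchSize and maxBatchSize might look like the sketch below. The exact interpolation and rounding used by the job are assumptions, and the function name is hypothetical.

```javascript
// Interpolate linearly from minBatchSize (first epoch) to maxBatchSize (last epoch).
function batchSizeForEpoch(epoch, numEpochs, minBatchSize, maxBatchSize) {
  if (numEpochs <= 1) return minBatchSize;
  const fraction = epoch / (numEpochs - 1); // 0 at the first epoch, 1 at the last
  return Math.round(minBatchSize + fraction * (maxBatchSize - minBatchSize));
}

// Defaults from this reference: 40 epochs, batch sizes 64 to 128.
console.log(batchSizeForEpoch(0, 40, 64, 128));  // 64
console.log(batchSizeForEpoch(39, 40, 64, 128)); // 128
```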

    numEpochs - integer

    Number of epochs for which to train the model.

    >= 1

    exclusiveMinimum: false

    Default: 40

    muPos - number

    How similar the algorithm should try to make embedding vectors for correct classes. The algorithm tries to maximize these similarities so that they are higher than the value specified here.

    <= 1

    exclusiveMaximum: false

    Default: 0.8

    muNeg - number

    How similar the algorithm should try to make embedding vectors for negative classes. The algorithm tries to minimize these similarities so that they are lower than the value specified here.

    <= 1

    exclusiveMaximum: false

    Default: -0.4

    similarityType - string

    Type of similarity to use to compare the embedded vectors.

    Default: cosine

    Allowed values: cosine, inner

    numNeg - integer

    Number of negative classes to use during training to minimize their similarity to the input text. Should be less than the total number of classes.

    >= 1

    exclusiveMinimum: false

    useMaxNegSim - boolean

    If true, only the maximum similarity for negative classes will be minimized. If unchecked, all negative similarities will be used.

    Default: true

    modelReplicas - integer

    How many replicas of the model should be deployed by Seldon Core

    >= 1

    exclusiveMinimum: false

    Default: 1

    type - string, required

    Default: argo-classification

    Allowed values: argo-classification