    Fusion 5.12

    Data Augmentation Job

    Use this job to augment training and/or testing data for use with other jobs, such as Smart Answers, Classification, and Recommender jobs.

    This job takes in data specified by the user, performs one or more of the specified augmentation tasks, and writes the output back to Solr or cloud storage.

    The Benefits of Augmentation Tasks

    Augmentation tasks can improve the models trained on your data in two ways. Adding the augmented data back into the training set increases the quantity of training data when there isn’t enough. Alternatively, training the models on the source data and then testing them on the augmented data lets you evaluate their robustness. Both approaches introduce variation that makes the models better equipped to handle different types of text. For more details on this process, see Data Augmentation.

    The amount of extra augmented data generated will depend on the task and the parameters used. In an ideal scenario with one task applied to one field and little to no record filtering, you can expect to double the amount of the original data.

    Task Description

    Each task supports a variety of languages. Refer to the description of each task for details.

    • Backtranslation

      Translates the input data into one or more intermediate languages before translating it back to the source language. The process introduces changes in the syntax and grammar of the input text without changing the semantics. Because this task uses a deep learning model, Facebook’s M2M-100, to perform translations, a GPU is recommended for fast processing.

      If the backtranslation is of poor quality, try increasing the beam size. However, this will consume more memory and take more time. You could also try changing the intermediate languages to ones that are closer to the source language. For example, if your source language is Korean, translating to Chinese and/or Japanese and back might give you better results than translating to Spanish.

      Use the synonym substitution task as an alternative if you’re unable to provision the necessary hardware or this job is taking too long. Note that synonym substitution does not support the same set of languages.

      Supported Languages: Chinese, Dutch, English, French, German, Hebrew, Italian, Japanese, Korean, Polish, Spanish, Ukrainian

    Backtranslation with Korean text may result in errors if run on GKE with Kubernetes master version v1.16.15-gke.4901, kernel version 4.19.112+, and container runtime version docker://19.3.1 on Google’s Container-Optimized OS for Docker. To resolve this, upgrade to a newer Kubernetes master version, kernel, and container runtime.
    • Synonym Substitution

      Takes in the input text and substitutes some words with synonyms derived from the included WordNet/PPDB dictionaries or from user-supplied dictionaries. User-supplied dictionaries must be submitted in the Lucene/Solr synonym format, as shown in the example below.

      Example synonyms.txt file:

      #some test synonym mappings unlikely to appear in real input text
      aaa => aaaa
      bbb => bbbb1 bbbb2
      ccc => cccc1,cccc2
      a\=>a => b\=>b
      a\,a => b\,b
      fooaaa,baraaa,bazaaa
      
      # Some synonym groups specific to this example
      GB,gib,gigabyte,gigabytes
      MB,mib,megabyte,megabytes
      Television, Televisions, TV, TVs
      #notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
      #after us won't split it into two words.
      
      # Synonym mappings can be used for spelling correction too
      pixima => pixma

      Supported Languages: Chinese, Dutch, English, French, German, Hebrew, Italian, Japanese, Polish, Spanish

    Boosted synonyms are not supported. This synonym mapping file should be uploaded to the blob store, and a blob store path should then be passed to the job.
    • Keystroke Misspelling

      Simulates typos one might make based on the layout of the keyboard. For example, someone typing in English on a QWERTY keyboard layout might accidentally replace the “y” with a “t” while typing the word “keyboard”, because “y” and “t” are next to each other on the keyboard. Currently, only QWERTY keyboard layouts are supported.

      The user can provide their own keyboard mapping as a JSON file uploaded to the Fusion blob store. The JSON file should be in the following format: {"a": "x", "b": "v", …}.

      Supported Languages: Dutch, English, French, German, Hebrew, Italian, Polish, Spanish, Ukrainian

    • Split word

      Randomly splits words by inserting a space at a random point in the word.

      Supported Languages: Dutch, English, French, German, Italian, Polish, Spanish

    Use this job to perform Text Augmentation

    id - string (required)

    The ID for this job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_)

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

    sparkConfig - array[object]

    Provide additional key/value pairs to be injected into the training JSON map at runtime. Values will be inserted as-is, so use " to surround string values

    object attributes:{key required : {
     display name: Parameter Name
     type: string
    }
    value : {
     display name: Parameter Value
     type: string
    }
    }
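
    For example, sparkConfig is commonly used to tune Spark resources for the job. The sketch below uses standard Spark property names, but the values are illustrative only; the settings your cluster needs may differ.

      "sparkConfig": [
        { "key": "spark.executor.memory", "value": "6g" },
        { "key": "spark.executor.cores", "value": "2" }
      ]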

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes:{key required : {
     display name: Parameter Name
     type: string
    }
    value : {
     display name: Parameter Value
     type: string
    }
    }

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:{key required : {
     display name: Parameter Name
     type: string
    }
    value : {
     display name: Parameter Value
     type: string
    }
    }

    trainingCollection - string (required)

    Solr collection or cloud storage path where training data is present.

    >= 1 characters

    trainingFormat - string (required)

    The format of the training data - solr, parquet etc.

    >= 1 characters

    trainingDataFilterQuery - string

    Solr or SQL query to filter training data. Use a Solr query when a Solr collection is specified in the Training Path; use a SQL query when a cloud storage location is specified. The table name for SQL queries is `spark_input`.
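
    For example, the filter might look like either of the following, depending on the training source. The field names (language_s, language) are hypothetical; spark_input is the fixed table name noted above.

      Solr collection as training input:
      "trainingDataFilterQuery": "language_s:en"

      Cloud storage as training input:
      "trainingDataFilterQuery": "SELECT * FROM spark_input WHERE language = 'en'"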

    randomSeed - integer

    Keeping this seed constant makes the job’s pseudorandom behavior deterministic across runs.

    Default: 12345

    trainingSampleFraction - number

    Choose a fraction of the data for training.

    <= 1

    exclusiveMaximum: false

    Default: 1

    batchSize - string

    If writing to Solr, this field defines the batch size for documents to be pushed to Solr.

    Default: 15000

    outputCollection - string (required)

    Output collection to store generated augmented data.

    >= 1 characters

    outputFormat - string (required)

    The format of the output data - solr, parquet etc.

    >= 1 characters

    partitionFields - string

    If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

    secretName - string

    Name of the secret used to access cloud storage as defined in the K8s namespace

    >= 1 characters

    backTranslations - array[object]

    Augment data via translation to a different language and then back to the original language. A chain of languages can be used for translation. Works at the sentence level for medium-to-long text. A GPU is recommended and will be used when available.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    intermediateLanguage : {
     display name: Intermediate Language
     type: string
    }
    batchSize : {
     display name: Batch Size
     type: integer
    }
    beamSize : {
     display name: Beam Size
     type: integer
    }
    minSentenceLength : {
     display name: Min translation length (tokens)
     type: integer
    }
    maxSentenceLength : {
     display name: Max translation length (tokens)
     type: integer
    }
    }
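
    As a sketch, a backTranslations entry might look like the following. The field name question_t and the numeric values are illustrative, and the language identifiers are assumptions about the accepted format; use whatever language values the job’s configuration UI accepts. The intermediate language follows the Korean example above.

      "backTranslations": [
        {
          "fieldname": "question_t",
          "inputLanguage": "korean",
          "intermediateLanguage": "japanese",
          "batchSize": 32,
          "beamSize": 5,
          "minSentenceLength": 5,
          "maxSentenceLength": 128
        }
      ]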

    keyStrokeMisspellings - array[object]

    Augment data via insertion, substitution, swapping and deletion of characters based on keyboard layout. Useful for short text.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    minCharAugment : {
     display name: Minimum Chars to Augment
     type: integer
    }
    maxCharAugment : {
     display name: Maximum Chars to Augment
     type: integer
    }
    minWordsToAugment : {
     display name: Min words to Augment
     type: integer
    }
    maxWordsToAugment : {
     display name: Max words to Augment
     type: integer
    }
    wordPercentageToAugment : {
     display name: Percentage words to Augment
     type: number
    }
    keywordsBlobName : {
     display name: Keystroke Mapping
     type: string
    }
    }
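
    A keyStrokeMisspellings entry might look like the following sketch. The field name query_t, the language value, and the numbers are illustrative; keywordsBlobName is only needed if you uploaded a custom keyboard mapping JSON to the blob store.

      "keyStrokeMisspellings": [
        {
          "fieldname": "query_t",
          "inputLanguage": "english",
          "minCharAugment": 1,
          "maxCharAugment": 2,
          "wordPercentageToAugment": 0.3
        }
      ]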

    synonymSubstitutions - array[object]

    Augment data via substituting words using synonyms from WordNet or a user-supplied dictionary. Useful for short, medium, and long text. Faster and less resource-intensive than backtranslation.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    minWordsToAugment : {
     display name: Min words to Augment
     type: integer
    }
    maxWordsToAugment : {
     display name: Max words to Augment
     type: integer
    }
    wordPercentageToAugment : {
     display name: Percentage of words to Augment
     type: number
    }
    stopwordsBlobName : {
     display name: Synonym Dictionary Name
     type: string
    }
    }
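
    A synonymSubstitutions entry might look like the following sketch. The field name description_t, the language value, and the percentage are illustrative; stopwordsBlobName is only needed if you uploaded your own synonym dictionary, and its value should be the blob store path of that file.

      "synonymSubstitutions": [
        {
          "fieldname": "description_t",
          "inputLanguage": "english",
          "wordPercentageToAugment": 0.2,
          "stopwordsBlobName": "synonyms.txt"
        }
      ]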

    splitWords - array[object]

    Augment data via splitting some words. Useful for short, medium and long text.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    minWordLength : {
     display name: Minimum Word Length
     type: integer
    }
    minWordsToAugment : {
     display name: Min words to Augment
     type: integer
    }
    maxWordsToAugment : {
     display name: Max words to Augment
     type: integer
    }
    wordPercentageToAugment : {
     display name: Percentage of words to Augment
     type: number
    }
    }
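
    A splitWords entry might look like the following sketch; the field name title_t, the language value, and the numbers are illustrative.

      "splitWords": [
        {
          "fieldname": "title_t",
          "inputLanguage": "english",
          "minWordLength": 6,
          "wordPercentageToAugment": 0.1
        }
      ]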

    includeOriginalData - boolean (required)

    When checked, the original data will be included in the augmented dataset.

    Default: true

    type - string (required)

    Default: argo-data-augmentation

    Allowed values: argo-data-augmentation
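
    Putting the required parameters together, a minimal job configuration might look like the sketch below. The collection names, field name, and the synonymSubstitutions entry are illustrative; only id, type, trainingCollection, trainingFormat, outputCollection, outputFormat, and includeOriginalData are required.

      {
        "id": "faq-data-augmentation",
        "type": "argo-data-augmentation",
        "trainingCollection": "faq_training",
        "trainingFormat": "solr",
        "outputCollection": "faq_training_augmented",
        "outputFormat": "solr",
        "includeOriginalData": true,
        "synonymSubstitutions": [
          { "fieldname": "question_t", "inputLanguage": "english" }
        ]
      }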