    Fusion 5.12

    Data Augmentation Job

    Use this job to augment training and/or testing data for use with other jobs, such as Smart Answers, Classification, and Recommender jobs.

    This job takes in data specified by the user, performs one or more of the specified augmentation tasks, and writes the output back to Solr or cloud storage.

    The Benefits of Augmentation Tasks

    Augmentation tasks can improve the models trained on your data in two ways. Adding the augmented data back into the training set increases the quantity of training data when there isn’t enough. Alternatively, training the models on the source data and then testing them on the augmented data lets you evaluate their robustness. Both approaches introduce variation that makes the models better equipped to handle different types of text. For more details on this process, see Data Augmentation.

    The amount of extra augmented data generated will depend on the task and the parameters used. In an ideal scenario with one task applied to one field and little to no record filtering, you can expect to double the amount of the original data.

    Task Description

    Each task supports a variety of languages. Refer to the description of each task for details.

    • Backtranslation

      Translates the input data into one or more intermediate languages before translating it back to the source language. The process introduces changes in the syntax and grammar of the input text without changing the semantics. Because this task uses a deep learning model, Facebook’s M2M-100, to perform translations, a GPU is recommended for fast processing.

      If the backtranslation is of poor quality, try increasing the beam size. However, this will consume more memory and take more time. You could also try changing the intermediate languages to ones that are closer to the source language. For example, if your source language is Korean, translating to Chinese and/or Japanese and back might give you better results than translating to Spanish.

      Use the synonym substitution task as an alternative if you’re unable to provision the necessary hardware or this job is taking too long. Note that synonym substitution does not support the same set of languages.

      Supported Languages: Chinese, Dutch, English, French, German, Hebrew, Italian, Japanese, Korean, Polish, Spanish, Ukrainian

    Backtranslation with Korean text may result in errors if run on GKE with Kubernetes master version v1.16.15-gke.4901, kernel version 4.19.112+, and container runtime version docker://19.3.1 on Google’s Container-Optimized OS for Docker. To resolve this, upgrade to a newer Kubernetes master version, kernel, and container runtime.
    • Synonym Substitution

      Takes in the input text and substitutes some words with synonyms derived from the included WordNet/PPDB dictionaries or from user-supplied dictionaries. User-supplied dictionaries must be submitted in the Lucene/Solr synonym format, as shown in the example below.

      Example synonyms.txt file:

      #some test synonym mappings unlikely to appear in real input text
      aaa => aaaa
      bbb => bbbb1 bbbb2
      ccc => cccc1,cccc2
      a\=>a => b\=>b
      a\,a => b\,b
      fooaaa,baraaa,bazaaa
      
      # Some synonym groups specific to this example
      GB,gib,gigabyte,gigabytes
      MB,mib,megabyte,megabytes
      Television, Televisions, TV, TVs
      #notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
      #after us won't split it into two words.
      
      # Synonym mappings can be used for spelling correction too
      pixima => pixma

      Supported Languages: Chinese, Dutch, English, French, German, Hebrew, Italian, Japanese, Polish, Spanish

    Boosted synonyms are not supported. This synonym mapping file should be uploaded to the blob store, and a blob store path should then be passed to the job.
    • Keystroke Misspelling

      Simulates typos one might make based on the layout of the keyboard. For example, someone typing in English on a QWERTY keyboard layout might accidentally replace the “y” with a “t” while typing the word “keyboard”, because “y” and “t” are next to each other on the keyboard. Currently, only QWERTY keyboard layouts are supported.

      The user can provide their own keyboard mapping as a JSON file uploaded to the Fusion blob store. The JSON file should be in the following format: {"a": "x", "b": "v", …}.

      Supported Languages: Dutch, English, French, German, Hebrew, Italian, Polish, Spanish, Ukrainian

    • Split word

      Randomly splits words by inserting a space at a random point in the word.

      Supported Languages: Dutch, English, French, German, Italian, Polish, Spanish

    Use this job to perform Text Augmentation

    id - string (required)

    The ID for this job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_)

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

    sparkConfig - array[object]

    Provide additional key/value pairs to be injected into the training JSON map at runtime. Values will be inserted as-is, so use " to surround string values

    object attributes:{key required : {
     display name: Parameter Name
     type: string
    }
    value : {
     display name: Parameter Value
     type: string
    }
    }
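
    For example, sparkConfig is commonly used to tune Spark resources for the job. The sketch below uses standard Spark property names, but the values are illustrative only; the settings your cluster needs may differ.

      "sparkConfig": [
        { "key": "spark.executor.memory", "value": "6g" },
        { "key": "spark.executor.cores", "value": "2" }
      ]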

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes:{key required : {
     display name: Parameter Name
     type: string
    }
    value : {
     display name: Parameter Value
     type: string
    }
    }

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:{key required : {
     display name: Parameter Name
     type: string
    }
    value : {
     display name: Parameter Value
     type: string
    }
    }

    trainingCollection - string (required)

    Solr collection or cloud storage path where training data is present.

    >= 1 characters

    trainingFormat - string (required)

    The format of the training data - solr, parquet etc.

    >= 1 characters

    trainingDataFilterQuery - string

    Solr or SQL query to filter training data. Use a Solr query when a Solr collection is specified in the Training Path; use a SQL query when a cloud storage location is specified. The table name for SQL queries is `spark_input`.
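
    For example, the filter might look like either of the following, depending on the training source. The field names (language_s, language) are hypothetical; spark_input is the fixed table name noted above.

      Solr collection as training input:
      "trainingDataFilterQuery": "language_s:en"

      Cloud storage as training input:
      "trainingDataFilterQuery": "SELECT * FROM spark_input WHERE language = 'en'"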

    randomSeed - integer

    Keeping this seed constant makes the job’s pseudorandom behavior deterministic across runs.

    Default: 12345

    trainingSampleFraction - number

    Choose a fraction of the data for training.

    <= 1

    exclusiveMaximum: false

    Default: 1

    batchSize - string

    If writing to Solr, this field defines the batch size for documents to be pushed to Solr.

    Default: 15000

    outputCollection - string (required)

    Output collection to store generated augmented data.

    >= 1 characters

    outputFormat - string (required)

    The format of the output data - solr, parquet etc.

    >= 1 characters

    partitionFields - string

    If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

    secretName - string

    Name of the secret used to access cloud storage as defined in the K8s namespace

    >= 1 characters

    backTranslations - array[object]

    Augment data via translation to a different language and then back to the original language. A chain of languages can be used for translation. Works at the sentence level for medium-to-long text. A GPU is recommended and will be used when available.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    intermediateLanguage : {
     display name: Intermediate Language
     type: string
    }
    batchSize : {
     display name: Batch Size
     type: integer
    }
    beamSize : {
     display name: Beam Size
     type: integer
    }
    minSentenceLength : {
     display name: Min translation length (tokens)
     type: integer
    }
    maxSentenceLength : {
     display name: Max translation length (tokens)
     type: integer
    }
    }
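
    As a sketch, a backTranslations entry might look like the following. The field name question_t and the numeric values are illustrative, and the language identifiers are assumptions about the accepted format; use whatever language values the job’s configuration UI accepts. The intermediate language follows the Korean example above.

      "backTranslations": [
        {
          "fieldname": "question_t",
          "inputLanguage": "korean",
          "intermediateLanguage": "japanese",
          "batchSize": 32,
          "beamSize": 5,
          "minSentenceLength": 5,
          "maxSentenceLength": 128
        }
      ]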

    keyStrokeMisspellings - array[object]

    Augment data via insertion, substitution, swapping and deletion of characters based on keyboard layout. Useful for short text.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    minCharAugment : {
     display name: Minimum Chars to Augment
     type: integer
    }
    maxCharAugment : {
     display name: Maximum Chars to Augment
     type: integer
    }
    minWordsToAugment : {
     display name: Min words to Augment
     type: integer
    }
    maxWordsToAugment : {
     display name: Max words to Augment
     type: integer
    }
    wordPercentageToAugment : {
     display name: Percentage words to Augment
     type: number
    }
    keywordsBlobName : {
     display name: Keystroke Mapping
     type: string
    }
    }
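
    A keyStrokeMisspellings entry might look like the following sketch. The field name query_t, the language value, and the numbers are illustrative; keywordsBlobName is only needed if you uploaded a custom keyboard mapping JSON to the blob store.

      "keyStrokeMisspellings": [
        {
          "fieldname": "query_t",
          "inputLanguage": "english",
          "minCharAugment": 1,
          "maxCharAugment": 2,
          "wordPercentageToAugment": 0.3
        }
      ]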

    synonymSubstitutions - array[object]

    Augment data via substituting words using synonyms from WordNet or a user-supplied dictionary. Useful for short, medium, and long text. Faster and less resource-intensive than backtranslation.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    minWordsToAugment : {
     display name: Min words to Augment
     type: integer
    }
    maxWordsToAugment : {
     display name: Max words to Augment
     type: integer
    }
    wordPercentageToAugment : {
     display name: Percentage of words to Augment
     type: number
    }
    stopwordsBlobName : {
     display name: Synonym Dictionary Name
     type: string
    }
    }
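
    A synonymSubstitutions entry might look like the following sketch. The field name description_t, the language value, and the percentage are illustrative; stopwordsBlobName is only needed if you uploaded your own synonym dictionary, and its value should be the blob store path of that file.

      "synonymSubstitutions": [
        {
          "fieldname": "description_t",
          "inputLanguage": "english",
          "wordPercentageToAugment": 0.2,
          "stopwordsBlobName": "synonyms.txt"
        }
      ]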

    splitWords - array[object]

    Augment data via splitting some words. Useful for short, medium and long text.

    object attributes:{fieldname required : {
     display name: Field Name
     type: string
    }
    inputLanguage required : {
     display name: Input data Language
     type: string
    }
    minWordLength : {
     display name: Minimum Word Length
     type: integer
    }
    minWordsToAugment : {
     display name: Min words to Augment
     type: integer
    }
    maxWordsToAugment : {
     display name: Max words to Augment
     type: integer
    }
    wordPercentageToAugment : {
     display name: Percentage of words to Augment
     type: number
    }
    }
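
    A splitWords entry might look like the following sketch; the field name title_t, the language value, and the numbers are illustrative.

      "splitWords": [
        {
          "fieldname": "title_t",
          "inputLanguage": "english",
          "minWordLength": 6,
          "wordPercentageToAugment": 0.1
        }
      ]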

    includeOriginalData - boolean (required)

    When checked, the original data will be included in the augmented dataset.

    Default: true

    type - string (required)

    Default: argo-data-augmentation

    Allowed values: argo-data-augmentation
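
    Putting the required parameters together, a minimal job configuration might look like the sketch below. The collection names, field name, and the synonymSubstitutions entry are illustrative; only id, type, trainingCollection, trainingFormat, outputCollection, outputFormat, and includeOriginalData are required.

      {
        "id": "faq-data-augmentation",
        "type": "argo-data-augmentation",
        "trainingCollection": "faq_training",
        "trainingFormat": "solr",
        "outputCollection": "faq_training_augmented",
        "outputFormat": "solr",
        "includeOriginalData": true,
        "synonymSubstitutions": [
          { "fieldname": "question_t", "inputLanguage": "english" }
        ]
      }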