Use this job to compute item similarities based on item content, such as product descriptions.
id - string, required
The ID for this job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-) and underscore (_). The ID must start with a letter.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
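The length limit and match pattern above can be checked before submitting a job. The following is a minimal sketch: the helper name `is_valid_job_id` is hypothetical, and it assumes the pattern must match the whole ID rather than a substring.

```python
import re

# Pattern and length limit copied from the field description above.
JOB_ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_JOB_ID_LENGTH = 63


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id fits the documented pattern and length limit.

    Assumption: the pattern is anchored, so the entire ID must match.
    """
    if len(job_id) > MAX_JOB_ID_LENGTH:
        return False
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


for candidate in ["content-recs_01", "1-starts-with-digit", "has space", "a" * 64]:
    print(candidate[:20], is_valid_job_id(candidate))
```

Note that the pattern accepts digits after the first character, and that the first character must be a letter.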
sparkConfig - array[object]
Provide additional key/value pairs to be injected into the training JSON map at runtime. Values are inserted as-is, so surround string values with double quotes (").
object attributes: {
  key (required): {
    display name: Parameter Name
    type: string
  }
  value: {
    display name: Parameter Value
    type: string
  }
}
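Because values are inserted as-is, a string value has to carry its own double quotes. The sketch below builds the array of key/value objects from a plain dictionary, using `json.dumps` to add the quotes. The helper name and the parameter names in the example are hypothetical and chosen only for illustration.

```python
import json
from typing import Any, Dict, List


def to_key_value_params(params: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert a dict into the list of {"key", "value"} objects described above.

    Each value is serialized to JSON text, so strings gain surrounding
    double quotes while numbers and booleans are left bare.
    """
    return [{"key": key, "value": json.dumps(value)} for key, value in params.items()]


# Hypothetical parameter names, for illustration only.
pairs = to_key_value_params({"example.string.option": "abc", "example.numeric.option": 5})
print(pairs)
```

The first entry's value is the five-character string `"abc"` including its quotes, while the second is the bare text `5`.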
writeOptions - array[object]
Options used when writing output to Solr or other sources
object attributes: {
  key (required): {
    display name: Parameter Name
    type: string
  }
  value: {
    display name: Parameter Value
    type: string
  }
}
readOptions - array[object]
Options used when reading input from Solr or other sources.
object attributes: {
  key (required): {
    display name: Parameter Name
    type: string
  }
  value: {
    display name: Parameter Value
    type: string
  }
}
outputBatchSize - string
Number of documents per batch when pushing results to Solr.
Default: 15000
unidecodeText - boolean
Select to unidecode the text, that is, to transliterate Unicode characters to their closest ASCII equivalents.
Default: true
lowercaseText - boolean
Select if you want the text to be lowercased.
Default: true
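The effect of the two text normalization flags can be illustrated with the standard library. This is only an approximation, assuming that "unidecoded" means transliteration to ASCII in the style of the third-party Unidecode package. The function name and flag arguments are hypothetical, and the job's actual preprocessing may differ.

```python
import unicodedata


def normalize_text(text: str, unidecode_text: bool = True, lowercase_text: bool = True) -> str:
    """Approximate the unidecodeText and lowercaseText options with the stdlib."""
    if unidecode_text:
        # Decompose accented characters, then drop anything that is not ASCII.
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    if lowercase_text:
        text = text.lower()
    return text


print(normalize_text("Café Crème Brûlée"))  # prints "cafe creme brulee"
```

Unlike a full transliteration library, this stdlib approach silently drops characters that have no ASCII decomposition.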
vectorizationUseDl - boolean
Select to use deep learning for vectorization. You can also select the other methods, in which case an ensemble is used.
Default: true
vectorizationUseFasttext - boolean
Select to use Word2Vec for vectorization. You can also select the other methods, in which case an ensemble is used. Custom embeddings are learned, which is useful for domain-specific jargon.
vectorizationUseTfidf - boolean
Select to use Tf-Idf for vectorization. You can also select the other methods, in which case an ensemble is used.
vectorizationDlEnsembleWeight - number
Ensemble weight for deep-learning-based vectorization when more than one vectorization method is selected.
Default: 1
vectorizationFasttextVectorsSize - integer
Word vector dimensions for Word2Vec vectorizer.
>= 1
exclusiveMinimum: false
Default: 150
vectorizationFasttextWindowSize - integer
The window size (context words from [-window, window]) for Word2Vec.
>= 1
exclusiveMinimum: false
Default: 5
vectorizationFasttextEpochs - integer
Number of epochs to train custom Word2Vec embeddings.
>= 1
exclusiveMinimum: false
Default: 15
vectorizationFasttextMaxVocabSize - integer
Maximum number of tokens to keep in the vocabulary. Less frequent tokens are omitted.
>= 1
exclusiveMinimum: false
vectorizationFasttextEnsembleWeight - number
Ensemble weight for Fasttext-based vectorization when more than one vectorization method is selected.
Default: 1
vectorizationTfidfUseCharacters - boolean
Whether to use characters instead of words as tokens. By default, words are used.
vectorizationTfidfFilterStopwords - boolean
Whether to filter out stopwords before generating Tf-Idf weights.
Default: true
vectorizationTfidfMinNgram - integer
Minimum n-gram size to use.
>= 1
exclusiveMinimum: false
Default: 1
vectorizationTfidfMaxNgram - integer
Maximum n-gram size to use.
>= 1
exclusiveMinimum: false
Default: 3
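To make the minimum and maximum n-gram sizes concrete, the sketch below lists the word n-grams produced for sizes 1 through 3, the default range. The whitespace tokenizer and the function name are assumptions made for illustration. The job's actual tokenization is not described here.

```python
from typing import List


def word_ngrams(text: str, min_n: int = 1, max_n: int = 3) -> List[str]:
    """Return all word n-grams of text with sizes from min_n to max_n inclusive."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams("red running shoes"))
# ['red', 'running', 'shoes', 'red running', 'running shoes', 'red running shoes']
```

Raising the maximum size captures longer phrases but enlarges the vocabulary, which is where the maximum vocabulary size setting becomes relevant.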
vectorizationTfIdfMaxVocabSize - integer
Maximum number of tokens to keep in the vocabulary. Less frequent tokens are omitted.
>= 1
exclusiveMinimum: false
vectorizationTfidfEnsembleWeight - number
Ensemble weight for Tf-Idf-based vectorization when more than one vectorization method is selected.
Default: 1
topKAnn - integer
Used to fetch additional recommendations so that the value specified for the Number of User Recommendations to Compute is most likely satisfied after filtering. This is normally set to 10 * (number of item recommendations to compute).
>= 1
exclusiveMinimum: false
Default: 100
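The rule of thumb above is simple arithmetic. A minimal sketch follows, assuming that "number of item recommendations to compute" refers to `numSimsPerItem`. The helper name is hypothetical.

```python
def suggested_top_k_ann(num_sims_per_item: int, factor: int = 10) -> int:
    """Apply the 10x rule of thumb, respecting the documented minimum of 1."""
    return max(1, factor * num_sims_per_item)


print(suggested_top_k_ann(10))  # prints 100
```

With the default `numSimsPerItem` of 10, the rule yields 100, which matches the default for `topKAnn`.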
jobRunName - string
Identifier for this job run. Use it to filter recommendations from particular runs
trainingCollection - string, required
Solr collection or cloud storage path where training data is present.
>= 1 characters
trainingFormat - string, required
The format of the training data, such as solr or parquet.
>= 1 characters
Default: solr
secretName - string
Name of the secret used to access cloud storage as defined in the K8s namespace
>= 1 characters
outputCollection - string, required
Solr collection or cloud storage path where output data is to be written.
outputFormat - string, required
The format of the output data, such as solr or parquet.
>= 1 characters
Default: solr
partitionFields - string
When writing to non-Solr sources, a comma-delimited list of column names used to partition the dataframe before writing to the external output.
numSimsPerItem - integer
Number of recommendations that will be saved per item.
>= 1
exclusiveMinimum: false
Default: 10
deleteOldRecs - boolean
Whether to delete previous recommendations. If this box is unchecked, old recommendations are not deleted and new recommendations are appended with a different Job ID, so both sets of recommendations are kept in the same collection. This option only works when the output is written to Solr.
Default: true
excludeFromDeleteFilter - string
If the 'Delete Old Recommendations' flag is enabled, then use this query filter to identify existing recommendation docs to exclude from delete. The filter should identify recommendation docs you want to keep.
metadataCategoryFields - array[string]
These fields will be used for item-item evaluation and for determining if the recommendation pair belongs to the same category.
trainingDataFilterQuery - string
Solr or SQL query to filter training data. Use a Solr query when a Solr collection is specified in Training Path. Use a SQL query when a cloud storage location is specified. The table name for SQL is `spark_input`.
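The two filter styles might look as follows in a job configuration, written here as Python dictionaries. The collection name, bucket path, and field names are hypothetical, and the full `SELECT` form of the SQL filter is an assumption. Only the table name `spark_input` comes from the description above.

```python
# Training data in a Solr collection: the filter is a Solr query.
solr_source = {
    "trainingCollection": "products",              # hypothetical collection name
    "trainingFormat": "solr",
    "trainingDataFilterQuery": "in_stock_b:true",  # hypothetical Solr field
}

# Training data in cloud storage: the filter is SQL against `spark_input`.
cloud_source = {
    "trainingCollection": "gs://example-bucket/products/",  # hypothetical path
    "trainingFormat": "parquet",
    "trainingDataFilterQuery": "SELECT * FROM spark_input WHERE in_stock = true",
}

for source in (solr_source, cloud_source):
    print(source["trainingFormat"], "->", source["trainingDataFilterQuery"])
```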
trainingSampleFraction - number
Choose a fraction of the data for training.
<= 1
exclusiveMaximum: false
Default: 1
itemIdField - string, required
Name of the field containing stored item IDs.
>= 1 characters
Default: item_id_s
contentField - array[string], required
Names of the fields containing item content, such as product descriptions.
randomSeed - integer
Seed for the pseudorandom number generator. Keep this value constant to get reproducible results across runs.
Default: 12345
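What a constant seed buys can be shown with a small, self-contained sketch: drawing a random fraction of items twice with the same seed yields the same sample. This illustrates the general idea only. Whether the job applies `randomSeed` to `trainingSampleFraction` sampling in this way is an assumption, and the function name is hypothetical.

```python
import random
from typing import Iterable, List


def sample_items(item_ids: Iterable[str], fraction: float, seed: int = 12345) -> List[str]:
    """Keep each item with probability `fraction`, using a seeded generator."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [item_id for item_id in item_ids if rng.random() < fraction]


ids = [f"item-{i}" for i in range(1000)]
first = sample_items(ids, fraction=0.1)
second = sample_items(ids, fraction=0.1)
print(first == second)  # prints True: same seed, same sample
```

Changing the seed generally changes which items are drawn, which makes results harder to compare between runs.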
itemMetadataFields - array[string]
List of item metadata fields to include in the recommendation output documents.
vectorizationDlBatchSize - integer
Number of items per batch when computing encodings. Batching helps avoid running out of memory on the hardware.
>= 1
exclusiveMinimum: false
performANN - boolean
Whether to perform approximate nearest neighbor (ANN) search. ANN drastically reduces training time, at the cost of a small drop in accuracy. Disable only if the dataset is very small.
Default: true
maxNeighbors - integer
When ANN is enabled, the number of potential neighbors considered during the indexing phase. Higher values lead to better recall and shorter retrieval times, at the expense of longer indexing time. Reasonable range: 5 to 100.
>= 5
<= 100
exclusiveMinimum: false
exclusiveMaximum: false
searchNN - integer
When ANN is enabled, the depth of search used to find neighbors. Higher values improve recall at the expense of longer retrieval time. Reasonable range: 100 to 2000.
>= 100
<= 2000
exclusiveMinimum: false
exclusiveMaximum: false
indexNN - integer
When ANN is enabled, the depth of the constructed index. Higher values improve recall at the expense of longer indexing time. Reasonable range: 100 to 2000.
>= 100
<= 2000
exclusiveMinimum: false
exclusiveMaximum: false
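The inclusive ranges documented for the three ANN tuning parameters can be checked before a job is submitted. The sketch below uses the bounds listed above. The function name and the example values are hypothetical.

```python
from typing import Dict, List

# Inclusive bounds copied from the parameter descriptions above.
ANN_RANGES = {
    "maxNeighbors": (5, 100),
    "searchNN": (100, 2000),
    "indexNN": (100, 2000),
}


def out_of_range_ann_params(config: Dict[str, int]) -> List[str]:
    """Return one message per ANN parameter that falls outside its range."""
    problems = []
    for name, (low, high) in ANN_RANGES.items():
        if name in config and not low <= config[name] <= high:
            problems.append(f"{name}={config[name]} is outside [{low}, {high}]")
    return problems


print(out_of_range_ann_params({"maxNeighbors": 200, "searchNN": 500}))
# ['maxNeighbors=200 is outside [5, 100]']
```

Parameters that are absent from the configuration are skipped, since none of the three is marked as required.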
type - string, required
Default: argo-item-recommender-content
Allowed values: argo-item-recommender-content
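Putting the required fields together, a minimal configuration might look like the sketch below, built as a Python dictionary and serialized to JSON. The job ID, collection names, and content field name are hypothetical placeholders. The `type`, format, and `itemIdField` values are the documented defaults.

```python
import json

# Fields marked as required in this reference.
REQUIRED_FIELDS = [
    "id", "type", "trainingCollection", "trainingFormat",
    "outputCollection", "outputFormat", "itemIdField", "contentField",
]

job_config = {
    "id": "content-item-recs",                # hypothetical job ID
    "type": "argo-item-recommender-content",  # the only allowed value
    "trainingCollection": "products",         # hypothetical collection
    "trainingFormat": "solr",
    "outputCollection": "products_recs",      # hypothetical collection
    "outputFormat": "solr",
    "itemIdField": "item_id_s",
    "contentField": ["description_t"],        # hypothetical field name
    "numSimsPerItem": 10,
    "performANN": True,
}

# Report any required field that was left out of the configuration.
missing = [name for name in REQUIRED_FIELDS if name not in job_config]
payload = json.dumps(job_config, indent=2)
print("missing required fields:", missing)
print(payload)
```

How the resulting JSON is submitted depends on the deployment and is not covered here.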