    Fusion 5.12

    Content-Based Recommender Jobs (Experimental)

    Use this job when you want to compute item similarities based on their content, such as product descriptions.

    Default job name

    COLLECTION_NAME_content_recs

    Input

    Searchable content from the primary collection.

    Output

    Items-for-item recommendations (the COLLECTION_NAME_content_recs collection by default)

    First, item content is vectorized; different vectorization methods are available. Then, similar items are selected based on cosine similarity ("nearest neighbor") between their vectors.
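
    As a rough illustration of the second step, the sketch below ranks items by cosine similarity between their vectors. It uses exact brute-force search over hypothetical toy vectors; the function names and data are assumptions for illustration, not the job's actual implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def nearest_neighbors(item_id, vectors, k):
    """Return the k item IDs most similar to item_id, best first."""
    query = vectors[item_id]
    scored = [
        (cosine_similarity(query, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != item_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other_id for _, other_id in scored[:k]]

# Hypothetical toy vectors standing in for vectorized product descriptions.
item_vectors = {
    "shoe_red": [0.9, 0.1, 0.0],
    "shoe_blue": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}

print(nearest_neighbors("shoe_red", item_vectors, k=2))
```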

    At a minimum, you must specify these:

    • An ID for this job

    • The name of the training collection, that is, the collection with your content

    • An output collection; create a separate collection for this

    • The name of the ID field for documents in the training collection, such as item_id_s

    • The names of one or more content fields in the training collection
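
    The minimum settings above correspond to parameters in the configuration reference on this page. A minimal sketch of such a configuration, built as a Python dictionary and serialized to JSON, might look like the following. The collection and field values (products, description_t, and so on) are placeholders, and how the configuration is submitted to Fusion is not shown.

```python
import json

# Placeholder values; substitute your own collection and field names.
minimal_job = {
    "type": "argo-item-recommender-content",
    "id": "products_content_recs",
    "trainingCollection": "products",
    "trainingFormat": "solr",
    "outputCollection": "products_content_recs",
    "outputFormat": "solr",
    "itemIdField": "item_id_s",
    "contentField": ["description_t"],  # one or more content fields
}

print(json.dumps(minimal_job, indent=2))
```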

    Content-based recommendations dataflow

    Tuning tips

    • Configure the Metadata fields for item-item evaluation option so that those fields are used during evaluation to determine whether recommendation pairs belong to the same category.

    • The Perform approximate nearest neighbor search option is enabled by default. It significantly reduces the job’s running time, with a small decrease in accuracy. If your training dataset is very small, you can disable this option.

    • If your content contains a lot of domain-specific jargon, enable Use Word2Vec for vectorization.

    • If your documents are too short or too long, enable Use TF-IDF for vectorization.

    Query pipeline setup

    Download the APPName_item_item_rec_pipelines_content.json file and import it to create the query pipeline that consumes this job’s output. See Fetch Content-Based Items-for-Item Recommendations for details.

    id - string, required

    The ID for this job, used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_). The ID must start with a letter.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
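
    The length limit and match pattern above can be checked before a job is created. The sketch below assumes the pattern must match the entire ID; the helper name is hypothetical.

```python
import re

# Pattern and length limit copied from the id parameter description.
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_ID_LENGTH = 63

def is_valid_job_id(job_id):
    """True if job_id fits the documented pattern and length limit."""
    return (
        len(job_id) <= MAX_ID_LENGTH
        and ID_PATTERN.fullmatch(job_id) is not None
    )

print(is_valid_job_id("products_content_recs"))
print(is_valid_job_id("1_recs"))
```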

    sparkConfig - array[object]

    Provide additional key/value pairs to be injected into the training JSON map at runtime. Values are inserted as-is, so surround string values with double quotes (").

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }
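
    Because sparkConfig values are inserted as-is, string values need their own double quotes. The sketch below shows one way to build the key/value array from a Python dictionary. The helper and the parameter names are hypothetical and chosen only for illustration.

```python
import json

def to_key_value_list(settings):
    """Convert a dict into the [{"key": ..., "value": ...}] array shape."""
    pairs = []
    for key, value in settings.items():
        # json.dumps wraps Python strings in double quotes and leaves
        # numbers bare, which matches the "inserted as-is" note above.
        pairs.append({"key": key, "value": json.dumps(value)})
    return pairs

# Hypothetical parameter names, for illustration only.
spark_config = to_key_value_list(
    {"exampleStringParam": "abc", "exampleIntParam": 4}
)
print(spark_config)
```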

    writeOptions - array[object]

    Options used when writing output to Solr or other sources

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }

    outputBatchSize - string

    Batch size of documents when pushing results to Solr.

    Default: 15000

    unidecodeText - boolean

    Select to unidecode the text, that is, to transliterate Unicode characters to their closest ASCII equivalents.

    Default: true

    lowercaseText - boolean

    Select if you want the text to be lowercased.

    Default: true
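
    The effect of the unidecodeText and lowercaseText options can be approximated with the Python standard library, as sketched below. This is only an approximation under stated assumptions: it strips diacritics through Unicode decomposition, whereas a full unidecode step would also transliterate non-Latin scripts. The function name is hypothetical.

```python
import unicodedata

def normalize_text(text, unidecode_text=True, lowercase_text=True):
    """Approximate the two preprocessing flags with stdlib tools."""
    if unidecode_text:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
    if lowercase_text:
        text = text.lower()
    return text

print(normalize_text("Café Crème"))
```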

    vectorizationUseDl - boolean

    Select to use deep learning as the vectorization method. If you also select other methods, an ensemble is used.

    Default: true

    vectorizationUseFasttext - boolean

    Select to use Word2Vec as the vectorization method. Custom embeddings are learned from your content, which is useful for domain-specific jargon. If you also select other methods, an ensemble is used.

    vectorizationUseTfidf - boolean

    Select to use TF-IDF as the vectorization method. If you also select other methods, an ensemble is used.

    vectorizationDlEnsembleWeight - number

    Ensemble weight for deep learning based vectorization if more than one method of vectorization is selected.

    Default: 1

    vectorizationFasttextVectorsSize - integer

    Word vector dimensions for Word2Vec vectorizer.

    >= 1

    exclusiveMinimum: false

    Default: 150

    vectorizationFasttextWindowSize - integer

    The window size (context words from [-window, window]) for Word2Vec.

    >= 1

    exclusiveMinimum: false

    Default: 5
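
    To make the window size concrete, the sketch below lists the (center word, context word) pairs that a window of a given size produces for a short token sequence. It illustrates the general Word2Vec notion of a context window; the function name is hypothetical and the job's training code may differ.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for offsets within [-window, window]."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

print(context_pairs(["red", "running", "shoes"], window=1))
```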

    vectorizationFasttextEpochs - integer

    Number of epochs to train custom Word2Vec embeddings.

    >= 1

    exclusiveMinimum: false

    Default: 15

    vectorizationFasttextMaxVocabSize - integer

    Maximum number of tokens to keep in the vocabulary. Less frequent tokens are omitted.

    >= 1

    exclusiveMinimum: false

    vectorizationFasttextEnsembleWeight - number

    Ensemble weight for Fasttext based vectorization if more than one method of vectorization is selected.

    Default: 1

    vectorizationTfidfUseCharacters - boolean

    Whether to use characters instead of words as tokens. Words are used by default.

    vectorizationTfidfFilterStopwords - boolean

    Whether to filter out stopwords before generating TF-IDF weights.

    Default: true

    vectorizationTfidfMinNgram - integer

    Minimum n-gram size to be used.

    >= 1

    exclusiveMinimum: false

    Default: 1

    vectorizationTfidfMaxNgram - integer

    Maximum n-gram size to be used.

    >= 1

    exclusiveMinimum: false

    Default: 3
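
    With the default minimum of 1 and maximum of 3, the vectorizer considers n-grams of one, two, and three tokens. The sketch below generates word n-grams for that range. It is a plain illustration of n-gram extraction with a hypothetical function name, not the job's vectorizer.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

print(word_ngrams(["red", "running", "shoes"]))
```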

    vectorizationTfIdfMaxVocabSize - integer

    Maximum number of tokens to keep in the vocabulary. Less frequent tokens are omitted.

    >= 1

    exclusiveMinimum: false

    vectorizationTfidfEnsembleWeight - number

    Ensemble weight for TF-IDF based vectorization if more than one method of vectorization is selected.

    Default: 1
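
    This page does not describe how the ensemble weights are applied. One common way to combine vectors from several methods, shown here purely as an assumption, is to L2-normalize each method's vector, scale it by its weight, and concatenate the results. The function names and the method keys are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def ensemble_vector(method_vectors, weights):
    """Concatenate per-method vectors after normalizing and weighting each.

    Assumed combination scheme for illustration only; the job's actual
    ensemble logic is not documented on this page.
    """
    combined = []
    for method, vec in method_vectors.items():
        weight = weights.get(method, 1.0)  # documented default weight is 1
        combined.extend(weight * x for x in l2_normalize(vec))
    return combined

vec = ensemble_vector(
    {"dl": [3.0, 4.0], "tfidf": [0.0, 2.0]},
    {"dl": 1.0, "tfidf": 2.0},
)
print(vec)
```

    Under this scheme, a larger weight makes that method's similarities count for more in the overall cosine similarity.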

    topKAnn - integer

    The number of candidate recommendations to fetch, so that the requested number of recommendations per item is most likely still satisfied after filtering. This is normally set to 10 * (number of item recommendations to compute).

    >= 1

    exclusiveMinimum: false

    Default: 100
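
    The rule of thumb above matches the defaults: with numSimsPerItem at 10, topKAnn is 100. The sketch below shows the arithmetic and the general over-fetch-then-filter idea. The filter predicate is hypothetical, since this page does not say which filtering the job applies.

```python
def recommended_top_k_ann(num_sims_per_item, overfetch_factor=10):
    """Rule of thumb from the parameter description: 10 * recommendations."""
    return overfetch_factor * num_sims_per_item

def top_recommendations(candidates, is_allowed, num_sims_per_item):
    """Filter over-fetched candidates, then keep the requested number."""
    kept = [c for c in candidates if is_allowed(c)]
    return kept[:num_sims_per_item]

top_k = recommended_top_k_ann(10)
candidates = list(range(top_k))  # stand-in for ranked candidate item IDs
# Hypothetical filter that rejects half of the candidates.
result = top_recommendations(candidates, lambda c: c % 2 == 0, 10)
print(top_k, result)
```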

    jobRunName - string

    Identifier for this job run. Use it to filter recommendations from particular runs.

    trainingCollection - string, required

    Solr collection or cloud storage path where training data is present.

    >= 1 characters

    trainingFormat - string, required

    The format of the training data, such as solr or parquet.

    >= 1 characters

    Default: solr

    secretName - string

    Name of the secret used to access cloud storage, as defined in the Kubernetes namespace.

    >= 1 characters

    outputCollection - string, required

    Solr collection or cloud storage path where output data is to be written.

    outputFormat - string, required

    The format of the output data, such as solr or parquet.

    >= 1 characters

    Default: solr

    partitionFields - string

    When writing to non-Solr sources, a comma-delimited list of column names used to partition the dataframe before writing to the external output.

    numSimsPerItem - integer

    Number of recommendations that will be saved per item.

    >= 1

    exclusiveMinimum: false

    Default: 10

    deleteOldRecs - boolean

    Whether to delete previous recommendations. If this option is disabled, old recommendations are kept and new recommendations are appended with a different job ID, so both sets of recommendations are contained in the same collection. This only works when the output format is solr.

    Default: true

    excludeFromDeleteFilter - string

    If the 'Delete Old Recommendations' flag is enabled, then use this query filter to identify existing recommendation docs to exclude from delete. The filter should identify recommendation docs you want to keep.

    metadataCategoryFields - array[string]

    These fields will be used for item-item evaluation and for determining if the recommendation pair belongs to the same category.

    trainingDataFilterQuery - string

    Solr or SQL query to filter training data. Use a Solr query when a Solr collection is specified in Training Path. Use a SQL query when a cloud storage location is specified. The table name for SQL is `spark_input`.
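
    The two cases might look like the fragments below. The field names, the bucket path, and the exact shape of the SQL statement are assumptions for illustration; only the parameter names and the spark_input table name come from this page.

```python
# Solr collection as the training source: use Solr query syntax.
solr_source = {
    "trainingCollection": "products",  # placeholder collection name
    "trainingFormat": "solr",
    "trainingDataFilterQuery": "in_stock_b:true",  # hypothetical field
}

# Cloud storage as the training source: use SQL against spark_input.
cloud_source = {
    "trainingCollection": "gs://example-bucket/products/",  # placeholder path
    "trainingFormat": "parquet",
    "trainingDataFilterQuery": (
        "SELECT * FROM spark_input WHERE in_stock = true"  # hypothetical column
    ),
}

print(solr_source["trainingDataFilterQuery"])
print(cloud_source["trainingDataFilterQuery"])
```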

    trainingSampleFraction - number

    Choose a fraction of the data for training.

    <= 1

    exclusiveMaximum: false

    Default: 1

    itemIdField - string, required

    Name of the field containing stored item IDs.

    >= 1 characters

    Default: item_id_s

    contentField - array[string], required

    Names of the fields containing item content, such as product descriptions.

    randomSeed - integer

    Seed for pseudorandom number generation. Keep this seed constant to make results reproducible across runs.

    Default: 12345
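
    The interplay between trainingSampleFraction and randomSeed can be sketched as follows: with a fixed seed, the same subset of documents is selected on every run. The sampling scheme shown (an independent keep-or-drop decision per document) and the function name are assumptions; the job's actual sampling mechanism is not described on this page.

```python
import random

def sample_training_docs(docs, fraction=1.0, seed=12345):
    """Keep each doc with probability `fraction`, reproducibly per seed."""
    if fraction > 1:
        raise ValueError("trainingSampleFraction must be <= 1")
    rng = random.Random(seed)  # private generator, so the seed is isolated
    return [doc for doc in docs if rng.random() < fraction]

docs = [f"doc_{i}" for i in range(1000)]
first = sample_training_docs(docs, fraction=0.1)
second = sample_training_docs(docs, fraction=0.1)
print(first == second)  # same seed, same sample
```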

    itemMetadataFields - array[string]

    List of item metadata fields to include in the recommendation output documents.

    vectorizationDlBatchSize - integer

    Batch size for computing encodings. Lower it if the hardware runs out of memory.

    >= 1

    exclusiveMinimum: false

    performANN - boolean

    Whether to perform approximate nearest neighbor (ANN) search. ANN drastically reduces training time at the cost of a small drop in accuracy. Disable it only if the dataset is very small.

    Default: true

    maxNeighbors - integer

    If performing ANN, the number of potential neighbors considered in the indexing phase. Higher values lead to better recall and shorter retrieval times, at the expense of longer indexing time. Reasonable range: 5 to 100.

    >= 5

    <= 100

    exclusiveMinimum: false

    exclusiveMaximum: false

    searchNN - integer

    If performing ANN, the depth of search used to find neighbors. Higher values improve recall at the expense of longer retrieval time. Reasonable range: 100 to 2000.

    >= 100

    <= 2000

    exclusiveMinimum: false

    exclusiveMaximum: false

    indexNN - integer

    If performing ANN, the depth of the constructed index. Higher values improve recall at the expense of longer indexing time. Reasonable range: 100 to 2000.

    >= 100

    <= 2000

    exclusiveMinimum: false

    exclusiveMaximum: false
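
    The documented ranges for the three ANN parameters can be checked before submitting a configuration. The sketch below is a hypothetical client-side check, not part of Fusion; it treats the range bounds as inclusive, as the exclusiveMinimum and exclusiveMaximum values above indicate.

```python
# Inclusive ranges taken from the parameter descriptions above.
ANN_RANGES = {
    "maxNeighbors": (5, 100),
    "searchNN": (100, 2000),
    "indexNN": (100, 2000),
}

def check_ann_settings(config):
    """Return messages for ANN settings outside their documented range."""
    if not config.get("performANN", True):
        # The descriptions say these apply "if performing ANN".
        return []
    problems = []
    for name, (low, high) in ANN_RANGES.items():
        if name in config and not low <= config[name] <= high:
            problems.append(f"{name}={config[name]} is outside {low}..{high}")
    return problems

print(check_ann_settings({"maxNeighbors": 3, "searchNN": 500}))
```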

    type - string, required

    Default: argo-item-recommender-content

    Allowed values: argo-item-recommender-content