BPR Recommender Jobs

Table of Contents

Tuning tips
Query pipeline setup
Configuration properties

Use this job when you want to compute user recommendations or item similarities using a Bayesian Personalized Ranking (BPR) recommender algorithm.

The ALS recommender job is deprecated. Use this recommender job instead.

Default job name

COLLECTION_NAME_bpr_item_recs

Input

Aggregated signals (the COLLECTION_NAME_recs_aggr collection by default)

Output

Items-for-item recommendations (the COLLECTION_NAME_bpr_item_recs collection by default)
Items-for-user recommendations (the COLLECTION_NAME_bpr_user_recs collection by default)

Required signals fields:

query
count_i
type
timestamp_tdt
user_id
doc_id
session_id
fusion_query_id

This job assumes that your signals collection contains the preferences of many users. It uses these preferences to predict another user’s preference for an item that the user has not yet seen. Each training record supplies three values:

User. Use Training Collection User Id Field to specify the name of the user ID field, usually user_id_s.
Item. Use Training Collection Item Id Field to specify the name of the item ID field, usually item_id_s.
Interaction-value. Use Training Collection Counts/Weights Field to specify the name of the interaction value field, usually aggr_count_i.

Unlike ALS-based recommenders, which train on static 0/1 input values, BPR-based recommenders learn by comparing pairs of candidate recommendations for each user.
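To make the pairwise idea concrete, here is a minimal sketch of one stochastic gradient step on the standard BPR objective for a single (user, clicked item, unclicked item) triple. It is an illustration only, not this job's actual implementation; the function name bpr_sgd_step and the reg (regularization) parameter are hypothetical, and the learning rate simply reuses this job's documented default of 0.05.

```python
import math

def bpr_sgd_step(user_vec, pos_vec, neg_vec, learning_rate=0.05, reg=0.01):
    """Apply one BPR gradient step for a (user, positive item, negative item) triple.

    All vectors are plain lists of floats with the same length and are updated in place.
    Returns the pairwise loss -ln(sigmoid(x_uij)) measured before the update.
    """
    # x_uij: how much higher the user scores the clicked item than the unclicked one.
    x_uij = sum(u * (p - n) for u, p, n in zip(user_vec, pos_vec, neg_vec))
    loss = math.log1p(math.exp(-x_uij))
    # sigmoid(-x_uij): gradient magnitude of ln(sigmoid(x_uij)) with respect to x_uij.
    g = 1.0 / (1.0 + math.exp(x_uij))
    for k in range(len(user_vec)):
        u, p, n = user_vec[k], pos_vec[k], neg_vec[k]
        user_vec[k] += learning_rate * (g * (p - n) - reg * u)
        pos_vec[k] += learning_rate * (g * u - reg * p)
        neg_vec[k] += learning_rate * (-g * u - reg * n)
    return loss
```

Repeating this step over many sampled triples pushes clicked items above unclicked ones in each user's ranking, which is the comparison-based behavior described above.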

You can also configure this job to read from or write to cloud storage. See Configure An Argo-Based Job to Access GCS and Configure An Argo-Based Job to Access S3.

BPR collaborative recommendations dataflow


If you use Solr as the training data source, ensure that the source collection's managed-schema defines the random_* dynamic field. This field is required for sampling the data. If it is not present, add <dynamicField name="random_*" type="random"/> alongside the other dynamic fields and <fieldType class="solr.RandomSortField" indexed="true" name="random"/> alongside the other field types.

Tuning tips

The BPR Recommender job has a few unique tuning parameters compared to the ALS Recommender job:

Training Data Filtered By Popular Items

By setting the minimum number of user interactions required for items to be included in training and recommendations, you can suppress items that do not yet have enough signals data for meaningful recommendations.

Filter already clicked items

This feature produces only "fresh" recommendations, by omitting items the user has already clicked. (It also increases the job’s running time.)

Perform approximate nearest neighbor search

This option reduces the job’s running time significantly, with a small decrease in accuracy. If your training dataset is very small, then you can disable this option.

Evaluate on test data

This feature samples the original dataset to evaluate how well the trained model predicts unseen user interactions. The clicks that are sampled for testing are not used for training. For example, with the default configuration, users who have at least three total clicks are selected for testing. For each of those users, one click is used for testing and the rest are used for training. The trained model is applied to the test data, and the evaluation results are written to the log.

Metadata fields for item-item evaluation

These fields are used during evaluation to determine whether pairs belong to the same category.
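The train/test sampling described under "Evaluate on test data" can be sketched as follows. This is an assumption-laden illustration rather than the job's code: the function name split_train_test and its parameter names are hypothetical, although their default values mirror the documented defaults for minNumClickedProducts (3), numTestUserClicks (1), maxNumTestUsers (10000), and randomSeed (12345). User IDs are assumed to be strings and each (user, item) pair is assumed to appear once.

```python
import random
from collections import defaultdict

def split_train_test(clicks, min_clicked=3, num_test_clicks=1,
                     max_test_users=10000, seed=12345):
    """Split (user_id, item_id) pairs into train and test lists.

    Users with at least min_clicked items are test candidates. For each chosen
    test user, num_test_clicks items are held out; everything else is training data.
    """
    rng = random.Random(seed)
    items_by_user = defaultdict(list)
    for user, item in clicks:
        items_by_user[user].append(item)

    # Pick test users among those with enough clicks, capped at max_test_users.
    eligible = sorted(u for u, items in items_by_user.items()
                      if len(items) >= min_clicked)
    rng.shuffle(eligible)
    test_users = set(eligible[:max_test_users])

    train, test = [], []
    for user, items in items_by_user.items():
        held_out = set(rng.sample(items, num_test_clicks)) if user in test_users else set()
        for item in items:
            (test if item in held_out else train).append((user, item))
    return train, test
```

With the defaults, a user with three clicks contributes one click to the test set and two to training, while a user with only two clicks is used for training only.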

Query pipeline setup

For items-for-item recommendations, download the APPName_item_item_rec_pipelines_bpr.json file and import it to create the query pipeline that consumes this job’s output. See Fetch Items-for-Item Recommendations (Collaborative/BPR Method) for details.
For items-for-user recommendations, download the APPName_item_user_rec_pipelines_bpr.json file and import it to create the query pipeline that consumes this job’s output. See Fetch Items-for-User Recommendations (Collaborative/BPR Method) for details.

Configuration properties

Use this job when you want to compute user recommendations or item similarities using a Bayesian Personalized Ranking recommender. You can also implement a user-to-item recommender in the advanced section of this job’s configuration UI.

id - string, required

The ID for this job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_)

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
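As a quick way to check a candidate job ID against the two constraints above before submitting it, the following sketch applies the documented pattern and length limit. The helper name is_valid_job_id is hypothetical, and treating the pattern as a whole-string match is an assumption about how the constraint is enforced.

```python
import re

# Pattern and length limit copied from the constraints listed for the id property.
JOB_ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_JOB_ID_LENGTH = 63

def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id fits the length limit and matches the pattern in full."""
    return (len(job_id) <= MAX_JOB_ID_LENGTH
            and JOB_ID_PATTERN.fullmatch(job_id) is not None)
```

For example, an ID such as products_bpr_item_recs passes, while an ID that starts with a digit or contains a space does not.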

sparkConfig - array[object]

Provide additional key/value pairs to be injected into the training JSON map at runtime. Values are inserted as-is, so surround string values with double quotes (").

object attributes:
key (required) - display name: Parameter Name; type: string
value - display name: Parameter Value; type: string

writeOptions - array[object]

Options used when writing output to Solr or other sources.

object attributes:
key (required) - display name: Parameter Name; type: string
value - display name: Parameter Value; type: string

readOptions - array[object]

Options used when reading input from Solr or other sources.

object attributes:
key (required) - display name: Parameter Name; type: string
value - display name: Parameter Value; type: string

outputBatchSize - string

Number of documents per batch when pushing results to Solr.

Default: 15000

jobRunName - string

Identifier for this job run. Use it to filter recommendations from particular runs.

trainingCollection - string, required

Solr collection or cloud storage path where training data is present.

>= 1 characters

trainingFormat - string, required

The format of the training data, such as solr or parquet.

>= 1 characters

Default: solr

secretName - string

Name of the secret used to access cloud storage as defined in the K8s namespace

>= 1 characters

outputUserRecsCollection - string

Solr collection or cloud storage path to store batch-predicted user/item recommendations (if absent, none computed). Specify at least one of Items-Users Output Collection or Items-Items Output Collection.

>= 1 characters

outputItemSimCollection - string

Solr collection or cloud storage path to store batch-computed item/item similarities (if absent, none computed). Specify at least one of Items-Users Output Collection or Items-Items Output Collection.

>= 1 characters

outputFormat - string, required

The format of the output data, such as solr or parquet.

>= 1 characters

Default: solr

partitionFields - string

If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output.

numRecsPerUser - integer

Number of recommendations that will be saved per user.

exclusiveMinimum: false

Default: 10

userTopkAnn - integer

Applies only when Filter Already Clicked Items is enabled. This is used to fetch additional recommendations so that the value specified for the Number of Recommendations Per User is most likely satisfied with filtering turned on.
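The interaction between userTopkAnn, Filter Already Clicked Items, and Number of Recommendations Per User can be pictured with a small sketch. It assumes, hypothetically, that the job first retrieves a candidate list of size userTopkAnn sorted by descending score and then filters it; the function name top_unclicked is illustrative and not part of the product.

```python
def top_unclicked(candidates, clicked, num_recs_per_user=10):
    """Drop already clicked items from an over-fetched candidate list.

    candidates: list of (item_id, score) pairs sorted by descending score.
    clicked: set of item IDs the user has already clicked.
    Returns at most num_recs_per_user pairs, preserving the score order.
    """
    fresh = [(item, score) for item, score in candidates if item not in clicked]
    return fresh[:num_recs_per_user]
```

If the candidate list were only as long as the number of recommendations wanted, every clicked item removed would leave the user one recommendation short, which is why the over-fetch size should exceed numRecsPerUser.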

exclusiveMinimum: false

numSimsPerItem - integer

Number of recommendations that will be saved per item.

exclusiveMinimum: false

Default: 10

deleteOldRecs - boolean

Whether to delete previous recommendations. If this box is unchecked, old recommendations are kept and new recommendations are appended with a different Job ID, so both sets of recommendations are contained within the same collection. This option works only when the output is written to Solr.

Default: true

excludeFromDeleteFilter - string

If the 'Delete Old Recommendations' flag is enabled, then use this query filter to identify existing recommendation docs to exclude from delete. The filter should identify recommendation docs you want to keep.

filterClicked - boolean

Whether to filter out items the user has already clicked from the items-for-user recommendations. Takes more time but drastically improves quality.

Default: true

weightField - string

Solr field name containing the stored counts/weights the user has for that item. This field is used as the weight during training.

Default: aggr_count_i

trainingDataFilterQuery - string

Solr or SQL query to filter training data. Use a Solr query when a Solr collection is specified in Training Path. Use a SQL query when a cloud storage location is specified. The table name for SQL is `spark_input`.

trainingSampleFraction - number

Choose a fraction of the data for training.

<= 1

exclusiveMaximum: false

Default: 1

userIdField - string, required

Solr field name in the training collection that contains stored User ID.

>= 1 characters

Default: user_id_s

itemIdField - string, required

Solr field name in the training collection that contains stored Item ID.

>= 1 characters

Default: item_id_s

randomSeed - integer

Seed for the pseudorandom number generator. Keep this value constant to make runs deterministic.

Default: 12345

itemMetadataFields - array[string]

List of item metadata fields to include in the recommendation output documents. WARNING: Adding many fields can lead to huge output sizes or OOM issues.

itemMetadataCollection - string

Cloud storage path or Solr collection containing item metadata fields you want to add to the recommendation output documents. Leave blank and fill in the metadata fields if you want to fetch data from the training collection. Join field needs to be specified.

itemMetadataFormat - string

The format of the metadata, such as solr or parquet.

>= 1 characters

Default: solr

itemMetadataJoinField - string

Name of field in the item metadata collection to join on.

performANN - boolean

Whether to perform approximate nearest neighbor search (ANN). ANN will drastically reduce training time, but accuracy will drop a little. Disable only if training dataset is very small.

Default: true

maxNeighbors - integer

If ANN is enabled, the number of potential neighbors considered during the indexing phase. A higher value leads to better recall and shorter retrieval times (at the expense of longer indexing time). Reasonable range: 5~100

>= 100

<= 2000

exclusiveMinimum: false

exclusiveMaximum: false

searchNN - integer

If ANN is enabled, the depth of search used to find neighbors. A higher value improves recall at the expense of longer retrieval time. Reasonable range: 100~2000

>= 100

<= 2000

exclusiveMinimum: false

exclusiveMaximum: false

indexNN - integer

If ANN is enabled, the depth of the constructed index. A higher value improves recall at the expense of longer indexing time. Reasonable range: 100~2000

>= 100

<= 2000

exclusiveMinimum: false

exclusiveMaximum: false

factors - integer

Latent factor dimension used for matrix decomposition. Bigger values require more time and memory but usually provide better results.

>= 1

exclusiveMinimum: false

Default: 100

epochs - integer

Number of model training iterations. Model will converge better with larger number at the expense of increased training time. For bigger datasets use smaller values.

>= 1

exclusiveMinimum: false

Default: 30

learningRate - number

Model learning rate.

Default: 0.05

metadataCategoryFields - array[string]

These fields are used for item-item evaluation, to determine whether the items in a recommendation pair belong to the same category.

minNumItemUniqueClicks - integer

Items must have at least this number of unique user interactions to be included in training and recommendations. A higher value selects more popular items but reduces the amount of training data.

>= 1

exclusiveMinimum: false

Default: 2

minNumUserUniqueClicks - integer

Users must have at least this number of unique item interactions to be included in training and recommendations. A higher value selects more active users but reduces the amount of training data.

>= 1

exclusiveMinimum: false

Default: 2
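The effect of the two minimum-interaction thresholds can be sketched as a simple filter over (user, item) pairs. This is an illustration under stated assumptions, not the job's implementation: the function name filter_by_min_unique_clicks is hypothetical, and applying the item filter first, then the user filter, in a single pass is an assumed ordering.

```python
from collections import defaultdict

def filter_by_min_unique_clicks(interactions, min_item_users=2, min_user_items=2):
    """Keep only interactions whose item and user both meet the minimum counts.

    interactions: list of (user_id, item_id) pairs.
    min_item_users mirrors minNumItemUniqueClicks; min_user_items mirrors
    minNumUserUniqueClicks.
    """
    # Count distinct users per item, then drop unpopular items.
    users_per_item = defaultdict(set)
    for user, item in interactions:
        users_per_item[item].add(user)
    kept = [(u, i) for u, i in interactions
            if len(users_per_item[i]) >= min_item_users]

    # Count distinct remaining items per user, then drop inactive users.
    items_per_user = defaultdict(set)
    for user, item in kept:
        items_per_user[user].add(item)
    return [(u, i) for u, i in kept
            if len(items_per_user[u]) >= min_user_items]
```

Raising either threshold shrinks the returned list, which is the trade-off between signal quality and training data volume described for these two properties.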

minNumClickedProducts - integer

Minimum number of clicked products the user should have to be a candidate for the test set.

>= 2

exclusiveMinimum: false

Default: 3

maxNumTestUsers - integer

Maximum number of test users to choose. If more users satisfying the Minimum Clicked Products criterion are present, the number will be capped to what is specified here.

exclusiveMinimum: false

Default: 10000

numTestUserClicks - integer

How many test user clicks to use for testing. Should be less than the value for Minimum Clicked Products.

>= 1

exclusiveMinimum: false

Default: 1

doEvaluation - boolean

Evaluate how well the trained model predicts user clicks. Test data will be sampled from original dataset.

type - string, required

Default: argo-item-recommender-user

Allowed values: argo-item-recommender-user
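To see how the properties above fit together, the following sketch assembles a job configuration as a Python dictionary and prints it as JSON. The property names and default values come from this page; the helper name build_bpr_job_config and the example collection name "products" are hypothetical, and whether the jobs API accepts exactly this payload shape is an assumption.

```python
import json

def build_bpr_job_config(collection_name, item_recs=True, user_recs=True):
    """Build a BPR recommender job configuration using the documented defaults."""
    # The property descriptions require at least one output collection.
    if not (item_recs or user_recs):
        raise ValueError("Specify at least one output collection")
    config = {
        "type": "argo-item-recommender-user",
        "id": f"{collection_name}_bpr_item_recs",
        "trainingCollection": f"{collection_name}_recs_aggr",
        "trainingFormat": "solr",
        "outputFormat": "solr",
        "userIdField": "user_id_s",
        "itemIdField": "item_id_s",
        "weightField": "aggr_count_i",
        "numRecsPerUser": 10,
        "numSimsPerItem": 10,
        "factors": 100,
        "epochs": 30,
        "learningRate": 0.05,
        "filterClicked": True,
        "performANN": True,
        "deleteOldRecs": True,
        "randomSeed": 12345,
    }
    if item_recs:
        config["outputItemSimCollection"] = f"{collection_name}_bpr_item_recs"
    if user_recs:
        config["outputUserRecsCollection"] = f"{collection_name}_bpr_user_recs"
    return config

print(json.dumps(build_bpr_job_config("products"), indent=2))
```

Omitting one of the two output collections skips that kind of recommendation, matching the "if absent, none computed" behavior described for the output properties.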