Content-Based Recommender Jobs

Use this job when you want to compute item similarities based on their content, such as product descriptions.

First, item content is vectorized; different vectorization methods are available. Then, similar items are selected based on cosine similarity ("nearest neighbor") between their vectors.
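The two steps above can be illustrated with a minimal sketch. It is not the job's actual implementation: it assumes a simple whitespace tokenizer and a smoothed TF-IDF weighting, and the function names (`tfidf_vectors`, `cosine`, `most_similar`) and the sample items are hypothetical. It also uses an exact search rather than an approximate one.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(items, item_id, top_k=2):
    """Rank every other item by cosine similarity to item_id (exact search)."""
    ids = list(items)
    vectors = dict(zip(ids, tfidf_vectors([items[i] for i in ids])))
    scores = [
        (other, cosine(vectors[item_id], vectors[other]))
        for other in ids
        if other != item_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical item descriptions keyed by item ID.
items = {
    "A": "red cotton running shirt",
    "B": "blue cotton running shirt",
    "C": "stainless steel kitchen knife",
}
neighbors = most_similar(items, "A")
```

Here `neighbors` ranks item B ahead of item C for item A, because A and B share most of their terms while A and C share none.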

At a minimum, you must specify these:

  • An ID for this job

  • The name of the training collection, that is, the collection with your content

  • An output collection; create a separate collection for this

  • The name of the ID field for documents in the training collection, such as item_id_s

  • The names of one or more content fields in the training collection

Note
You can also configure this job to read from or write to cloud storage. See Configure An Argo-Based Job to Access GCS and Configure An Argo-Based Job to Access S3.
Content-based recommendations dataflow


Note
If you use Solr as the training data source, make sure that the source collection's managed-schema defines the random_* dynamic field. The job requires this field to sample the data. If the field is not present, add <dynamicField name="random_*" type="random"/> alongside the other dynamic fields and <fieldType class="solr.RandomSortField" indexed="true" name="random"/> alongside the other field types.
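As background on why this field matters: a solr.RandomSortField dynamic field lets a query sort by random_<seed>, which returns documents in a pseudo-random order that is repeatable for a given seed. The sketch below only builds such a query URL. The host, collection name, seed, and the `random_sample_url` helper are assumptions for illustration, and the query the job actually issues is not documented here.

```python
from urllib.parse import urlencode


def random_sample_url(solr_base, collection, seed, rows):
    """Build a Solr select URL that returns `rows` documents in random order.

    Sorting on random_<seed> relies on the random_* dynamic field being
    backed by solr.RandomSortField in the collection's schema.
    """
    params = {
        "q": "*:*",
        "sort": f"random_{seed} asc",
        "rows": rows,
    }
    return f"{solr_base}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection name.
url = random_sample_url("http://localhost:8983/solr", "products", seed=42, rows=100)
```

Changing the seed changes the ordering, so different seeds give different samples of the same collection.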

Tuning tips

  • Set Metadata fields for item-item evaluation so that the evaluation step can use those fields to determine whether the items in a pair belong to the same category.

  • The Perform approximate nearest neighbor search option is enabled by default. It significantly reduces the job’s running time at the cost of a small decrease in accuracy. If your training dataset is very small, you can disable this option.

  • If your content contains a lot of domain-specific jargon, enable Use Word2Vec for vectorization.

  • If your documents are too short or too long, enable Use TF-IDF for vectorization.

Query pipeline setup

Download the APPName_item_item_rec_pipelines_content.json file and import it to create the query pipeline that consumes this job’s output. See Fetch Content-Based Items-for-Item Recommendations for details.