Use this job when you want to compute item similarities based on their content, such as product descriptions.
First, item content is vectorized; different vectorization methods are available. Then, similar items are selected based on cosine similarity ("nearest neighbor") between their vectors.
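The two steps above can be illustrated with a minimal sketch. It is not the job's actual implementation: it assumes a basic TF-IDF vectorization over whitespace-split tokens and an exact (not approximate) nearest-neighbor search, and the item IDs and descriptions are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn {item_id: text} into {item_id: {term: weight}} with a basic TF-IDF."""
    tokenized = {item_id: text.lower().split() for item_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of items in which each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for item_id, tokens in tokenized.items():
        tf = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors[item_id] = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(item_id, vectors, k=2):
    """Return the k items closest to item_id, highest similarity first."""
    scores = [
        (other, cosine(vectors[item_id], vec))
        for other, vec in vectors.items()
        if other != item_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical item descriptions, for illustration only.
docs = {
    "sku-1": "red cotton t-shirt with short sleeves",
    "sku-2": "blue cotton t-shirt with long sleeves",
    "sku-3": "stainless steel kitchen knife",
}
vectors = tfidf_vectors(docs)
neighbors = most_similar("sku-1", vectors, k=1)
print(neighbors)
```

In this toy data, the two t-shirts share several terms, so they are each other's nearest neighbor, while the knife shares no terms with either and has a similarity of zero.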
At a minimum, you must specify these:
- An ID for this job
- The name of the training collection, that is, the collection with your content
- An output collection; create a separate collection for this
- The name of the ID field for documents in the training collection, such as item_id_s
- The names of one or more content fields in the training collection
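As a rough illustration of the minimum settings listed above, the following sketch checks a settings dictionary before a job is submitted. The key names (job_id, training_collection, and so on) and the example values are hypothetical placeholders, not the job's real configuration field names, which may differ.

```python
# Hypothetical key names; the real job configuration may use different ones.
REQUIRED_SETTINGS = (
    "job_id",
    "training_collection",
    "output_collection",
    "item_id_field",
    "content_fields",
)


def validate_job_settings(settings):
    """Return a list of problems found in the minimum job settings."""
    problems = [
        f"missing: {key}" for key in REQUIRED_SETTINGS if not settings.get(key)
    ]
    training = settings.get("training_collection")
    # The output should go to its own collection, not the training collection.
    if training and training == settings.get("output_collection"):
        problems.append("output_collection should be a separate collection")
    return problems


# Hypothetical example values, for illustration only.
example_settings = {
    "job_id": "content-based-recs",
    "training_collection": "products",
    "output_collection": "products_item_recs",
    "item_id_field": "item_id_s",
    "content_fields": ["description_t"],
}
print(validate_job_settings(example_settings))
```

An empty list means that all five minimum settings are present and that the output collection differs from the training collection.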
Note: You can also configure this job to read from or write to cloud storage. See Configure An Argo-Based Job to Access GCS and Configure An Argo-Based Job to Access S3.
Note: If you use Solr as the training data source, ensure that the managed-schema of the source collection defines the random_* dynamic field. This field is required for sampling the data. If it is not present, add <dynamicField name="random_*" type="random"/> alongside the other dynamic fields, and add <fieldType class="solr.RandomSortField" indexed="true" name="random"/> alongside the other field types.
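One way to verify the schema requirement in the note above is to inspect a copy of the managed-schema file. The sketch below uses only Python's standard XML parser; the helper name missing_random_field_entries and the trimmed sample schema are assumptions made for illustration, and a real managed-schema contains many more entries.

```python
import xml.etree.ElementTree as ET


def missing_random_field_entries(schema_xml):
    """Return descriptions of the required entries that the schema XML lacks."""
    root = ET.fromstring(schema_xml)
    has_dynamic_field = any(
        el.get("name") == "random_*" for el in root.iter("dynamicField")
    )
    has_field_type = any(
        el.get("class") == "solr.RandomSortField" for el in root.iter("fieldType")
    )
    missing = []
    if not has_dynamic_field:
        missing.append("dynamicField random_*")
    if not has_field_type:
        missing.append("fieldType solr.RandomSortField")
    return missing


# Hypothetical, heavily trimmed schema: the field type is present,
# but the random_* dynamic field has not been added yet.
SAMPLE_SCHEMA = """
<schema name="example" version="1.6">
  <fieldType class="solr.StrField" name="string"/>
  <fieldType class="solr.RandomSortField" indexed="true" name="random"/>
  <dynamicField name="*_s" type="string"/>
</schema>
"""

print(missing_random_field_entries(SAMPLE_SCHEMA))
```

For this sample, the check reports that only the random_* dynamic field is missing, which is the entry to add before running the job.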
Tuning tips
- Configure Metadata fields for item-item evaluation so that the job uses those fields during evaluation to determine whether item pairs belong to the same category.
- The Perform approximate nearest neighbor search option is enabled by default. It significantly reduces the job’s running time, with a small decrease in accuracy. If your training dataset is very small, you can disable this option.
- If your content contains a lot of domain-specific jargon, enable Use Word2Vec for vectorization.
- If your documents are very short or very long, enable Use TF-IDF for vectorization.
Query pipeline setup
Download the APPName_item_item_rec_pipelines_content.json file and import it to create the query pipeline that consumes this job’s output. See Fetch Content-Based Items-for-Item Recommendations for details.