Classify New Documents at Index Time

You can predict the categories of new documents at index time by using the Classification job to analyze previously classified documents in your index and produce a trained classification model, then referencing that model in the Machine Learning stage of your index pipeline.

Figure: Document classification job dataflow

How to configure new document classification

  1. Navigate to Collections > Jobs > Add+ > Classification to create a new Classification job.

  2. Configure the job as follows:

    1. In the Model Deployment Name field, enter an ID for the new classification model.

    2. In the Training Data Path field, enter the collection name or cloud storage path where your main content is stored.

    3. In the Training Data Format field, leave the default solr value if the Training Data Path is a collection. Otherwise, specify the format of your data in cloud storage.

    4. In the Training collection content field, enter the name of the field that contains the content to analyze.

      The content field that you choose depends on your use case and the types of queries that your users commonly make.

      For example, you could choose the description field if users tend to make descriptive queries like "4k TV" or "soft waterproof jacket". But if users are more likely to search for specific brands or products, such as "LG TV" or "North Face jacket", then the product name field might be more suitable.

    5. In the Training collection class field, enter the name of the field that contains the category data.

      Tip
      For additional configuration details, see Best practices below. An example training document is shown after this procedure.
  3. Save the job.

  4. Specify the model’s name in the Machine Learning stage of your index pipeline.

  5. In the Model input transformation script field, enter the following:

    /*
    Name of the document field to feed into the model.
    */
    var documentFeatureField = "body_t"
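    // Note: "body_t" is only an example; use the field that carries the same
    // kind of content as the Training collection content field used to train the model.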
    
    /*
    Model input construction.
    */
    var modelInput = new java.util.HashMap()
    modelInput.put("text", doc.getFirstFieldValue(documentFeatureField))
    modelInput
  6. In the Model output transformation script field, enter the following:

    // Document field names for the top 1 prediction and, if the model returns them, the top_k predictions.
    var top1ClassField = "top_1_class_s"
    var top1ScoreField = "top_1_score_d"
    var topKClassesField = "top_k_classes_ss"
    var topKScoresField = "top_k_scores_ds"
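
    // The parsing below assumes a Seldon-style response in which "names" lists the
    // output field names and "ndarray" holds their values. A hypothetical example:
    // {"names": ["top_1_class", "top_1_score"], "ndarray": [["televisions"], [0.93]]}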
    
    var jsonOutput = JSON.parse(modelOutput.get("_rawJsonResponse"))
    var parsedOutput = {}
    for (var i = 0; i < jsonOutput["names"].length; i++) {
      parsedOutput[jsonOutput["names"][i]] = jsonOutput["ndarray"][i]
    }
    
    doc.addField(top1ClassField, parsedOutput["top_1_class"][0])
    doc.addField(top1ScoreField, parsedOutput["top_1_score"][0])
    if ("top_k_classes" in parsedOutput) {
        doc.addField(topKClassesField, new java.util.ArrayList(parsedOutput["top_k_classes"][0]))
        doc.addField(topKScoresField, new java.util.ArrayList(parsedOutput["top_k_scores"][0]))
    }
  7. Click Apply.

  8. Save the index pipeline.
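
Example training document

A document in the training collection might look like the following. This is a hypothetical example: the body_t and category_s fields stand in for whichever fields you entered as Training collection content and Training collection class.

{
  "id": "prod-101",
  "body_t": "55-inch 4K Ultra HD Smart TV with HDR support",
  "category_s": "televisions"
}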

Custom output transformation script example
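If your model returns its predictions directly as named outputs, so that the modelOutput map can be read without parsing _rawJsonResponse, a simpler script like the following may be all you need: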

// Document field names for the top 1 prediction.
var top1ClassField = "top_1_class_s"
var top1ScoreField = "top_1_score_d"

// Write the predicted class and its score onto the document.
doc.addField(top1ClassField, modelOutput.get("top_1_class")[0])
doc.addField(top1ScoreField, modelOutput.get("top_1_score")[0])

Best practices for configuring the Classification job

These tips describe how to tune the options under Vectorization Parameters for best results with different use cases.

Query intent / short texts

If you want to train a model to predict query intent or to classify short texts, enable Use Characters.

Another vectorization parameter that can improve model quality is Max Ngram size; for character ngrams, reasonable values are between 3 and 5.

The more character ngrams are used, the bigger the vocabulary becomes, so it is worth tuning the Maximum Vocab Size parameter, which controls how many unique tokens are used. Lower values make training faster and help prevent overfitting, but they can also reduce quality, so it's important to find a good balance.
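
To see why the vocabulary grows, consider a minimal sketch (illustrative only; not part of the job configuration) that enumerates the character ngrams of sizes 3 through 5 for a short query:

// Illustrative sketch: enumerate character ngrams of sizes minN through maxN.
function charNgrams(text, minN, maxN) {
  var grams = []
  for (var n = minN; n <= maxN; n++) {
    for (var i = 0; i + n <= text.length; i++) {
      grams.push(text.substring(i, i + n))
    }
  }
  return grams
}

// Even a 5-character query yields 6 ngrams at sizes 3-5; across a corpus,
// every distinct ngram becomes a vocabulary entry.
// charNgrams("4k tv", 3, 5) -> ["4k ", "k t", " tv", "4k t", "k tv", "4k tv"]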

Activating the advanced Sublinear TF option, which dampens raw term frequency counts (typically by replacing a term frequency tf with 1 + log(tf)), usually helps when characters are used.

Documents / long texts

If you want to train a model to predict classes for documents or long texts like one or more paragraphs, then uncheck Use Characters.

Reasonable values for a word-based Max Ngram size are 2–3. Be sure to tune the Maximum Vocab Size parameter too. Usually it's better to leave the advanced Sublinear TF option deactivated.

Performance tuning

If the text is very long and Use Characters is checked, the job may require a large amount of memory and can fail if that much memory is not available. This can result in pods being evicted or failing with OOM errors. If you see this happening, try the following:

  • Uncheck Use Characters.

  • Reduce the Maximum Vocab Size and the Max Ngram size.

  • Allocate more memory to the pod.

Algorithm-specific

If you train a model with the LogisticRegression algorithm, dimensionality reduction usually doesn't help, so it makes sense to leave Reduce Dimensionality unchecked. Scaling, on the other hand, tends to improve results, so it's suggested to activate Scale Features.

For models trained with the StarSpace algorithm, the opposite applies: dimensionality reduction usually yields better results and much faster model training, while scaling usually doesn't help and might make results slightly worse.