Evaluate generative AI models with an Amazon Nova rubric-based LLM judge on Amazon SageMaker AI (Part 2)
Publish Date: 2026-02-06 11:29:00
Source Domain: aws.amazon.com
- Amazon SageMaker AI introduced a rubric-based large language model (LLM) judge, built on Amazon Nova, for evaluating the performance of generative AI systems.
- The rubric-based judge automatically generates evaluation criteria tailored to each prompt and the task at hand, which removes the need to hand-write static rules for every scenario.
- On Amazon SageMaker AI, the Amazon Nova LLM judge can compare different LLMs across use cases ranging from model development to training-data quality control and deep-dive analyses.
- In dynamic rubric generation, the judge analyzes each prompt, derives criteria from its context, compares two outputs against those criteria, and gives a rationale for its preference.
- The Amazon Nova rubric-based judge reports results as several metrics plus structured YAML output containing the generated rubrics, Likert scores, justifications, and an overall preference label.
- The Amazon Nova LLM judge uses weighted scores and calibration checks to keep its confidence well calibrated and its decisions consistent.
- Use cases for Amazon Nova’s rubric-based evaluation include identifying systematic weaknesses in models and automating the evaluation of large volumes of outputs without manual review.
- To use Amazon Nova’s LLM-as-a-judge, you prepare datasets, deploy models, and run evaluation jobs on SageMaker AI.
- Evaluation job results can then be visualized and interpreted, showing preference labels, weighted scores, and detailed justifications behind the model’s performance metrics.
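The workflow above starts with dataset preparation. The minimal sketch below shows one way to write and validate a pairwise evaluation dataset as JSON Lines. It assumes each record holds a prompt and two candidate responses under the keys `prompt`, `response_A`, and `response_B`. Treat those field names, the file name, and the helper functions as assumptions and confirm the exact schema in the SageMaker AI documentation before running a real job.

```python
import json
import tempfile
from pathlib import Path

# Assumed field names for one pairwise comparison; confirm against the SageMaker AI docs.
REQUIRED_KEYS = ("prompt", "response_A", "response_B")


def write_pairwise_dataset(records, path):
    """Write prompt/response pairs as JSON Lines, one object per line (hypothetical helper)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for i, rec in enumerate(records):
            # Reject records with a missing or empty field before any job sees them.
            missing = [k for k in REQUIRED_KEYS if not rec.get(k)]
            if missing:
                raise ValueError(f"record {i} is missing {missing}")
            row = {k: rec[k] for k in REQUIRED_KEYS}
            # ensure_ascii=False keeps non-ASCII text human-readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


def read_pairwise_dataset(path):
    """Read the JSON Lines file back, skipping blank lines (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage: two hypothetical comparisons written to a temporary file.
pairs = [
    {"prompt": "Summarize the refund policy.",
     "response_A": "Refunds are issued within 30 days of purchase.",
     "response_B": "You can get money back."},
    {"prompt": "Explain what an LLM judge does.",
     "response_A": "It scores model outputs against criteria.",
     "response_B": "It compares two answers and explains which is better."},
]
out_path = write_pairwise_dataset(pairs, Path(tempfile.gettempdir()) / "nova_judge_pairs.jsonl")
loaded = read_pairwise_dataset(out_path)
```

Validating fields locally is cheap and catches empty responses before they reach a paid evaluation job.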
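The bullets above describe the judge generating prompt-specific criteria, scoring both outputs with Likert scores, and combining them into weighted scores and a preference. The following is a minimal Python sketch of how such a weighted score and preference label could be computed. The `Criterion` class, the weight-normalized average, and the `tie_margin` threshold are illustrative assumptions, not the Amazon Nova judge's documented formula.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One hypothetical rubric criterion with a relative weight and a Likert score per response."""
    name: str
    weight: float
    score_a: int  # Likert score (for example 1-5) assigned to response A
    score_b: int  # Likert score assigned to response B


def weighted_score(criteria, attr):
    """Weight-normalized average of the Likert scores stored in `attr`."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criterion weights must sum to a positive value")
    return sum(c.weight * getattr(c, attr) for c in criteria) / total_weight


def preference(criteria, tie_margin=0.25):
    """Return (label, score_a, score_b); label is 'A', 'B', or 'tie'."""
    a = weighted_score(criteria, "score_a")
    b = weighted_score(criteria, "score_b")
    # Assumption: small gaps count as ties rather than forcing a preference.
    if abs(a - b) <= tie_margin:
        return "tie", a, b
    return ("A" if a > b else "B"), a, b


# Usage: a hypothetical three-criterion rubric for a single prompt.
rubric = [
    Criterion("factual accuracy", weight=5, score_a=5, score_b=3),
    Criterion("completeness", weight=3, score_a=4, score_b=4),
    Criterion("clarity", weight=2, score_a=3, score_b=5),
]
result = preference(rubric)
print(result)  # ('A', 4.3, 3.7)
```

Normalizing by the total weight keeps the weighted score on the same scale as the Likert scores, so scores from rubrics with different numbers of criteria stay comparable.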
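When interpreting results across many prompts, per-sample preference labels are usually rolled up into win rates. The minimal sketch below assumes the labels have already been extracted as the strings `'A'`, `'B'`, or `'tie'` (an assumed encoding, not necessarily the judge's output format). It reports win and tie rates plus a Wilson score interval for A's share of the non-tied comparisons, which helps show whether a preference holds up beyond sampling noise.

```python
import math
from collections import Counter


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize_preferences(labels):
    """Turn per-sample preference labels ('A', 'B', 'tie') into summary rates."""
    counts = Counter(labels)
    unknown = set(counts) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no judgments to summarize")
    decisive = counts["A"] + counts["B"]
    return {
        "n": n,
        "win_rate_A": counts["A"] / n,
        "win_rate_B": counts["B"] / n,
        "tie_rate": counts["tie"] / n,
        # Share of non-tied comparisons that A won, with a confidence interval.
        "a_share_of_decisive": counts["A"] / decisive if decisive else None,
        "a_share_ci": wilson_interval(counts["A"], decisive) if decisive else None,
    }


# Usage: hypothetical labels from eight judged prompts.
labels = ["A", "A", "B", "tie", "A", "B", "A", "tie"]
summary = summarize_preferences(labels)
print(summary["win_rate_A"], summary["tie_rate"])  # 0.5 0.25
```

With only six decisive comparisons the interval is wide, which is a useful reminder to judge enough prompts before concluding that one model is preferred.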