Evaluate generative AI models with an Amazon Nova rubric-based LLM judge on Amazon SageMaker AI (Part 2)

https://aws.amazon.com/blogs/machine-learning/evaluate-generative-ai-models-with-an-amazon-nova-rubric-based-llm-judge-on-amazon-sagemaker-ai-part-2/

Publish Date: 2026-02-06 11:29:00

Source Domain: aws.amazon.com

  • Amazon SageMaker AI introduced a rubric-based large language model (LLM) judge using the Amazon Nova model to evaluate the performance of generative AI systems.
  • The rubric-based judge automatically generates evaluation criteria tailored to each prompt and its task, which removes the need to hand-write static rules for every scenario.
  • Amazon SageMaker AI uses the Amazon Nova LLM judge to evaluate different LLMs across use cases ranging from model development to training data quality control and deep-dive analysis.
  • In dynamic rubric generation, the judge analyzes each prompt to derive criteria from its context, compares two outputs against those criteria, and provides a rationale for its preference.
  • Evaluation results are reported as metrics and structured YAML output that includes the generated rubrics, Likert scores, justifications, and overall preference labels.
  • The Amazon Nova LLM judge uses weighted scores and calibration checks so that its confidence is well calibrated and its decisions are consistent.
  • Use cases for Amazon Nova’s rubric-based evaluation include identifying systematic weaknesses in models and automating evaluation of large numbers of outputs without manual review.
  • Using the Amazon Nova LLM-as-a-judge requires preparing datasets, deploying models, and running evaluation jobs on SageMaker AI.
  • Results from evaluation jobs can be visualized and interpreted to show preferences, weighted scores, and the detailed justifications behind them.
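
The summary above mentions per-criterion Likert scores, weighted scores, and an overall preference label, but it does not give the judge's output schema or its aggregation formula. The following is a minimal sketch of how such values could be combined, assuming a weight-normalized mean of Likert scores. The `Criterion` fields, the `weighted_score` and `preference` helpers, and the sample numbers are all hypothetical and are not taken from the Amazon Nova output format.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One generated rubric criterion (hypothetical field names)."""
    name: str
    weight: float  # relative importance of this criterion
    score_a: int   # Likert score for response A (1 to 5)
    score_b: int   # Likert score for response B (1 to 5)


def weighted_score(criteria, which):
    """Weight-normalized mean Likert score for response 'a' or 'b'.

    Assumption: scores are combined as a weighted average. The real
    judge may aggregate differently.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    attr = "score_a" if which == "a" else "score_b"
    return sum(c.weight * getattr(c, attr) for c in criteria) / total_weight


def preference(criteria, tie_margin=0.0):
    """Return 'A', 'B', or 'tie' from the two weighted scores."""
    a = weighted_score(criteria, "a")
    b = weighted_score(criteria, "b")
    if abs(a - b) <= tie_margin:
        return "tie"
    return "A" if a > b else "B"


# Illustrative rubric for a single prompt (made-up values).
rubric = [
    Criterion("factual accuracy", 0.5, 4, 3),
    Criterion("completeness", 0.3, 3, 5),
    Criterion("clarity", 0.2, 5, 4),
]

print(round(weighted_score(rubric, "a"), 2))
print(round(weighted_score(rubric, "b"), 2))
print(preference(rubric))
```

In this made-up example, response B wins on completeness, yet the heavier weight on factual accuracy tips the overall preference toward response A. This is the kind of trade-off the per-criterion justifications are meant to expose.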