# Agent Evaluation + Regression Harness

v1.0.0 · Last updated 2/17/2026

## Overview
Prove the agent still does the job after changes. The harness has five parts: eval cases, a rubric, batch grading, a version diff, and a release gate.
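
To make the first of these pieces concrete, here is a minimal sketch of loading eval cases stored as NDJSON (one JSON object per line). The field names `id`, `input`, and `expected`, the sample cases, and the `load_cases` helper are all assumptions for illustration; this document does not define a case schema.

```python
import json

# Hypothetical eval cases, one JSON object per line (NDJSON).
# The field names "id", "input", and "expected" are assumptions.
SAMPLE_NDJSON = """\
{"id": "refund-001", "input": "I want a refund for order 1234", "expected": "refund"}
{"id": "greet-001", "input": "hello", "expected": "greeting"}
"""


def load_cases(text):
    """Parse NDJSON text into a list of dicts, skipping blank lines."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            cases.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Report the offending line so a broken case is easy to find.
            raise ValueError(f"line {line_no}: invalid JSON ({err.msg})") from err
    return cases


cases = load_cases(SAMPLE_NDJSON)
```

Keeping one case per line means a case can be added, removed, or reviewed in a diff without touching the rest of the file.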

## Outcomes

- Build eval cases (NDJSON)
- Design rubric.json + hard_fails.json
- Run batch grading
- Diff results across versions
- Gate release on pass criteria
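
The outcomes above can be sketched end to end in a few functions. This is an illustrative sketch under stated assumptions, not the harness itself: the shapes given to `rubric.json` and `hard_fails.json` (inlined here as dicts), the substring-based scoring, the `min_mean_score` threshold, and every function name are hypothetical.

```python
# Hypothetical contents of rubric.json and hard_fails.json, inlined as dicts.
RUBRIC = {
    "criteria": [
        {"name": "mentions_order", "weight": 0.5, "must_contain": "order"},
        {"name": "offers_next_step", "weight": 0.5, "must_contain": "next"},
    ]
}
HARD_FAILS = {"forbidden_substrings": ["password"]}


def grade(output, rubric=RUBRIC, hard_fails=HARD_FAILS):
    """Score one agent output; a hard fail zeroes the score regardless of rubric."""
    text = output.lower()
    if any(bad in text for bad in hard_fails["forbidden_substrings"]):
        return {"score": 0.0, "hard_fail": True}
    score = sum(c["weight"] for c in rubric["criteria"] if c["must_contain"] in text)
    return {"score": float(score), "hard_fail": False}


def grade_batch(outputs):
    """Grade a mapping of case id -> agent output."""
    return {case_id: grade(out) for case_id, out in outputs.items()}


def diff_results(old, new, tolerance=1e-9):
    """Return sorted case ids that score lower or newly hard-fail in `new`."""
    regressions = []
    for case_id in sorted(old.keys() & new.keys()):
        newly_failed = new[case_id]["hard_fail"] and not old[case_id]["hard_fail"]
        dropped = new[case_id]["score"] < old[case_id]["score"] - tolerance
        if newly_failed or dropped:
            regressions.append(case_id)
    return regressions


def release_gate(results, regressions, min_mean_score=0.8):
    """Pass only with no hard fails, no regressions, and a high enough mean score."""
    if not results or regressions:
        return False
    if any(r["hard_fail"] for r in results.values()):
        return False
    mean = sum(r["score"] for r in results.values()) / len(results)
    return mean >= min_mean_score


# Two made-up versions of the agent answering the same two cases.
baseline = grade_batch({
    "c1": "Your order shipped; next, track it here.",
    "c2": "Order found. Next step: confirm address.",
})
candidate = grade_batch({
    "c1": "Your order shipped; next, track it here.",
    "c2": "Order found.",
})
regressions = diff_results(baseline, candidate)
can_release = release_gate(candidate, regressions)
```

In this made-up run the candidate drops the next step on `c2`, so `regressions` is `["c2"]` and `can_release` is `False`. Treating hard fails separately from the weighted score keeps a single unacceptable output from being averaged away by otherwise good results.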