GradeRig is a quality baseline harness for AI agents. Run scenario cohorts, compare against baselines, catch quality drift before your users catch it for you.
Define scenario cohorts that represent your agent's critical paths. Capture the full trace: tool calls, reasoning, outputs. Freeze a known-good run as your baseline.
Run the same cohort against new model versions, prompt changes, or infra updates. GradeRig grades each run against the baseline and surfaces regressions automatically.
Not just final outputs. Every tool call, every reasoning step, every decision branch. When quality drifts, you see exactly where and why the agent diverged.
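For a sense of what a trace step carries, here is an illustrative shape in Python. The field names are assumptions made for this sketch, not GradeRig's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TraceStep:
    """One step in a captured run (illustrative shape, not GradeRig's schema)."""
    step: int                                    # position in the run
    kind: str                                    # "tool_call", "reasoning", or "output"
    name: str | None = None                      # tool name, if this step is a tool call
    arguments: dict[str, Any] = field(default_factory=dict)  # tool-call arguments
    content: str = ""                            # reasoning text or final output
```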
LLM outputs vary by design. GradeRig uses semantic similarity and structural matching instead of exact comparison. It knows the difference between variation and regression.
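A minimal sketch of that idea: score text with embedding cosine similarity instead of string equality, and compare tool-call structure separately. The model name, threshold, and helper functions are illustrative assumptions, not GradeRig's implementation.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative threshold: below this, textual drift is treated as a regression.
SIMILARITY_FLOOR = 0.85  # assumption for this sketch, not a GradeRig default

_model = SentenceTransformer("all-MiniLM-L6-v2")

def output_matches(baseline_text: str, candidate_text: str) -> bool:
    """Semantic comparison: tolerate rephrasing, flag changes in meaning."""
    baseline_emb, candidate_emb = _model.encode([baseline_text, candidate_text])
    return float(util.cos_sim(baseline_emb, candidate_emb)) >= SIMILARITY_FLOOR

def structure_matches(baseline_calls: list[dict], candidate_calls: list[dict]) -> bool:
    """Structural comparison: same tools invoked in the same order."""
    return [c["tool"] for c in baseline_calls] == [c["tool"] for c in candidate_calls]
```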
Write scenario files that describe what your agent should do. Inputs, expected tool calls, quality criteria. YAML or JSON, version-controlled alongside your agent code.
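A scenario entry might look something like this; the keys are illustrative, not a prescribed schema.

```yaml
# refund_flow.yaml -- illustrative field names, not a prescribed schema
scenario: refund-request
input: "I was double-charged for my order, please refund one of the charges."
expected_tool_calls:
  - lookup_order
  - issue_refund
quality_criteria:
  - acknowledges the duplicate charge
  - confirms the refund amount before issuing it
```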
Run your scenarios once with your current production config. GradeRig captures the full execution trace and freezes it as your quality baseline. This is your known-good state.
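Conceptually, freezing a baseline is just running the cohort once and persisting the traces. The sketch below shows that idea in plain Python, with `run_agent` standing in for your own agent entry point; it is an assumption for illustration, not a GradeRig API.

```python
import json
from pathlib import Path

def freeze_baseline(scenarios: list[dict], run_agent, out_path: str = "baseline.json") -> None:
    """Run each scenario once and persist the traces as the known-good state.

    `run_agent` is assumed to take a scenario dict and return its trace
    (a list of step dicts); it stands in for your production agent config.
    """
    baseline = {s["scenario"]: run_agent(s) for s in scenarios}
    Path(out_path).write_text(json.dumps(baseline, indent=2))
```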
Before shipping a model swap, prompt edit, or tool update, run the cohort again. GradeRig grades the new run against the baseline and tells you exactly what regressed, what improved, and what held steady.
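You can picture the grading step as a per-scenario diff against that frozen file. This is a conceptual outline, not GradeRig's grading logic; `score_trace` is a hypothetical stand-in for whatever quality scoring you apply to a trace.

```python
import json
from pathlib import Path
from typing import Callable

def grade_run(baseline_path: str, candidate_path: str,
              score_trace: Callable[[list[dict]], float],
              tolerance: float = 0.05) -> dict[str, list[str]]:
    """Bucket each scenario as regressed, improved, or held steady.

    `score_trace` is assumed to map a trace to a quality score in [0, 1],
    for example the fraction of quality criteria it satisfies.
    """
    baseline = json.loads(Path(baseline_path).read_text())
    candidate = json.loads(Path(candidate_path).read_text())
    report: dict[str, list[str]] = {"regressed": [], "improved": [], "held_steady": []}
    for name, base_trace in baseline.items():
        delta = score_trace(candidate[name]) - score_trace(base_trace)
        bucket = ("held_steady" if abs(delta) <= tolerance
                  else "improved" if delta > 0 else "regressed")
        report[bucket].append(name)
    return report
```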
GradeRig replaces faith with measurement. Know exactly what your agents do, how well they do it, and when that changes.