Baseline. Measure. Grade.

Your agents ship fast.
Who checks if they still work?

GradeRig is a quality baseline harness for AI agents. Run cohort scenarios, compare against baselines, catch quality drift before your users catch it for you.

$ graderig run --cohort onboarding-v3 --baseline 2026-05-01
────────────────────────────────────────────────
Running 12 scenarios against baseline...

PASS   email_welcome_flow ............ 98.2% match
PASS   task_creation_mvp ............. 96.7% match
DRIFT  landing_page_quality .......... 84.1% match (-11.3%)
PASS   competitor_research ........... 95.4% match
FAIL   cold_outreach_tone ............ 62.3% match (-33.1%)
PASS   tool_call_accuracy ............ 99.1% match
────────────────────────────────────────────────
Cohort score: 89.3% (baseline: 94.7%)
2 regressions detected. Report saved.

Cohort Baselines

Define scenario cohorts that represent your agent's critical paths. Capture the full trace: tool calls, reasoning, outputs. Freeze a known-good run as your baseline.

Drift Detection

Run the same cohort against new model versions, prompt changes, or infra updates. GradeRig grades each run against the baseline and surfaces regressions automatically.
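To make the idea concrete, here is a minimal sketch of how a scenario could be labeled PASS, DRIFT, or FAIL from its drop below the baseline. The `classify` function and both thresholds are assumptions for illustration, not GradeRig's actual rules, and the scores are illustrative values.

```python
# Hypothetical thresholds, in percentage points below the baseline score.
DRIFT_DROP = 5.0
FAIL_DROP = 20.0


def classify(score: float, baseline: float) -> str:
    """Label a scenario by how far its score fell below the baseline."""
    drop = baseline - score
    if drop >= FAIL_DROP:
        return "FAIL"
    if drop >= DRIFT_DROP:
        return "DRIFT"
    return "PASS"


# Illustrative values only: (scenario, current score, baseline score).
results = [
    ("landing_page_quality", 84.1, 95.4),
    ("cold_outreach_tone", 62.3, 95.4),
    ("tool_call_accuracy", 99.1, 99.0),
]

for name, score, baseline in results:
    print(f"{classify(score, baseline):5} {name} ({score - baseline:+.1f})")
```

Two thresholds rather than one keep a mild slide (worth a look) separate from a hard break (worth blocking a release).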

Full Trace Capture

Not just final outputs. Every tool call, every reasoning step, every decision branch. When quality drifts, you see exactly where and why the agent diverged.
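One way to picture "where the agent diverged" is a step-by-step comparison of two traces. The sketch below assumes a trace is a plain list of `(kind, detail)` pairs; that shape, the `first_divergence` helper, and the step names are hypothetical, not GradeRig's trace format.

```python
from typing import Optional

# Assumed trace shape: (kind, detail), e.g. ("tool_call", "lookup_contact").
Step = tuple[str, str]


def first_divergence(baseline: list[Step], current: list[Step]) -> Optional[int]:
    """Return the index of the first step that differs, or None if identical."""
    for i, (old, new) in enumerate(zip(baseline, current)):
        if old != new:
            return i
    # One trace is a prefix of the other: they diverge where the shorter ends.
    if len(baseline) != len(current):
        return min(len(baseline), len(current))
    return None


baseline_trace = [
    ("reasoning", "plan outreach"),
    ("tool_call", "lookup_contact"),
    ("tool_call", "draft_email"),
    ("output", "email ready for review"),
]
current_trace = [
    ("reasoning", "plan outreach"),
    ("tool_call", "lookup_contact"),
    ("tool_call", "send_email"),  # the agent skipped drafting
    ("output", "email sent"),
]

print(first_divergence(baseline_trace, current_trace))
```

An exact step comparison like this is the simplest case; a real grader would likely need fuzzier matching on the reasoning steps.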

Non-Determinism Aware

LLM outputs vary by design. GradeRig uses semantic similarity and structural matching instead of exact comparison. It knows the difference between variation and regression.
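A rough sketch of the two comparison styles, using only the Python standard library. `structural_match` compares tool-call sequences with `difflib.SequenceMatcher`, and `token_overlap` is a crude word-overlap stand-in for semantic similarity, which in practice would more likely use embeddings. Both function names are hypothetical and neither is GradeRig's actual scoring method.

```python
from difflib import SequenceMatcher


def structural_match(baseline_calls: list[str], current_calls: list[str]) -> float:
    """Ratio in [0, 1] of how closely two tool-call sequences line up."""
    return SequenceMatcher(None, baseline_calls, current_calls).ratio()


def token_overlap(baseline_text: str, current_text: str) -> float:
    """Jaccard overlap of lowercase word sets (a stand-in for semantic similarity)."""
    a = set(baseline_text.lower().split())
    b = set(current_text.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


old_output = "Welcome aboard Sam, here is your setup link"
new_output = "welcome aboard Sam, here is your setup link"

# Exact comparison fails on a capital letter; the overlap score does not.
print(old_output == new_output)
print(token_overlap(old_output, new_output))

# A dropped tool call lowers the structural score without zeroing it.
print(structural_match(["search", "draft", "send"], ["search", "send"]))
```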

How it works

Define your scenarios

Write scenario files that describe what your agent should do. Inputs, expected tool calls, quality criteria. YAML or JSON, version-controlled alongside your agent code.
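For a feel of what such a file might hold, here is a sketch that parses a JSON scenario and checks it for the three ingredients named above. Every field name, the tool names, and the `load_scenario` helper are assumptions for illustration; GradeRig's real schema may differ.

```python
import json

# Hypothetical scenario file contents; field names are illustrative only.
SCENARIO_JSON = """
{
  "name": "email_welcome_flow",
  "cohort": "onboarding-v3",
  "input": "A new user signed up. Send the welcome email.",
  "expected_tool_calls": ["lookup_user", "send_email"],
  "quality_criteria": ["greets the user by name", "includes a setup link"]
}
"""

REQUIRED_FIELDS = {"name", "cohort", "input", "expected_tool_calls", "quality_criteria"}


def load_scenario(text: str) -> dict:
    """Parse a scenario and reject it if a required field is missing."""
    scenario = json.loads(text)
    missing = REQUIRED_FIELDS - set(scenario)
    if missing:
        raise ValueError(f"scenario is missing fields: {sorted(missing)}")
    return scenario


scenario = load_scenario(SCENARIO_JSON)
print(scenario["name"], scenario["expected_tool_calls"])
```

Validating at load time means a malformed scenario fails before any agent run is spent on it.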

Capture a baseline

Run your scenarios once with your current production config. GradeRig captures the full execution trace and freezes it as your quality baseline. This is your known-good state.
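"Freezing" can be pictured as storing the traces alongside a content hash, so any later edit to the baseline is detectable. This is a minimal sketch under that assumption; `freeze_baseline`, `is_intact`, and the trace layout are hypothetical and not GradeRig's storage format.

```python
import hashlib
import json


def _digest(traces: dict) -> str:
    # Sorted keys give a stable serialization, so equal traces hash equally.
    payload = json.dumps(traces, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def freeze_baseline(cohort: str, traces: dict) -> dict:
    """Bundle the traces with a SHA-256 of their serialized content."""
    return {"cohort": cohort, "sha256": _digest(traces), "traces": traces}


def is_intact(baseline: dict) -> bool:
    """Recompute the hash and compare it with the stored one."""
    return _digest(baseline["traces"]) == baseline["sha256"]


baseline = freeze_baseline(
    "onboarding-v3",
    {"email_welcome_flow": [{"step": "tool_call", "name": "send_email"}]},
)
print(is_intact(baseline))

# Tampering with a frozen trace is caught on the next check.
baseline["traces"]["email_welcome_flow"][0]["name"] = "send_sms"
print(is_intact(baseline))
```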

Grade every change

Before shipping a model swap, prompt edit, or tool update, run the cohort again. GradeRig grades the new run against the baseline and tells you exactly what regressed, what improved, and what held steady.
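The three-way verdict can be sketched as a per-scenario comparison with a tolerance band, so small run-to-run wobble counts as "held steady". The `bucket_changes` function, the 2-point tolerance, and the scores are assumptions for illustration, not GradeRig's grading logic.

```python
def bucket_changes(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 2.0,  # hypothetical noise band, in percentage points
) -> dict[str, list[str]]:
    """Sort scenarios into regressed, improved, and held_steady."""
    report: dict[str, list[str]] = {"regressed": [], "improved": [], "held_steady": []}
    for name, old_score in baseline.items():
        delta = current[name] - old_score
        if delta < -tolerance:
            report["regressed"].append(name)
        elif delta > tolerance:
            report["improved"].append(name)
        else:
            report["held_steady"].append(name)
    return report


# Illustrative scores only.
baseline_scores = {"cold_outreach_tone": 95.4, "task_creation_mvp": 90.0, "tool_call_accuracy": 99.0}
current_scores = {"cold_outreach_tone": 62.3, "task_creation_mvp": 96.7, "tool_call_accuracy": 99.1}

report = bucket_changes(baseline_scores, current_scores)
print(report)

# A pre-ship gate could simply refuse to proceed when anything regressed.
ok_to_ship = not report["regressed"]
print(ok_to_ship)
```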

Agents that ship without baselines ship on faith alone.

GradeRig replaces faith with measurement. Know exactly what your agents do, how well they do it, and when that changes.
