Know when your agent broke, not just if it's smart.

Automated quality baselines for AI agent execution. Run cohorts, measure drift, catch regressions before your users do.

runcaliper run --cohort baseline-004
Running 12 agent executions against baseline...
Scoring: consistency, correctness, completeness
 
✓ PASS onboarding-flow — 98.2% consistency (baseline: 97.1%)
✓ PASS task-creation — 99.4% correctness (baseline: 99.0%)
✗ REGR email-generation — 81.3% completeness (baseline: 94.7%) ↓ 13.4 pts
✓ PASS landing-page-gen — 96.8% consistency (baseline: 95.2%)
 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1 regression detected. Report saved → run-004-20260507.json

Cohort Baselines

Define what "good" looks like by running identical inputs through your agent repeatedly. Establish statistical baselines, not vibes.
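
In Python terms, the idea looks roughly like this (a sketch, not RunCaliper's internals; `agent` and `score_run` stand in for your agent call and your scorer):

# Sketch only: a cohort baseline as a distribution, not a single score.
from statistics import mean, stdev

DIMENSIONS = ("consistency", "correctness", "completeness")

def build_baseline(agent, score_run, task_input, runs=12):
    """Score `runs` executions of one identical input, then summarize
    each dimension as a mean and standard deviation."""
    scores = [score_run(agent(task_input)) for _ in range(runs)]
    return {
        dim: {
            "mean": mean(s[dim] for s in scores),
            "stdev": stdev([s[dim] for s in scores]),
        }
        for dim in DIMENSIONS
    }

The standard deviation matters as much as the mean: it tells you how much run-to-run noise is normal, which is what separates a real regression from an unlucky sample.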

Regression Detection

Automatic comparison against historical baselines. Know the moment quality drifts, with exact attribution to which dimension degraded.
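
One plausible shape for that check, continuing the sketch above (the three-sigma threshold is our assumption, not RunCaliper's actual rule):

# Sketch only: flag a dimension when its new mean falls more than
# `k` baseline standard deviations below the baseline mean.
def detect_regressions(baseline, current_means, k=3.0):
    regressions = {}
    for dim, stats in baseline.items():
        floor = stats["mean"] - k * stats["stdev"]
        if current_means[dim] < floor:
            regressions[dim] = {
                "baseline": round(stats["mean"], 3),
                "current": round(current_means[dim], 3),
                "drop": round(stats["mean"] - current_means[dim], 3),
            }
    return regressions

Because each dimension is tested against its own baseline, the report can name exactly which one moved, like the 13.4-point completeness drop in the run above.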

Multi-Dimensional Scoring

Consistency, correctness, and completeness measured independently. An agent can be correct but inconsistent. You need to know both.
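
"Measured independently" means each scorer reads the same cohort of outputs but answers a different question. A self-contained sketch with deliberately naive placeholder metrics (swap in rubric grading, embedding similarity, or schema checks for real use):

# Sketch only: three independent scorers over one cohort of outputs.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Scores:
    consistency: float   # do repeated runs agree with each other?
    correctness: float   # do outputs match the expected result?
    completeness: float  # do outputs cover every required field?

def score_cohort(outputs: list[dict], expected: dict) -> Scores:
    pairs = list(combinations(outputs, 2))
    consistency = sum(a == b for a, b in pairs) / len(pairs)
    correctness = sum(o == expected for o in outputs) / len(outputs)
    fields = list(expected)
    completeness = sum(
        sum(f in o for f in fields) / len(fields) for o in outputs
    ) / len(outputs)
    return Scores(consistency, correctness, completeness)

Note the placeholder's limitation: exact-match correctness forces consistency along with it. Real scorers decouple the two, which is exactly when "correct but inconsistent" shows up.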

Automated Harness

Schedule cohort runs on any cadence. No manual QA, no subjective reviews. The harness runs, scores, and reports autonomously.
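
The glue can be as thin as a cron job around the documented CLI call. A sketch (the assumption that the process exits non-zero on regression is ours, not documented behavior):

# Sketch only: cron/CI wrapper around the `runcaliper run` command
# shown above. Assumes, unverified, a non-zero exit code on regression
# so the scheduler or CI pipeline surfaces the failure.
import subprocess
import sys

def run_cohort(cohort: str) -> int:
    return subprocess.run(["runcaliper", "run", "--cohort", cohort]).returncode

if __name__ == "__main__":
    # Example nightly cron entry:
    #   0 2 * * * /usr/bin/python3 /opt/qa/harness.py
    sys.exit(run_cohort("baseline-004"))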

37% performance drop from lab to production
8/10 benchmarks with validity failures
38% of tasks passed by do-nothing agents

Current evals are broken. Baselines fix them.

Static benchmarks decay. One-shot evals miss drift. RunCaliper gives you the continuous measurement your agents need to stay reliable.