Claude Code
Publicly acknowledged regression.
Developers find out late. nerfed.watch turns “it feels worse” into evidence before teams absorb the damage.
WHY THIS EXISTS
We monitor the gap between “it feels worse” and proof strong enough to warn engineering teams.
Publicly acknowledged regression.
Recent reliability-drop reports.
Quality, review load, and security risk shift before anyone measures them.
AGENT ROSTER
Claude Code and Codex first. Others queue after the v1 benchmark pack is credible.
Release drift watch after visible quality regressions.
Same task packs, evidence policy, and caveat rules.
Queued for comparable coding-agent monitoring.
Planned for editor-agent repo-context runs.
Additional coding agents will be added after the v1 pack is credible.
Catch silent quality drops before they hit production code.
Developers usually notice only after bad patches and wasted review.
Separate model drift, tool failures, and task-family regressions.
TASK FAMILIES
The benchmark focuses on practical workflows where a nerfed agent can damage code quality or safety.
Root cause, not patch guessing.
Fix stale tests without deleting signal.
Bounded changes across real files.
Structure changes with locked behavior.
Shells, tests, package managers, repo search.
Large repos without losing intent.
METHOD
The useful signal is movement: whether an agent became less reliable, less safe, slower to recover, or more expensive after a release.
Version, prompt, budget, tool policy.
Same pack, same evidence rules.
Diffs, logs, tests, caveats.
No public numbers from weak or noisy runs.
Publish only credible drift.
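As a rough illustration of what "credible drift" could mean in practice, the Python sketch below compares pass rates on the same task pack before and after a release, using a standard two-proportion z-test. Everything in it is an assumption made for illustration: the RunConfig fields simply mirror the pinned items listed above, and the names drift_z and is_credible_drift, the 2.58 threshold, and the 30-task minimum are hypothetical, not the project's published method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of what is held fixed between two runs."""
    agent: str
    version: str
    prompt_id: str
    budget_tokens: int
    tool_policy: str


def drift_z(base_pass: int, base_n: int, cand_pass: int, cand_n: int) -> float:
    """Two-proportion z-score; negative means the candidate passed less often."""
    p_base = base_pass / base_n
    p_cand = cand_pass / cand_n
    pooled = (base_pass + cand_pass) / (base_n + cand_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cand_n))
    if se == 0:
        # Both runs passed everything or failed everything: no measurable movement.
        return 0.0
    return (p_cand - p_base) / se


def is_credible_drift(base_pass: int, base_n: int, cand_pass: int, cand_n: int,
                      z_threshold: float = 2.58, min_tasks: int = 30) -> bool:
    """Flag a regression only if both runs are large enough and the drop is strong."""
    if min(base_n, cand_n) < min_tasks:
        return False  # too few tasks: treat as a weak number and do not publish
    return drift_z(base_pass, base_n, cand_pass, cand_n) <= -z_threshold
```

With these assumed defaults, a drop from 90/100 to 70/100 passing tasks would be flagged, while 90/100 to 88/100, or any comparison on only 10 tasks, would not. A real harness would also need to confirm that the two RunConfig records differ only in version before comparing them.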
SIGNALS
REPORT FORMAT
Reports will separate score movement, confidence, task-family failures, safety-sensitive mistakes, and evaluator notes. If a run is noisy or invalid, the report will say so clearly.
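One way to picture that separation is as a small record with one field per report component. The Python sketch below is a guess at a possible shape, not the actual report schema: DriftReport, its field names, the status values, and the render function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DriftReport:
    """Hypothetical report record; one field per component named above."""
    agent: str
    score_delta: float                      # score movement, in points
    confidence: str                         # e.g. "high", "medium", "low"
    task_family_failures: dict[str, int] = field(default_factory=dict)
    safety_mistakes: list[str] = field(default_factory=list)
    evaluator_notes: str = ""
    status: str = "valid"                   # "valid", "noisy", or "invalid"


def render(report: DriftReport) -> str:
    """Render plain text, putting any noisy/invalid caveat on the first line."""
    lines = []
    if report.status != "valid":
        lines.append(
            f"CAVEAT: this run is {report.status}; "
            "do not read the numbers below as drift."
        )
    lines.append(f"Agent: {report.agent}")
    lines.append(
        f"Score movement: {report.score_delta:+.1f} points "
        f"(confidence: {report.confidence})"
    )
    for family, count in sorted(report.task_family_failures.items()):
        lines.append(f"Task-family failures: {family}: {count}")
    lines.append(f"Safety-sensitive mistakes: {len(report.safety_mistakes)}")
    for mistake in report.safety_mistakes:
        lines.append(f"  - {mistake}")
    if report.evaluator_notes:
        lines.append(f"Evaluator notes: {report.evaluator_notes}")
    return "\n".join(lines)
```

Putting the caveat first is a design choice in this sketch: a reader who sees only the top line still learns that the run should not be treated as evidence.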
GUARDRAILS
ROADMAP
Landing page, subscribe flow, methodology draft
First public test-result update for early subscribers
Benchmark harness and held-out task taxonomy
Twice-weekly monitoring for agent nerfs and release drift