agent regression watch

Coding agents get nerfed.

Developers find out late. nerfed.watch turns “it feels worse” into evidence before teams absorb the damage.

first public results: withheld for v1 benchmark
v1 pack before scores
Claude Code + Codex first wave
Evidence before claims

WHY THIS EXISTS

Agent degradation is real. Public scores need better evidence.

We monitor the gap between “it feels worse” and proof strong enough to warn engineering teams.

Claude Code

Publicly acknowledged regression.

Codex

Recent reliability-drop reports.

Your team

Quality, review load, and security risk move before anyone measures.

AGENT ROSTER

First wave: high-usage coding agents.

Claude Code and Codex first. Others queue after the v1 benchmark pack is credible.

Claude Code

first wave

Release drift watch after visible quality regressions.

Codex

first wave

Same task packs, evidence policy, and caveat rules.

Grok Code

coming soon

Queued for comparable coding-agent monitoring.

Cursor Agent

coming soon

Planned for editor-agent repo-context runs.

Others

coming soon

Additional coding agents will be added after the v1 pack is credible.

Nerfed?

Catch silent quality drops before they hit production code.

Late?

Developers usually notice only after bad patches and wasted review.

Where?

Separate model drift, tool failures, and task-family regressions.

TASK FAMILIES

We monitor real engineering work, not toy prompts.

The benchmark focuses on practical workflows where a nerfed agent can damage code quality or safety.

Bug diagnosis

Root cause, not patch guessing.

Test repair

Fix stale tests without deleting signal.

Feature slices

Bounded changes across real files.

Refactors

Structure changes with locked behavior.

Tool use

Shells, tests, package managers, repo search.

Long context

Large repos without losing intent.

METHOD

Built to catch nerfs, regressions, and release drift.

The useful signal is movement: whether an agent became less reliable, less safe, slower to recover, or more expensive after a release.

01

Lock

Version, prompt, budget, tool policy.

02

Run

Same pack, same evidence rules.

03

Audit

Diffs, logs, tests, caveats.

04

Withhold

No weak public numbers.

05

Alert

Publish only credible drift.
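
The Withhold and Alert steps can be sketched as a simple gate. This is a minimal illustration, not the nerfed.watch harness: the `LockedRun` record, the `drift_decision` function, the one-sided two-proportion z-test, and both thresholds are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedRun:
    """Hypothetical record of one locked benchmark run (step 01)."""
    agent: str
    version: str
    passed: int  # tasks that passed the audit
    total: int   # valid tasks attempted


def drift_decision(before: LockedRun, after: LockedRun,
                   min_tasks: int = 50, z_threshold: float = 2.58) -> str:
    """Return 'alert' or 'withhold' for a pass-rate drop between two runs.

    Uses a one-sided two-proportion z-test. Both thresholds are
    illustrative assumptions, not published policy.
    """
    # Withhold when either run is too small to support a public number.
    if before.total < min_tasks or after.total < min_tasks:
        return "withhold"
    p_before = before.passed / before.total
    p_after = after.passed / after.total
    pooled = (before.passed + after.passed) / (before.total + after.total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / before.total + 1 / after.total))
    if se == 0:
        # Both runs passed everything or failed everything: no movement.
        return "withhold"
    z = (p_before - p_after) / se  # positive z means the newer run is worse
    return "alert" if z >= z_threshold else "withhold"
```

With these assumed defaults, a drop from 90/100 to 70/100 passes the gate, while a drop from 90/100 to 88/100 is withheld as too weak to publish.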

SIGNALS

What each alert will make visible

Quality

Pass rate, patch correctness, security-sensitive mistakes.

Reliability

Tool failures, recovery behavior, invalid-run rate.

Impact

Review burden, elapsed time, regression severity.
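
A few of these signals can be computed directly from per-task run records. The sketch below assumes a hypothetical `TaskResult` shape and a hypothetical `summarize` helper. Signals that need human judgment, such as patch correctness or review burden, are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical outcome of one task in a run."""
    passed: bool        # audited tests passed
    valid: bool         # False if the harness or environment broke the run
    tool_failures: int  # failed tool calls (shell, tests, package manager)
    seconds: float      # elapsed wall-clock time


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate a few of the signals named above over one run."""
    if not results:
        raise ValueError("no results to summarize")
    valid = [r for r in results if r.valid]
    n_valid = len(valid)

    def per_valid(total: float) -> float:
        # Average over valid tasks only, so harness breakage does not
        # masquerade as an agent regression.
        return total / n_valid if n_valid else 0.0

    return {
        "invalid_run_rate": 1 - n_valid / len(results),
        "pass_rate": per_valid(sum(r.passed for r in valid)),
        "tool_failures_per_task": per_valid(sum(r.tool_failures for r in valid)),
        "mean_seconds": per_valid(sum(r.seconds for r in valid)),
    }
```

Reporting the invalid-run rate next to the pass rate keeps a noisy run visible instead of letting it quietly drag the score down.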

REPORT FORMAT

Not a hype leaderboard — a regression alert system.

Reports will separate score movement, confidence, task-family failures, safety-sensitive mistakes, and evaluator notes. If a run is noisy or invalid, it will say so clearly.

agent        release        risk signal
Claude Code  versioned run  first wave monitoring
Codex        versioned run  first wave monitoring

Report fields: agent + version, task pack, run environment, regression movement, quality risk, evidence links, confidence + caveats.
The first public test-result update is planned within 48 hours.
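
One way to picture a report is as a single record carrying the fields listed above. The sketch below is an assumption about shape only: the `RegressionReport` class, its field names and types, and the JSON serialization are hypothetical, and no published schema is implied.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RegressionReport:
    """Hypothetical report record, one field per item listed above."""
    agent: str
    version: str
    task_pack: str
    run_environment: str
    regression_movement: float  # change in pass rate vs. the previous version
    quality_risk: str           # assumed scale: "low", "medium", "high"
    confidence: str             # assumed scale: "low", "medium", "high"
    evidence_links: list[str] = field(default_factory=list)
    caveats: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with stable key order so reports diff cleanly."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; these are not real results.
example = RegressionReport(
    agent="example-agent",
    version="0.0.0",
    task_pack="v1",
    run_environment="placeholder",
    regression_movement=0.0,
    quality_risk="low",
    confidence="low",
    caveats=["placeholder record, not a measured run"],
)
```

Keeping confidence and caveats in the same record as the score movement means a number cannot be quoted without its caveats attached.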

GUARDRAILS

Direct does not mean reckless

ROADMAP

From subscribers to live nerf monitoring

  1. Now

    Landing page, subscribe flow, methodology draft

  2. 48h

    First public test-result update for early subscribers

  3. Next

    Benchmark harness and held-out task taxonomy

  4. Ongoing

    Twice-weekly monitoring for agent nerfs and release drift

SUBSCRIBE

Get benchmark updates when the signal is credible.

Subscribe for methodology notes, v1 benchmark progress, agent regression alerts, and future public result drops. No noisy leaderboard spam.
