agent regression watch

Coding agents get nerfed.

Developers find out late. nerfed.watch turns “it feels worse” into evidence before teams absorb the damage.

┌─ NERFED.WATCH ──────────────┐
│ AGENT QUALITY CAN REGRESS   │
│ ALERTS > VIBES > BENCHMARKS │
└─────────────────────────────┘

First public results: withheld for v1 benchmark

v1 pack before scores · Claude Code + Codex first wave · Evidence before claims

WHY THIS EXISTS

Agent degradation is real. Public scores need better evidence.

We monitor the gap between “it feels worse” and proof strong enough to warn engineering teams.

Claude Code

Publicly acknowledged regression.

Codex

Recent reliability-drop reports.

Your team

Quality, review load, and security risk shift before anyone measures them.

AGENT ROSTER

First wave: high-usage coding agents.

Claude Code and Codex first. Others queue after the v1 benchmark pack is credible.

Claude Code

first wave

Release drift watch after visible quality regressions.

Codex

first wave

Same task packs, evidence policy, and caveat rules.

Grok Code

coming soon

Queued for comparable coding-agent monitoring.

Cursor Agent

coming soon

Planned for editor-agent repo-context runs.

Others

coming soon

Additional coding agents will be added after the v1 pack is credible.

Nerfed?

Catch silent quality drops before they hit production code.

Late?

Developers usually notice only after bad patches and wasted review.

Where?

Separate model drift, tool failures, and task-family regressions.

TASK FAMILIES

We monitor real engineering work, not toy prompts.

The benchmark focuses on practical workflows where a nerfed agent can damage code quality or safety.

Bug diagnosis

Root cause, not patch guessing.

Test repair

Fix stale tests without deleting signal.

Feature slices

Bounded changes across real files.

Refactors

Structure changes with locked behavior.

Tool use

Shells, tests, package managers, repo search.

Long context

Large repos without losing intent.

METHOD

Built to catch nerfs, regressions, and release drift.

The useful signal is movement: whether an agent became less reliable, less safe, slower to recover, or more expensive after a release.

Lock

Version, prompt, budget, tool policy.

Run

Same pack, same evidence rules.

Audit

Diffs, logs, tests, caveats.

Withhold

No weak public numbers.

Alert

Publish only credible drift.
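
As a rough illustration of the Lock step, the sketch below pins the four locked items (version, prompt, budget, tool policy) in an immutable record and checks that two runs differ only by version. The class name, field names, and fingerprint scheme are hypothetical and are not taken from the nerfed.watch harness.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class LockedRun:
    """Settings pinned before a run. All field names are hypothetical."""
    agent: str
    version: str
    prompt: str
    budget_tokens: int
    tool_policy: tuple  # e.g. ("shell", "tests")

    def fingerprint(self) -> str:
        # Canonical JSON so identical settings always hash to the same value.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(a: LockedRun, b: LockedRun) -> bool:
    """Two runs are comparable only if nothing except the version changed."""
    da, db = asdict(a), asdict(b)
    da.pop("version")
    db.pop("version")
    return da == db
```

Storing the fingerprint next to each result would make a silent prompt or budget change visible when two releases are compared.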

SIGNALS

What each alert will make visible

Quality

Pass rate · Patch correctness · Security-sensitive mistakes

Reliability

Tool failures · Recovery behavior · Invalid-run rate

Impact

Review burden · Elapsed time · Regression severity
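
Some of the signals above reduce to simple arithmetic over per-run records. The sketch below computes pass rate, invalid-run rate, tool failures per run, and mean elapsed time. The record shape and field names are assumptions made for illustration, and invalid runs are excluded from the pass rate rather than counted as failures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    """One task attempt. Field names are hypothetical."""
    task_family: str
    valid: bool          # False if the harness or environment broke
    passed: bool         # only meaningful when valid is True
    tool_failures: int
    elapsed_s: float


def summarize(records: List[RunRecord]) -> dict:
    """Aggregate signals; a value is None when there is nothing to average."""
    total = len(records)
    valid = [r for r in records if r.valid]
    n = len(valid)
    return {
        "invalid_run_rate": (total - n) / total if total else None,
        "pass_rate": sum(r.passed for r in valid) / n if n else None,
        "tool_failures_per_run": sum(r.tool_failures for r in valid) / n if n else None,
        "mean_elapsed_s": sum(r.elapsed_s for r in valid) / n if n else None,
    }
```

Reporting the invalid-run rate separately keeps a broken environment from showing up as a drop in agent quality.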

REPORT FORMAT

Not a hype leaderboard — a regression alert system.

Reports will separate score movement, confidence, task-family failures, safety-sensitive mistakes, and evaluator notes. If a run is noisy or invalid, it will say so clearly.

agent        release        risk signal
Claude Code  versioned run  first wave monitoring
Codex        versioned run  first wave monitoring

Agent + version · Task pack · Run environment · Regression movement · Quality risk · Evidence links · Confidence + caveats
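
One way to picture a report is as a single structured record carrying the fields listed above. The sketch below is an assumption about shape only: the class name, field types, and the "inconclusive" default are hypothetical, and no real report schema has been published.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class RegressionReport:
    """Hypothetical record mirroring the report fields listed above."""
    agent: str
    version: str
    task_pack: str
    run_environment: str
    regression_movement: str
    quality_risk: str
    evidence_links: List[str] = field(default_factory=list)
    # Assumed default: a report claims nothing until evidence supports it.
    confidence: str = "inconclusive"
    caveats: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)
```

Defaulting confidence to "inconclusive" means a report has to be explicitly upgraded before it can read as a claim.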

The first public test-result update is planned within 48 hours.

GUARDRAILS

Direct does not mean reckless

No panic headline without evidence.
No hidden model or version swaps.
No winner declared from a single noisy run.
Invalid or inconclusive runs are labeled, not buried.
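
The guardrails above could be enforced by a gate that refuses to call a regression from too few runs. The sketch below uses a pooled two-proportion z-test on pass counts. The test itself, the minimum run count, and the threshold are all assumptions chosen for illustration, not the published nerfed.watch policy.

```python
import math


def drift_verdict(base_pass: int, base_n: int, new_pass: int, new_n: int,
                  min_runs: int = 30, z_threshold: float = 2.0) -> str:
    """Return 'regression', 'no drift', or 'inconclusive'.

    min_runs and z_threshold are illustrative defaults, not real policy.
    """
    # Guardrail: no verdict from a small or single noisy sample.
    if base_n < min_runs or new_n < min_runs:
        return "inconclusive"
    p_base = base_pass / base_n
    p_new = new_pass / new_n
    pooled = (base_pass + new_pass) / (base_n + new_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if se == 0:
        # Every run passed or every run failed in both samples: no movement.
        return "no drift"
    z = (p_base - p_new) / se  # positive when the new release passes less often
    return "regression" if z >= z_threshold else "no drift"
```

For example, `drift_verdict(9, 10, 5, 10)` returns `"inconclusive"` because ten runs per side is below the assumed minimum, however large the gap looks.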

ROADMAP

From subscribers to live nerf monitoring

Now
Landing page, subscribe flow, methodology draft
48h
First public test-result update for early subscribers
Next
Benchmark harness and held-out task taxonomy
Ongoing
Twice-weekly monitoring for agent nerfs and release drift

Get benchmark updates when the signal is credible.