Overview
Anthropic’s April 23 postmortem attributes a month of Claude Code quality complaints to three independent product-layer changes, not to the API or inference fleet. It’s not a capacity or region outage, but the failure modes — silent default changes, an off-by-N caching bug, and a single system-prompt line causing a 3% eval drop — are the LLM analogue of classic SRE failure patterns. Anyone building on shared model infrastructure should read it twice.
All three issues hit Claude Code, the Claude Agent SDK, and Claude Cowork. The Messages API was untouched. That the signal stayed muddy for six weeks is the bigger story.
1. Default reasoning effort: high → medium (Mar 4)
When Opus 4.6 shipped in Claude Code, the default effort was high. Tail-latency complaints (the UI appearing frozen) accumulated, and Anthropic’s internal evals put medium at a better operating point on the latency-vs-intelligence curve:
“In our internal evals and testing, medium effort achieved slightly lower intelligence with significantly less latency for the majority of tasks.”
User feedback disagreed. As sound UX design assumes, most users stayed on the default rather than reaching for /effort, so a “slightly lower” eval delta translated into a much larger perceived quality drop in the wild. The change was reverted on April 7; Opus 4.7 now defaults to xhigh and everything else to high.
Takeaway. Moving the default operating point on a model’s test-time compute curve is one of the easiest ways to ship a silent quality regression. Internal evals undercount the human-perceived gap, and because most users never change defaults, the regression reaches nearly everyone: defaults are the product promise.
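For teams calling the API directly, the analogous defense is to pin the operating point you validated instead of inheriting whatever the default happens to be. A minimal sketch using the Anthropic Python SDK’s extended-thinking budget; the model id and token budgets are illustrative placeholders, not recommendations:

```python
# Pin the operating point you actually tested, rather than inheriting a default
# that can change underneath you. Model id and budgets are placeholder values.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # hypothetical id from the postmortem; use your pinned model
    max_tokens=16000,
    # Explicit thinking budget: a change to the harness default can no longer
    # silently move this request along the latency-vs-intelligence curve.
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Refactor the retry logic in client.py"}],
)

text_blocks = [block.text for block in response.content if block.type == "text"]
print("".join(text_blocks))
```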
2. A caching optimization that dropped thinking history every turn (Mar 26)
This is the most technically interesting failure. Anthropic leans hard on prompt caching — the team literally wrote “prompt caching is everything”.
The intent was clean: when a session has been idle for more than an hour and is bound for a cache miss anyway, prune older thinking blocks to reduce uncached tokens at resume time. They reached for the clear_thinking_20251015 context-editing strategy with keep:1.
The bug. Instead of running once when an idle session resumed, the clear header was attached to every subsequent request for the rest of the session. Each request told the API to keep only the most recent reasoning block and discard the rest. If a follow-up arrived mid-tool-use, even the current turn’s reasoning got dropped. Claude kept executing, but increasingly without memory of why it had picked the actions it had — surfacing as the forgetfulness, repetition, and odd tool choices users reported.
A secondary effect: every such request became a cache miss, which is what drove the parallel reports of usage limits draining unexpectedly fast.
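The difference between the intent and what shipped fits in a few lines. A minimal sketch, assuming a simplified session object; the strategy name and keep: 1 are from the postmortem, everything else here is a hypothetical simplification:

```python
# Minimal sketch of the intended one-shot prune vs. the shipped always-on prune.
# `clear_thinking_20251015` and keep: 1 come from the postmortem; the Session
# object and both functions are hypothetical stand-ins for the real harness code.
from dataclasses import dataclass

IDLE_THRESHOLD_S = 3600  # "idle for more than an hour"


@dataclass
class Session:
    last_request_at: float
    clear_thinking: bool = False  # the flag that latched on in the shipped version


def context_edits_intended(session: Session, now: float) -> list[dict]:
    """Prune older thinking blocks once, on the first request after a long idle gap."""
    if now - session.last_request_at > IDLE_THRESHOLD_S:
        # Bound for a cache miss anyway, so trimming uncached thinking tokens is cheap.
        return [{"type": "clear_thinking_20251015", "keep": 1}]
    return []


def context_edits_shipped(session: Session, now: float) -> list[dict]:
    """The bug: the flag was set on resume and never reset, so every later request
    kept only the newest reasoning block and, as a side effect, missed the cache."""
    if now - session.last_request_at > IDLE_THRESHOLD_S:
        session.clear_thinking = True
    if session.clear_thinking:
        return [{"type": "clear_thinking_20251015", "keep": 1}]
    return []
```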
Why it slipped through
“The changes it introduced made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding.”
Three coincidences combined:
- An internal-only message-queuing experiment running concurrently muddied the signal
- An orthogonal change to thinking display suppressed the bug in most CLI sessions
- The trigger was a stale-session corner case that didn’t reproduce in dogfooding
After the fact, Anthropic back-tested Claude Code Review on the offending PRs: Opus 4.7 found the bug when given enough repo context; Opus 4.6 did not. One of the committed follow-ups is to ship multi-repo context support in Code Review to customers.
Takeaway. Don’t watch cache hit rate purely as a cost metric. A sudden jump in cache misses is a first-class signal of a context-management regression. Memory/reasoning-preservation code lures unit tests into false confidence — your multi-turn integration tests should explicitly assert how context evolves as turn count grows.
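One way to encode that assertion in a multi-turn integration test, assuming a hypothetical harness whose run_turn returns the exact message list sent to the API on that turn:

```python
# Sketch: assert that prior turns' reasoning survives as the conversation grows.
# `agent_session` and `run_turn` are hypothetical stand-ins for your own harness.
def test_thinking_blocks_survive_across_turns(agent_session):
    for turn in range(1, 6):
        sent_messages = agent_session.run_turn(f"continue with step {turn}")
        thinking_blocks = [
            block
            for msg in sent_messages
            if msg["role"] == "assistant"
            for block in msg["content"]
            if block.get("type") == "thinking"
        ]
        # Context may be pruned only by an explicit, tested policy -- never
        # silently collapsed to the single most recent reasoning block.
        assert len(thinking_blocks) >= turn - 1, (
            f"turn {turn}: expected reasoning from earlier turns to be retained"
        )
```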
3. One system-prompt line cost 3% on evals (Apr 16)
Opus 4.7’s launch post calls out a verbose tendency in the new model — smarter on hard problems, more output tokens. Anthropic worked the problem across training, prompting, and product UX. One line in the system prompt did outsized damage:
“Length limits: keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.”
The eval set in use during pre-release testing showed no regression, so it shipped on April 16. Post-incident ablation against a broader eval suite showed a 3% drop on both Opus 4.6 and Opus 4.7. Reverted in v2.1.116 on April 20.
Takeaway. A single system-prompt line is a globally-applied config change, not an experiment. The same line affects each model differently — hence Anthropic’s new CLAUDE.md guidance that “model-specific changes are gated to the specific model they’re targeting.”
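In practice the gating guidance reduces to: never compose a model-specific line into every model’s prompt. A minimal sketch; only the quoted length-limit line comes from the postmortem, while the helper and dictionary names are hypothetical:

```python
# Sketch: model-specific prompt lines are gated to the model they were written for.
# BASE_SYSTEM_PROMPT, MODEL_SPECIFIC_LINES, and build_system_prompt are assumed names.
BASE_SYSTEM_PROMPT = "You are a coding agent..."  # shared across all models

MODEL_SPECIFIC_LINES: dict[str, list[str]] = {
    # The verbosity-control line that regressed other models when applied globally.
    "claude-opus-4-7": [
        "Length limits: keep text between tool calls to ≤25 words. "
        "Keep final responses to ≤100 words unless the task requires more detail."
    ],
}


def build_system_prompt(model: str) -> str:
    lines = [BASE_SYSTEM_PROMPT]
    lines.extend(MODEL_SPECIFIC_LINES.get(model, []))  # other models never see the line
    return "\n".join(lines)
```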
Why detection took a month — anatomy of a signal-separation failure
Three changes, three rollout schedules, three different traffic slices:
| Change | Affected models | Traffic slice | Time to find |
|---|---|---|---|
| effort default | Sonnet 4.6, Opus 4.6 | default-mode users (majority) | ~5 weeks |
| thinking-clear bug | Sonnet 4.6, Opus 4.6 | sessions resumed after 1hr idle | ~2 weeks |
| verbosity prompt | Sonnet 4.6, Opus 4.6, Opus 4.7 | everything after Opus 4.7 ship | ~4 days |
Each cohort suffered differently, and the aggregate looked like “broad, inconsistent degradation,” the worst possible pattern for an incident commander to disentangle. Meanwhile, the community surfaced detailed external audits (e.g., Stella Laurenzo’s analysis of 6,852 session files and 234,000 tool calls) that became forcing functions.
The Google SRE chapter on managing incidents frames “distinguish signal from noise” as the first job; for LLM products it gets harder because user satisfaction is inherently distributional. Reports right after a change blend confirmation bias with real regressions.
What Anthropic committed to going forward
From the postmortem’s “Going forward” section:
- Have more internal staff use the exact public Claude Code build rather than the feature-test build — closing the dogfooding gap
- Ship the internal Code Review improvements (additional repo context) to customers
- Per-model evals required for every system-prompt change, with ablation
- New tooling to review and audit prompt changes
- CLAUDE.md guidance to gate model-specific changes
- Soak periods + gradual rollouts for any change that could trade off against intelligence
- @ClaudeDevs on X and GitHub as centralized comm channels
Compared with OpenAI’s public incident pattern on its status page — mostly availability and latency events — Anthropic is unusual in formally extending the incident surface to include quality regressions.
What this means if you build on Claude (or any frontier API)
The blast radius of shared infrastructure now includes harness and system prompt, not just model weights. As a downstream operator:
- Regression-test the output distribution. Beyond latency and error rate, baseline token counts, tool-call patterns, and response lengths, and diff them daily (a minimal sketch follows this list). LLM eval platforms like LangSmith and Braintrust exist for this.
- Feature-flag your own prompt changes. When your changes and the vendor’s overlap in time, signal separation becomes nearly impossible.
- Plan for multi-provider routing. Tools like LiteLLM, OpenRouter, and AWS Bedrock let you fail over between models. Single-vendor dependence creates exactly this “all users simultaneously worse” pattern.
- Elevate cache hit rate to a real SLI. Sudden miss-rate jumps are both a cost signal and a context-management regression signal; the sketch after this list treats miss rate as one of the baselined metrics.
- Idempotent retries + circuit breakers still apply. Polly and resilience4j patterns work for LLM clients too — just budget for retries doubling token spend.
- Combine user feedback with quantitative metrics. Free-text reports are leading indicators of unseparated quality regressions, not noise to discard.
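As a concrete starting point for the first and fourth bullets, a minimal sketch of a daily output-distribution baseline that also tracks cache-miss rate; the log field names and the 20% tolerance are illustrative assumptions:

```python
# Sketch: daily output-distribution baseline plus cache-miss rate as an SLI.
# Record fields (output_tokens, tool_calls, cache_read_tokens) are assumed names
# for whatever your own request logging captures.
import statistics


def distribution_report(records: list[dict]) -> dict:
    """Summarize one day of request logs into a few distributional metrics."""
    return {
        "median_output_tokens": statistics.median(r["output_tokens"] for r in records),
        "mean_tool_calls": statistics.mean(r["tool_calls"] for r in records),
        "cache_miss_rate": sum(r["cache_read_tokens"] == 0 for r in records) / len(records),
    }


def drift_alerts(today: dict, baseline: dict, tolerance: float = 0.20) -> list[str]:
    """Flag any metric that moved more than `tolerance` relative to the baseline."""
    alerts = []
    for key, base in baseline.items():
        if base and abs(today[key] - base) / base > tolerance:
            alerts.append(f"{key} drifted: {base:.2f} -> {today[key]:.2f}")
    return alerts
```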
Insights
All three causes are LLM-flavored versions of textbook operational failures. (1) A default change broke implicit user-behavior assumptions. (2) A classic off-by-N bug sat deep in caching-optimization code and survived every layer of review and testing. (3) Eval-set coverage wasn’t broad enough to catch a 3% regression from one system-prompt line. Nothing here is new.
What’s new is the diagnostic difficulty. The moment model, harness, and prompt ship as a single bundle to users, overlapping slice regressions don’t light up a status page red dot. The controls Anthropic added — required per-model evals, automated ablation, soak periods, narrowing the dogfooding gap — all amount to “apply infrastructure-grade change management to everything that ships besides the model weights.” Downstream builders should reach the same conclusion. The model is an external variable, but prompts, routing, and retry policy are ours. Without SRE-grade change discipline on our side of the line, we’ll inflict our own six-week silent degradation on our own users.
References
Primary Anthropic sources
- An update on recent Claude Code quality reports — the postmortem itself
- Lessons from building Claude Code — prompt caching is everything
- Claude Opus 4.7 launch post
- Engineering at Anthropic index
Anthropic API docs
- Extended thinking guide
- Context editing — clear_thinking_20251015
- Prompt caching docs
- Messages API reference
- Claude Code docs
SRE / incident-response background
- Google SRE Book — Managing Incidents
- Feature Toggles (Martin Fowler)
- Scaling Test-Time Compute (Snell et al., 2024)
