Detailed Analysis
Anthropic's Boris Cherny, the creator and lead of Claude Code, published a post-mortem report on April 23, 2026, formally acknowledging that the Claude Code product had degraded in performance — not due to changes in the underlying Claude model, but as a direct result of deliberate product-level modifications made between February and March 2026. The admission followed weeks of controversy sparked by AMD's Stella Laurenzo, who conducted a rigorous empirical analysis of 6,852 Claude Code sessions, 234,760 tool calls, and 17,871 thinking blocks drawn from production logs spanning January through March 2026. Her findings, published in a GitHub issue on April 2, documented a striking 73% drop in median visible thinking length — falling from 2,200 to just 600 characters — and up to 80 times more API retries per task. Cherny initially disputed her conclusions before the internal investigation validated the data in full, making the eventual post-mortem a significant reversal for Anthropic.
Three specific product changes were identified as the compounding root causes of the degradation. On February 9, Anthropic introduced adaptive thinking by default, shifting Claude Code away from a fixed reasoning budget toward dynamically adjusted reasoning depth per task. On February 12, intermediate thinking was redacted from user-facing interfaces to reduce latency — a change Anthropic framed as UI-only but which showed measurable impacts in session transcripts. Then on March 3, the default effort level was dropped from high to a medium setting at level 85. None of these changes individually triggered adequate pre-deployment evaluation, and their interactions created compounding negative effects. Cherny characterized the investigation as "the most complex we've had," noting the presence of confounders including separate issues specific to the Opus 4.7 model. Anthropic has since committed to per-model evaluations, gradual rollouts, an improved Code Review tool, and resetting usage limits for affected subscribers.
The episode carries considerable significance for the broader AI developer tooling ecosystem, where trust and consistency of performance are foundational to professional adoption. Users reported not only the visible degradation in reasoning quality but also ancillary compounding problems including cache invalidation bugs and reduced API quotas, which amplified operational costs for teams running Claude Code in production environments. The fact that Laurenzo's analysis — produced externally, by an AMD researcher — proved more accurate than Anthropic's initial internal assessment underscores the growing importance of third-party empirical scrutiny in AI product accountability. It also raises questions about internal evaluation rigor at a company that has positioned safety and reliability as central to its identity.
Zooming out, the Claude Code post-mortem reflects a pattern emerging across the frontier AI industry: the gap between model capability and product-layer implementation is increasingly where real-world performance is won or lost. Anthropic's own acknowledgment that the underlying Claude model did not degrade — yet users experienced substantial degradation — highlights how inference configuration, UX decisions, and cost-optimization choices can functionally diminish a model's effective capability without any change to the weights themselves. This dynamic is particularly acute in agentic coding tools, where reasoning depth and retry reliability have outsized downstream effects on task completion. The industry is still developing best practices for testing these layered systems holistically rather than in isolation.
The controversy also arrives at a moment when Cherny has been publicly predicting the obsolescence of traditional developer tools, including VS Code and Xcode, by late 2026, as AI agents assume full coding workflows. That forward-looking vision now sits in tension with a post-mortem that revealed Anthropic's own flagship coding agent quietly regressed under product changes that went undetected until an outside researcher intervened. The juxtaposition sharpens a central challenge for the AI-native development era: as the stakes of agent reliability grow, so too does the cost of silent, incremental performance erosion — and the premium on transparency when it occurs.
Read original article →