Detailed Analysis
A Reddit user on r/Anthropic, identifying as a Claude Max subscriber, has reported a notable degradation in Claude Sonnet 4.6's performance on everyday chat-based tasks over roughly 24 hours. The complaints span several distinct failure modes: explicit instructions that are followed only after multiple follow-up prompts, errors on tasks previously handled without difficulty (such as parsing speaker attribution in screenshots of text conversations), a perceived loss of conversational warmth, Notion tool calls that the model reports as successful but that do not take effect, and difficulty accessing project files. The user explicitly notes that they do not use Claude Code, which places their experience entirely within standard chat interfaces for writing and personal productivity workflows.
The reported issues sit in tension with Anthropic's documented performance data for Sonnet 4.6. Across standardized benchmarks and enterprise evaluations, the model outperforms its predecessor: early-access developers preferred Sonnet 4.6 over 4.5 approximately 70% of the time, Box reported a 15-percentage-point improvement in heavy reasoning tasks, and OSWorld computer-use scores rose from 61.4% to 72.5%, a gain of 11.1 percentage points. These gains are not trivial and span coding, reasoning, and agentic tool use, which suggests that the 4.6 release itself did not introduce a universal regression. The disconnect between aggregate benchmark data and individual user experience is a recurring challenge in evaluating large language model deployments.
A likely explanatory factor is the reported performance gap between agentic and constrained chat environments. Available reporting indicates that Claude Sonnet 4.6's strongest results emerge when the model is given greater operational freedom, as in Claude Code or multi-step agentic pipelines, rather than in standard turn-by-turn chat interfaces. Because of this asymmetry, a model can show measurable improvements on benchmarks while feeling less capable to users whose workflows depend on a constrained chat context. The tool-execution failures the user describes, particularly the Notion integration reporting success without actually writing data, are consistent with known reliability challenges in agentic tool calling when a model operates through intermediary layers rather than native integrations.
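The "reported success without writing data" failure described above can be caught on the client side by reading the data back after each write. The following is a minimal sketch of that verify-after-write pattern, not Notion's API or Anthropic's tooling: `InMemoryStore`, `drop_writes`, and `verified_write` are hypothetical names invented for illustration, and the store is an in-memory stand-in for whatever backend a tool call targets.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a notes backend; this is NOT the Notion API."""
    pages: Dict[str, str] = field(default_factory=dict)
    drop_writes: bool = False  # assumption: simulates a write that is silently lost

    def write(self, page_id: str, text: str) -> bool:
        if not self.drop_writes:
            self.pages[page_id] = text
        # Always claims success, mirroring the failure mode described above.
        return True

    def read(self, page_id: str) -> Optional[str]:
        return self.pages.get(page_id)


def verified_write(store: InMemoryStore, page_id: str, text: str) -> bool:
    """Return True only if a read-back confirms the reported write landed."""
    reported_ok = store.write(page_id, text)
    actual = store.read(page_id)
    return reported_ok and actual == text


if __name__ == "__main__":
    healthy = InMemoryStore()
    flaky = InMemoryStore(drop_writes=True)
    print(verified_write(healthy, "todo", "buy milk"))
    print(verified_write(flaky, "todo", "buy milk"))
```

In this sketch the healthy store passes the check and the flaky one fails it, even though both report success. A real integration would need its own read path and would have to tolerate eventual consistency, which this example does not model.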
There is also a separate and harder-to-quantify dimension to the complaint: the perception that Sonnet 4.6's outputs have become less personable and that its reasoning has grown more verbose or formulaic. Reports from technical communities echo this, with some developers observing that Sonnet 4.6's reasoning tokens, initially more tightly structured, became more verbose over time, a pattern some liken to behavior seen in competing models. Whether this reflects post-launch changes to inference infrastructure, prompt-layer adjustments, or natural model variability remains unclear. Anthropic has not publicly acknowledged a targeted regression, and the complaints appear to be an emerging user-reported signal rather than a confirmed engineering issue.
The broader significance of this post lies in what it reveals about the mismatch between how AI labs measure and communicate model capability and how end users actually experience it. Benchmark improvements in agentic and enterprise contexts do not automatically translate into a better experience for the large base of users who interact with these models through everyday chat for writing, task management, and personal productivity. As Anthropic continues to optimize Sonnet 4.6, and as the industry broadly tilts toward agentic use cases as the primary design target, the risk grows that mainstream chat users become a secondary optimization priority, experiencing performance plateaus or perceived regressions even as headline benchmark numbers climb.