Is it just me or is Opus 4.7 an idiot and much worse than OG 4.6?

A Claude Opus 4.7 user reported that the new model release performs worse than its predecessor, Opus 4.6, making more errors and requiring significantly more corrections and input. The user preferred 4.6 for its natural writing style and its ability to solve problems efficiently with minimal guidance, whereas 4.7 fell below expectations and, in the user's experience, underperformed even Sonnet.

Detailed Analysis

A Reddit post on r/Anthropic has surfaced a sentiment shared by some Claude Pro subscribers: that Claude Opus 4.7 feels like a step backward compared to its predecessor, Opus 4.6. The original poster, a self-described daily Claude user, describes Opus 4.7 as requiring more correction, producing less natural outputs, and feeling worse even than Sonnet-tier models. The post invites community discussion on whether this perceived regression is widespread. However, objective benchmark data and enterprise testing tell a markedly different story — one in which Opus 4.7 represents a substantial leap forward across most meaningful dimensions of model performance.

Benchmark results across multiple independent evaluations strongly contradict the "degradation" narrative. On SWE-bench Verified, Opus 4.7 scores 87.6% compared to Opus 4.6's 80.8%, a gain of 6.8 percentage points. SWE-bench Pro shows an even larger jump, rising 10.9 points from 53.4% to 64.3%, and CursorBench climbs 12 points to 70%. These are not marginal improvements: they reflect the model solving coding tasks that Opus 4.6 could not complete at all. Beyond raw accuracy, Opus 4.7 is also substantially more efficient: it needs fewer than half as many LLM calls (7.1 versus 16.3), fewer tool calls, and 30% fewer AI Units, and it achieves a median latency of 183 seconds versus 242 seconds. Enterprise partners including Box, Bolt.new, Cursor, and Rakuten have independently validated these gains in production agentic workloads, positioning Opus 4.7 as a meaningful upgrade rather than a lateral or regressive release.
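The arithmetic behind those comparisons can be checked with a short sketch. The figures are copied from the paragraph above and are not independently measured; the helper names `point_gain` and `relative_reduction` are made up for illustration.

```python
# Figures as quoted in the text above: (Opus 4.6, Opus 4.7), in percent.
scores = {
    "SWE-bench Verified": (80.8, 87.6),
    "SWE-bench Pro": (53.4, 64.3),
}


def point_gain(old: float, new: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


def relative_reduction(old: float, new: float) -> float:
    """Fractional reduction from old to new (0.25 means 25% lower)."""
    return (old - new) / old


gains = {name: point_gain(old, new) for name, (old, new) in scores.items()}
call_ratio = 16.3 / 7.1                    # LLM calls, 4.6 relative to 4.7
latency_cut = relative_reduction(242, 183)  # median latency in seconds

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
print(f"LLM calls: 4.6 used {call_ratio:.2f}x as many as 4.7")
print(f"Median latency: {latency_cut:.1%} lower")
```

Keeping percentage-point gains separate from relative reductions matters here: a 6.8-point gain on SWE-bench Verified and a roughly 24% cut in median latency are different kinds of quantity and should not be compared directly.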

The subjective user experience of regression, while genuine in feeling, is most likely attributable to several structural changes in the model rather than a true decline in capability. Opus 4.7's significantly stronger instruction-following behavior means that prompts carefully tuned for Opus 4.6's behavioral quirks may produce noticeably different, and initially less satisfying, outputs when run without modification against 4.7. This is a well-documented phenomenon in model transitions: prompt sensitivity shifts when underlying model behavior changes, and users who have developed intuitive shorthand with one model must recalibrate. Additionally, Opus 4.7 ships with a new tokenizer that can increase token consumption by up to 1.35 times, so the same text can fill the context window faster. That may subtly affect how long conversations are handled, again producing outputs that feel "off" to experienced users without being objectively inferior.
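To make the tokenizer effect concrete, here is a rough sketch of how much text fits in a fixed window when token counts grow by the quoted upper bound of 1.35 times. The 100,000-token window is a hypothetical round number chosen for illustration, not a published limit for either model, and `effective_capacity` is an illustrative helper.

```python
TOKEN_INFLATION_MAX = 1.35  # upper bound quoted in the text


def effective_capacity(window_tokens: int,
                       inflation: float = TOKEN_INFLATION_MAX) -> int:
    """Old-tokenizer-equivalent tokens of text that fit in the window."""
    return int(window_tokens / inflation)


HYPOTHETICAL_WINDOW = 100_000  # assumption for illustration only
capacity = effective_capacity(HYPOTHETICAL_WINDOW)
shrink = 1 - capacity / HYPOTHETICAL_WINDOW

print(f"Worst case: room for about {capacity:,} old-tokenizer tokens of text")
print(f"That is roughly {shrink:.0%} less text in the same window")
```

Under the worst-case factor, the same window holds about a quarter less text, which is one plausible reason a long conversation could behave differently even when the prompts are unchanged.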

One area where Opus 4.7 does demonstrate a genuine, measurable regression is BrowseComp, where it scores 79.3% — a drop of 4.4 percentage points from Opus 4.6, and well behind GPT-5.4's 89.3% on the same task. This makes Opus 4.7 a weaker choice for deep, open-ended web research tasks specifically, and users whose primary workflows center on that use case have a legitimate basis for preferring the previous model. It is plausible that users who lean heavily on research-oriented or conversational generative tasks — as opposed to coding, agentic pipelines, or app-building — are disproportionately represented among those perceiving a downgrade, since those are the domains where Opus 4.7's gains are concentrated and its one notable regression resides.

The broader trend illustrated by this episode is the growing complexity of what "better" means as frontier AI models become increasingly specialized. Anthropic's apparent design philosophy with Opus 4.7 prioritizes agentic efficiency, coding robustness, and context compression resilience — capabilities that matter enormously in enterprise and developer contexts but may be largely invisible to individual users engaged in freeform creative or research tasks. As AI labs optimize models for measurable benchmarks and production deployment scenarios, a gap can emerge between objective capability gains and the felt experience of everyday users. This tension between benchmark performance and subjective usability is not unique to Anthropic; it reflects an industry-wide challenge in communicating what model improvements actually mean for different user populations, and underscores the importance of prompt adaptation whenever a new model generation is deployed.
