← Reddit

4.6>4.7

Reddit · Major-Gas-2229 · April 27, 2026

Detailed Analysis

Claude Opus 4.7, released around April 16, 2026, as Anthropic's top generally available model, delivers measurable performance gains over its predecessor, Claude Opus 4.6, across coding and vision benchmarks while introducing meaningful cost and safety trade-offs. On SWE-bench Verified, 4.7 scores 87.6% versus 4.6's 80.8%, a gain of 6.8 percentage points, and the gap widens further on SWE-bench Pro (64.3% vs. 53.4%, +10.9pp) and CursorBench (70% vs. 58%, +12pp). Vision performance similarly advances, with 4.7 processing images at three times the resolution of 4.6 and improving visual reasoning by 13%. On Rakuten's production benchmark, 4.7 resolves approximately three times as many tasks. These are not incremental improvements — they represent a substantive generational shift in capability for the workloads that most benefit from extended, high-effort reasoning.

The architectural philosophy behind 4.7 centers on what Anthropic describes as "adaptive" thinking: the model generates more output at higher effort levels and treats instruction sets — particularly bulleted lists — as hard requirements rather than soft guidance. This produces stricter, more precise task execution in long agentic workflows and reduces the number of prompt refinement cycles needed to achieve a target result. However, this same behavioral shift creates friction in shorter, more conversational contexts, where the increased verbosity translates directly into token overhead. Output tokens for 4.7 are priced at $25 per million versus $5 per million for input tokens, and 4.7 uses up to 35% more tokens per request than 4.6, making it materially more expensive for high-volume or low-complexity use cases where its additional capability goes largely unutilized.

The safety regression in harm-reduction contexts — specifically, guidance around controlled substances — represents a non-trivial deployment consideration. Anthropic's own documentation acknowledges that 4.7 is "modestly weaker" in this area, which means organizations running health, crisis, or drug-policy workflows face a genuine reason to hold on 4.6 rather than upgrade by default. This pattern — capability advancing faster than safety alignment in specific domains — is a recurring challenge in frontier model development and underscores the importance of benchmark granularity beyond headline scores. A model can be objectively better on aggregate while being measurably worse in high-stakes subcategories.

UI design testing reveals a subtler but instructive split: 4.7 produces visually sophisticated, Apple-style outputs that score higher on aesthetic appeal but exhibit more structural flaws — inconsistent spacing, invented or hallucinated icons — compared to 4.6's more grounded and consistent results. This illustrates a broader tension in large language model development between surface-quality outputs and functional reliability. A model optimized for high-effort, extended reasoning tasks can develop a form of overconfidence in generative domains where groundedness and constraint matter more than stylistic ambition. The practical takeaway for product teams is that 4.7's design outputs may require more post-generation QA, not less.

The 4.6-versus-4.7 decision ultimately functions as a workload routing problem rather than a straightforward upgrade path. Agentic coding pipelines, long-running autonomous agents, high-resolution image analysis, and hard-problem reasoning represent the clear upgrade cases where 4.7's benchmark gains justify its cost premium. Short-turn conversations, extractive question-answering, harm-reduction workflows, and consistency-sensitive UI generation remain better served by 4.6. This kind of model-tier differentiation — where the "newer" model is not universally superior — reflects the maturing state of the frontier AI market, where organizations are increasingly expected to evaluate models at the task level rather than treating version increments as blanket improvements.

Read original article →