I tested Anthropic’s new Claude Opus 4.7 — and it’s the first AI that actually ‘reasons’ through tasks - Tom's Guide

Detailed Analysis

Anthropic's Claude Opus 4.7 represents a significant step forward in the company's effort to build AI systems that reason through complex tasks rather than pattern-match toward surface-level outputs. The model's headline feature, adaptive thinking, departs from previous approaches that allocated a fixed reasoning budget regardless of task complexity. Instead, Opus 4.7 dynamically calibrates the number of reasoning tokens it deploys, spending more computational effort on harder problems while returning fast responses to simpler queries. This shift addresses a core inefficiency in earlier reasoning-capable models and reflects a maturing understanding of how inference-time compute should be managed. The model also introduces a new "xhigh" effort level, which fills a gap between the existing high and max reasoning settings and gives developers finer-grained control over the performance-cost tradeoff.

The most concrete evidence of Opus 4.7's reasoning improvements comes from coding benchmarks, where performance gains are both measurable and operationally meaningful. On CursorBench, the model achieved 70% compared to its predecessor Opus 4.6's 58% — a 12-percentage-point jump that signals genuine capability improvement rather than marginal refinement. On SWE-bench Pro and SWE-bench Verified, Opus 4.7 scored 64.3% and 87.6% respectively, demonstrating stronger performance on the kinds of long-horizon, multi-step software engineering tasks that are increasingly central to real-world AI deployment. Improvements in deductive logic — an area where Opus 4.6 notably struggled — further round out the model's analytical profile. Reviewers testing the model in live code review scenarios found it surfacing issues that competing models either missed or abandoned, suggesting the gains are not purely benchmark-driven.
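To put the CursorBench figures in perspective, it helps to separate the absolute gain (percentage points) from the relative gain (percent improvement over the earlier score). The short Python sketch below does that arithmetic using only the two scores quoted above; the function names are illustrative and not part of any benchmark tooling.

```python
def percentage_point_gain(old_pct: float, new_pct: float) -> float:
    """Absolute difference between two percentage scores."""
    return new_pct - old_pct


def relative_gain(old_pct: float, new_pct: float) -> float:
    """Improvement expressed as a percent of the old score."""
    return (new_pct - old_pct) / old_pct * 100


# CursorBench scores as quoted in the article: Opus 4.6 vs. Opus 4.7.
old_score, new_score = 58.0, 70.0

pp = percentage_point_gain(old_score, new_score)
rel = relative_gain(old_score, new_score)

print(f"{pp:.1f} percentage points, {rel:.1f}% relative")
# prints "12.0 percentage points, 20.7% relative"
```

On these numbers, the 12-point jump corresponds to roughly a one-fifth improvement over the predecessor's score. The same calculation cannot be applied to the SWE-bench results, since the article gives no Opus 4.6 baseline for them.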

The release of Opus 4.7 situates Anthropic within a broader industry race to build models that can operate with greater autonomy on complex, open-ended tasks. The emphasis on agentic coding workflows — tasks requiring sustained planning, tool use, and error correction over extended sessions — reflects where enterprise AI demand is concentrating. Historically, such workflows exposed weaknesses in instruction-following and ambiguity resolution; Opus 4.7's documented improvements in both areas suggest Anthropic is directly targeting deployment friction rather than simply optimizing for leaderboard scores. The model's stronger systems engineering performance also signals a push toward reliability in production environments, not just research benchmarks.

Notably, Opus 4.7 is not Anthropic's most powerful model. Claude Mythos Preview occupies that position but remains in limited release due to safety concerns, a detail that underscores Anthropic's continued prioritization of safety-gated deployment for its frontier systems. This tiered release strategy, in which the most capable models are held back while improved but safer versions are broadly deployed, reflects the company's approach to responsible scaling. It also creates a visible capability ceiling for Opus 4.7, signaling to developers and researchers that a more powerful tier exists but is not yet considered ready for wide availability. This dynamic is increasingly common across the frontier AI landscape, as companies balance competitive pressure to release capable models against the reputational and safety risks of premature deployment. Opus 4.7's release thus functions both as a genuine product milestone and as a public demonstration of how Anthropic manages the gap between capability and readiness.
