My experience and questions with Claude 4.7 after 2 days and a few million tokens

Claude 4.7 demonstrates superior instruction-following and occasional reasoning improvements over Opus 4.6, though it exhibits increased hallucination and corner-cutting, particularly in web-based tasks. The model delivers impressive output in terminal environments and semantic synthesis work but consumes tokens at approximately four times the rate of its predecessor, suggesting possible parallel processing or simultaneous compilation in its orchestration. The user is evaluating whether to maintain Opus 4.6 for stability or adopt a hybrid workflow using Sonnet for initial task completion followed by 4.7 evaluation.

Detailed Analysis

A Reddit user's first-hand account of operating Claude Opus 4.7 across several million tokens offers a granular, practitioner-level assessment of the model released by Anthropic on April 16, 2026. The user reports a clear improvement in instruction-following over Opus 4.6 and highlights a standout performance in Research Mode, where the model scanned roughly 5,100 sources and persisted until completing its objective — a behavior suggestive of enhanced agentic persistence. However, those gains are offset by a marked increase in hallucination frequency during daily reasoning tasks, with the user noting the model often self-corrects after the fact, implying awareness of its own errors but an inability to suppress them during initial generation. Performance also appears to degrade meaningfully on the web interface compared to terminal use, where outputs for semantic synthesis and rewriting are described as exceptional.

The cost dimension stands out as a central concern. The user estimates token consumption running approximately four times higher than with Opus 4.6, a figure that aligns directionally with broader community reports of billing increases of up to 40%, though the user's figure is considerably more severe — likely reflecting heavy agentic and multi-step workflows. The user speculates that the model may be running parallel agents or conducting a form of simultaneous compilation, and describes the subjective experience as though a faster, lighter model such as Haiku or Sonnet is generating output while Opus 4.6-level evaluation runs concurrently on top. This architectural intuition, while unconfirmed, is consistent with Anthropic's documented focus on self-verification capabilities as a core improvement in 4.7. Boris Cherny, lead for Claude Code at Anthropic, acknowledged community friction by noting that the model requires several days of adjustment to be leveraged effectively — a statement that implicitly confirms real-world friction exists at the transition.

The user's proposed mitigation strategies illuminate a broader tension in how developers are adapting to the model's cost-performance profile. Rather than accepting 4.7 as a drop-in replacement, the user considers a tiered orchestration approach — routing tasks first through Sonnet for drafting, then using 4.7 exclusively for evaluation — effectively treating the more capable model as an auditor rather than a primary generator. This pattern reflects an emerging development philosophy in the AI engineering community: as frontier models become more powerful but also more expensive and behaviorally complex, practitioners increasingly architect workflows that disaggregate reasoning tasks rather than routing everything through a single top-tier model. The user's alternative — reverting to Opus 4.6 — reflects the real cost of capability regressions in production environments, where consistency and predictability often matter more than peak performance on benchmarks.

The discrepancy between Anthropic's official benchmark claims and community-reported experience is worth contextualizing carefully. Opus 4.7 demonstrably outperforms 4.6 on 12 of 14 official tests and resolves three times more production tasks on Rakuten's SWE-Bench. Yet documented regressions — including a multi-hop retrieval collapse from 78.3% to 32.2% and increased literal interpretation of prompts that previously relied on implicit understanding — reveal that aggregate benchmark scores can mask significant distributional shifts in model behavior. Users whose workflows depend on the specific capabilities that regressed will experience a net downgrade regardless of headline improvements. This is a recurring challenge in large language model deployment: version-to-version capability changes are rarely monotonic across all task types, and enterprises or power users with specialized workflows often bear the costs of capability trade-offs that aggregate metrics obscure.

The episode underscores a maturation point in the frontier AI model market, where users sophisticated enough to operate at multi-million-token scales have developed nuanced frameworks for evaluating model transitions — moving well beyond simple quality comparisons toward cost-adjusted performance analysis, orchestration architecture design, and workflow-specific regression testing. Anthropic's acknowledgment that adjustment time is needed signals an awareness that model releases now carry integration costs for advanced users, not just capability promises. As Claude models continue to evolve, the growing community of high-volume practitioners sharing empirical data at this level of detail is increasingly shaping the practical narrative around model releases, functioning as a parallel evaluation layer alongside formal benchmarking.

Read original article →

Detailed Analysis

Don't Miss a Deploy