Opus 4.7 is a regression from 4.6 - real-world document generation broken

A production user found that Opus 4.7 failed to generate a Word document properly, producing instead a plain text markdown file saved with a .docx extension, with no warning of the error. When asked to correct the document using the original as a template, the model took an inefficient approach by attempting to edit individual XML cells one at a time, exhausting the session's tool budget before completion. The user characterized the release as a regression despite higher benchmark scores, arguing that real-world workflow reliability should outweigh academic evaluation performance.

Detailed Analysis

Anthropic's release of Claude Opus 4.7 on April 16, 2026 has generated significant user backlash, with a core complaint emerging from professional users who rely on the model for production document workflows rather than conversational or analytical tasks. The Reddit post in question describes a specific and reproducible failure: when tasked with updating a Word document — a routine operation successfully handled by Opus 4.6 — Opus 4.7 produced a plain-text markdown file with a .docx extension, confidently delivering a structurally invalid file without any acknowledgment of error. When the user attempted to correct the output using the model's own tool access, the model adopted an inefficient cell-by-cell XML editing strategy rather than regenerating the document in a single pass, ultimately exhausting the session's tool budget and issuing a handoff document requesting continuation in a fresh session. The user reverted to Opus 4.6, which completed the identical task in one pass.

The complaint is not anecdotal in isolation. Benchmark data corroborates the regression: on the MRCR v2 long-context retrieval benchmark, Opus 4.7 scores 32.2% at one million tokens compared to 78.3% for Opus 4.6 — a 46-point drop — and falls from 91.9% to 59.2% at 256k tokens on the eight-needle retrieval task. This degradation in long-context document retrieval maps directly onto the class of failures users are reporting in production environments, including RAG pipelines, document agents, and multi-part file analysis tools. The social signal reinforces the technical signal: a Reddit post characterizing 4.7 as a "serious regression" accumulated 2,300 upvotes, while a related post on X received 14,000 likes — unusually high engagement for a model quality complaint in a space accustomed to incremental iteration.

The failure mode described in the post is particularly damaging because it is a confident failure. The model did not degrade gracefully or flag uncertainty; it delivered a broken file with full confidence and then compounded the error with a demonstrably inefficient recovery strategy. This pattern — high confidence, poor task decomposition, and failure to self-monitor resource consumption — points to something more concerning than a narrow capability regression. It suggests that the model's internal calibration around task difficulty and approach selection may have shifted in ways that are not captured by headline benchmark scores. The poster's core critique, that the model "made a poor decision about how to approach the task" and "did not recognise the inefficiency of its own strategy," targets the model's metacognitive and agentic reasoning layer rather than its raw capability on any discrete subtask.

Anthropic's own release documentation positions Opus 4.7 as a net improvement, citing gains on coding benchmarks, vision tasks, BigLaw Bench (90.9% accuracy), and OfficeQA Pro (21% fewer errors than 4.6). These gains are real and corroborated by developer feedback in engineering-heavy workflows. The issue is that model releases are increasingly evaluated across heterogeneous use cases, and a model that improves on coding and structured benchmarks while regressing on long-document retrieval and file generation represents a capability tradeoff, not an unambiguous upgrade. Anthropic did publish migration notes addressing tokenizer changes that increase input token counts by 1.0–1.35×, but did not prominently flag the long-context retrieval regression — a gap in transparency that has amplified user frustration. A mid-April verbosity-reducing prompt change that degraded coding quality across multiple models, including 4.7, and was subsequently reverted on April 20, further signals that the deployment pipeline for 4.7 encountered instability beyond the initial release.

The broader pattern here reflects a structural tension in frontier AI development between benchmark-driven release cycles and the demands of users who have integrated prior model versions into load-bearing workflows. As Claude models are increasingly used in agentic, multi-step production contexts — precisely the use cases Anthropic has been publicly prioritizing — the cost of behavioral regressions rises substantially. A model that fails mid-task, consumes an entire session budget, and requests continuation is not merely underperforming; it is actively creating work for the user it was meant to serve. The growing practice of users maintaining fallback versions of prior models — the poster reverted to 4.6, developers are formally advised to test against 4.6 for document-heavy applications — points to an emerging reliability expectation that the industry has not yet standardized around: that model upgrades should preserve, at minimum, the capability floor of their predecessors on tasks already in production use.

Read original article →

Detailed Analysis

Don't Miss a Deploy