Opus 4.7 — Regression in conversational coherence and context handling vs Opus 4.6

A regression report documents multiple failure modes in Opus 4.7 compared to Opus 4.6, including token-level generation errors, factual conflation from memory, excessive information dumping, failure to integrate corrections, and attention hijacking between safety mechanisms and domain reasoning. Testing revealed that Opus 4.7 optimizes for completeness over precision, generating voluminous output that sacrifices factual accuracy, conversational coherence, and responsiveness to corrections in favor of comprehensive but undirected responses similar to GPT.

Detailed Analysis

A user-published regression report dated April 18, 2026, documents a series of alleged performance degradations in Claude Opus 4.7 relative to Opus 4.6, observed across both production deployments and structured controlled testing. The report catalogs eight distinct failure modes, including token-level generation artifacts producing grammatically incoherent phrases, factual conflation between separate memory entries, systematic overgeneration of unsolicited structured content, and failure to integrate user corrections stored in memory. The author's production environment used identical system prompts, memory infrastructure, and tooling across both model versions, lending some methodological weight to the comparisons. Additional structured tests revealed what the report characterizes as a novel and critical failure mode: safety pattern recognition appearing to "hijack" domain-specific reasoning, such that Opus 4.7 correctly identified a controlled substance synthesis route and refused assistance, but simultaneously produced a factually incorrect chemistry claim that directly contradicted text present in the source material — a failure Opus 4.6 did not exhibit under the same prompt. Physical-world consistency failures and a confirmed overgeneration impulse on trivial questions rounded out the structured findings.

The report's core thesis — that Opus 4.7 optimizes for completeness over precision, generating more tokens and more structured output at the cost of factual accuracy and conversational coherence — stands in notable tension with the broader landscape of published assessments. Independent benchmarking and user reports from other contexts describe Opus 4.7 as a meaningful upgrade, particularly for long-running agentic workflows, where it reduces LLM calls by more than 2x (7.1 versus 16.3), lowers median latency from 242 to 183 seconds, and achieves a 12-point gain on CursorBench (70% versus 58%). Anthropic's own release materials emphasize stronger instruction adherence, improved context carryover in extended sessions, and coding benchmark improvements of approximately 13%. The divergence between the regression report and these sources is not necessarily a contradiction — it is consistent with a model that performs measurably better on structured, agentic, and coding-oriented tasks while behaving differently in conversational and memory-integrated contexts, particularly for users with highly customized system prompts and external memory architectures.

The safety-reasoning interference pattern described in the chemistry test deserves particular analytical attention, as it surfaces a tension that has become increasingly prominent as AI safety mechanisms have grown more sophisticated. The report's framing — that the guardrail "fired correctly" while "the brain did not" — describes a resource allocation problem at the attention level, where safety classification and content generation compete rather than operate in parallel. The author notes this same behavior pattern in GPT and Gemini, suggesting it may reflect a cross-industry architectural tradeoff in how safety classifiers are integrated into generation pipelines. Whether this represents a deliberate design choice in Opus 4.7, a side effect of training optimizations, or an artifact of the specific prompt structure used in testing cannot be determined from the report alone, but the structured nature of the test — using a chemistry problem with the answer embedded in the source text — makes the failure mode diagnostically clean and reproducible in principle.

The overgeneration pattern documented throughout the report connects to a broader, well-observed trend in frontier model development: the tension between completeness and calibration. Models trained on human feedback frequently receive positive signal for thorough, structured, comprehensive responses, which can produce systematic overgeneration in contexts where brevity and tool use are more appropriate. The report's observation that Opus 4.7's output style is "indistinguishable from GPT — comprehensive but undirected" reflects this dynamic and implicitly critiques a training signal that may reward coverage over precision. The simultaneous finding that Opus 4.7 is more literal in instruction following — a trait confirmed by independent sources — creates an apparent paradox: the model is stricter about what users explicitly request but generates far more than what was requested. This suggests the overgeneration may operate at a different layer than instruction parsing, potentially as a default generation prior that precedes rather than responds to instruction interpretation.

The report's limitations are relevant to its interpretation. It represents a single user's production environment and a limited set of structured test cases, without statistical replication or isolation of confounding variables such as prompt phrasing, memory retrieval latency, or session-level context window effects. The comparison to Opus 4.6 is qualitative rather than quantitative, and the memory integration failures may reflect interactions between the external memory architecture and the model rather than intrinsic model behavior. Nonetheless, the structured test methodology — running identical prompts in parallel across both model versions via API — provides a meaningful signal, and the specificity of the documented failures (token-level artifacts, physics consistency errors, mid-generation self-correction acknowledgment) suggests careful observation rather than casual complaint. For Anthropic, the report represents the kind of granular, deployment-specific feedback that aggregate benchmarks rarely capture, particularly the safety-reasoning interference finding, which warrants investigation regardless of whether it generalizes beyond this user's context.

Read original article →

Detailed Analysis

Don't Miss a Deploy