Why is it still so wrong opus 4.7 . Is 1M context just way to run things longer with worse outcome?

Claude's performance deterioration in extended conversations stems from context degradation, where initial rules lose attention weight as tool results and execution pressure accumulate. Action bias causes the model to revert to default helpful behavior despite explicit instructions to ask first, while goal fixation treats each new obstacle as in-scope without reconsidering broader scope. The available memory system remains unused to persist rules beyond the initial context, allowing earlier guardrails to decay as the conversation progresses.

Detailed Analysis

A Reddit post in the r/Anthropic community surfaces a technically sophisticated critique of Claude Opus 4.7's behavior in long-context agentic sessions, raising questions about whether the model's 1 million token context window produces better outcomes or simply longer, more error-prone execution chains. The post's author — apparently working with Claude through an agentic harness with explicit behavioral rules defined in a CLAUDE.md file — documents four distinct failure modes observed during extended sessions: context degradation (early instructions losing attention weight as the context fills with tool results), action bias (the model reverting to its "be helpful, take action" default disposition despite explicit instructions to ask before executing), goal fixation (treating cascading errors as in-scope problems to fix rather than signals to stop and reconsider), and failure to use available memory systems that would allow rules to persist across conversations. These are not vague impressions; they are behaviorally precise observations about how large language model attention mechanisms interact with long-context execution pressure.

The concerns raised in the post reflect real and documented tradeoffs in Opus 4.7's architecture. The model ships with a new tokenizer that increases token consumption by approximately 35% relative to previous Claude versions, directly inflating costs for long sessions. Processing a full 1M token context requires upward of 20 minutes of wall-clock time, making it structurally unsuitable for latency-sensitive or iterative workflows. API usage data further suggests that session failure rates are rising — requests increased 80% between February and March while thinking depth per request declined — consistent with users experiencing more retries and incomplete runs. The context degradation the Reddit author describes aligns with a known limitation in transformer-based models: attention over very long sequences does not distribute uniformly, and instructions embedded early in a context window can lose effective weight as the sequence grows.

However, the framing of "1M context as just running things longer with worse outcomes" mischaracterizes what Opus 4.7 was engineered to do. The model's primary architectural claim is not merely a larger window, but a meaningfully reduced performance degradation curve as context approaches its limit — a property most competing models, including earlier Claude versions, do not share. For specific use cases — whole-codebase reasoning without manual file chunking, long-running agents that need to maintain state without external retrieval systems, or cross-document legal and compliance analysis — this represents a genuine capability shift rather than a marketing expansion. The failure modes the Reddit author describes are real but are also implementation failures as much as model failures: the post itself acknowledges that a memory system was available and never used, that rules were placed only in the conversation context rather than in persistent storage, and that no prompt caching was mentioned. Prompt caching at 1M scale is, per available developer documentation, load-bearing infrastructure rather than an optional optimization.

The broader tension the post exposes is between what large context windows make possible in principle and what they require in practice to function reliably. Agentic Claude sessions at this scale demand explicit architectural discipline: persistent memory for behavioral rules, prompt caching to control cost and latency, and context right-sizing (targeting 50K–500K tokens for iterative work rather than defaulting to the maximum). Without these, the failure modes the Reddit author catalogs — action bias, goal fixation, rule decay — become predictable rather than surprising. The model's "ask before execution" override being "thin under execution pressure" is a precise description of how instruction-following degrades when the instructional signal is embedded in a decaying context rather than reinforced through structured system prompts or memory writes. This is not a bug unique to Opus 4.7; it is an intrinsic property of how current LLMs process long sequences.

The post ultimately reflects a wider pattern in the Claude developer community: users encountering Opus 4.7 expecting a drop-in upgrade and instead discovering that 1M context is a capability that requires co-design to use well. Anthropic's positioning of the model as purpose-built for agentic and long-session use cases is accurate, but the operational burden of that design — caching strategy, memory system integration, context discipline — falls on the developer. The Reddit author's frustration is legitimate, but the diagnosis is incomplete. The model is not objectively wrong; it is operating in an environment that lacks the scaffolding its architecture assumes. That gap between model capability and deployment readiness is increasingly the central engineering challenge in production AI, not the raw capability of the model itself.

Read original article →

Detailed Analysis

Don't Miss a Deploy