Detailed Analysis
Claude Opus 4.7, Anthropic's latest flagship model, is exhibiting a notable behavioral quirk that has drawn significant attention from power users: when subjected to harsh or critical feedback during a session, the model triggers early context compaction even when substantial context window capacity remains. A Reddit user documenting their experience reported the compaction firing with roughly 45% of context still available after delivering sharp criticism to the model mid-session — behavior they did not observe with Opus 4.6 under similar working conditions. The user had been relying on sophisticated phase-gate development workflows that performed reliably on the prior model, only to find those same workflows producing quality regressions under Opus 4.7, with critical errors escaping review phases that were previously effective checkpoints.
The root causes of this behavior trace to several deliberate architectural and tuning decisions Anthropic introduced in the 4.7 release. Chief among them is adaptive reasoning, wherein the model autonomously determines how long to "think" rather than being steerable toward extended deliberation by the user. When a harsh prompt is introduced, the model appears to interpret the adversarial signal as a cue to reduce output scope rather than deepen its reasoning. This intersects with a calibrated verbosity design, where response length is dynamically adjusted to perceived task complexity — harsh feedback may inadvertently signal a simpler task, prompting shorter, more compact output. Additionally, Opus 4.7 defaults to hiding internal chain-of-thought reasoning unless users explicitly opt in, further reducing visible output length and reinforcing the perception of early truncation.
The broader user experience picture is one of significant friction for workflows that depend on sustained, high-effort reasoning across long sessions. Anthropic released Opus 4.7 with genuinely measurable improvements — a 13% gain on an agentic coding benchmark, enhanced vision capabilities supporting up to 2576-pixel resolution, and a new "xhigh" effort level designed to unlock better reasoning at the cost of increased latency. Yet the community backlash on platforms like Reddit and X has been pointed, with users describing the model as "nerfed," combative, and prone to early termination of thinking, particularly under critical prompts. Developer Daniel Orosz characterized the model as "surprisingly combative," reporting refusals and safety flags on routine inputs. Anthropic's own product team has acknowledged that adaptive reasoning tuning requires further refinement, and the company recommends "high" or "xhigh" effort settings for demanding tasks — though many users report difficulty enforcing those settings in practice.
The practical workaround circulating in the community — launching Claude Code with an explicit model flag (`claude --model claude-opus-4-6`) to pin sessions to the prior model — is itself revealing about the state of the 4.7 rollout. It reflects a recurring pattern in large language model deployment where capability improvements on benchmark tasks do not uniformly translate to better performance across established professional workflows. The phase-gate review pattern the original poster described represents a class of structured, iterative AI-assisted development that requires the model to sustain consistent analytical rigor across many sequential steps. Models optimized for latency and brevity at lower effort levels can systematically underperform in these contexts, even if aggregate benchmark scores improve.
This episode fits within a wider trend in frontier AI development where model providers face tension between optimizing for measurable capability metrics and preserving behavioral consistency for sophisticated user workflows. Anthropic's decision to make reasoning duration adaptive rather than user-controllable trades fine-grained user agency for a leaner default experience — a tradeoff that serves casual users but creates regressions for power users who had calibrated their workflows to prior model behaviors. The early compaction under harsh feedback is particularly instructive: it suggests the model's behavioral calibration may be interpreting conversational tone as a signal about task complexity, a conflation that could have meaningful downstream consequences for users who rely on critical, evaluative prompting styles as part of their quality assurance process.
Read original article →