4.7 is a step up for serious workflows and a step down otherwise

Claude 4.7 interprets instructions more literally than its predecessors, eliminating the tendency to infer user intent inappropriately or fill in ambiguities without requesting clarification. A developer with 30+ custom skills found this change significantly improved productivity, as the system now properly escalates decisions and follows explicit instructions precisely rather than making baseless assumptions. Those experiencing regressions with 4.7 may have been relying on the system's previous inferential approach, which functioned less predictably with ambiguous prompts.

Detailed Analysis

Claude Opus 4.7 has emerged as a notably more disciplined instruction-follower compared to its predecessors, a shift that is generating sharply divided reactions among users depending on how they interact with the model. The article's author, who operates an elaborate agentic workflow built on more than 30 custom skills and an extensive documentation catalog, reports that chronic problems with Claude Opus 4.6 — specifically its tendency to infer user intent beyond what was explicitly stated and to "fill in the blanks" rather than escalate ambiguities — disappeared with the transition to 4.7. Custom subagents now behave as designed, escalating uncertainty to the main agent, reading affected files before executing refactors, and declining to resolve ambiguities not addressed in the prompt instructions. The author frames this as a dramatic productivity gain achieved in just three days, attributing prior friction entirely to the model's overconfident generalization rather than to any deficiency in the user's prompting.

The behavioral shift the author describes is confirmed by Anthropic's own release notes and independent benchmarking. Claude Opus 4.7 scores 64.3% on SWE-bench Pro — up from 53.4% in version 4.6 — and demonstrates 14% better performance at fewer tokens alongside a one-third reduction in tool errors. Anthropic explicitly designed 4.7 to be more literal, more self-verifying, and more role-faithful in team-style agentic setups. The model also introduces task budgets and an "xhigh" effort tier for compute-intensive reasoning, and it now adaptively allocates thinking based on task complexity rather than applying uniform verbosity across all queries. These changes collectively represent a deliberate architectural philosophy: the model should execute instructions as written rather than interpret them charitably or expansively.

The author's observation that perceived regressions stem from "sloppy and ambiguous prompts" is both provocative and analytically significant. It frames the model's stricter literalism not as a failure but as a mirror — one that surfaces the imprecision of prompts that earlier models papered over with inference. For power users who have invested in structured, explicit instruction sets, this is a feature. For casual users accustomed to models that treat a rough sketch as sufficient input, the same behavior reads as a step backward in helpfulness. The research context corroborates this duality: shorter, calibrated responses and reduced automatic tool calls mean that lightweight conversational workflows now require more explicit prompting effort to achieve the same output that 4.6 would have volunteered unprompted.

This tension reflects a broader and largely unresolved question in commercial AI development about whether models should optimize for perceived smoothness or for verifiable correctness. The generative AI field has long wrestled with the tradeoff between a model that "just works" through liberal interpretation — producing fluent, contextually plausible outputs — and one that demands precision but delivers predictability. Anthropic's trajectory with 4.7 suggests the company is deliberately moving toward the latter, at least for its Opus-tier offering, treating the model as infrastructure for professional workflows rather than as a conversational assistant. The simultaneous improvements in vision resolution, document parsing, and agentic reliability reinforce this positioning: 4.7 is engineered for users who build systems on top of Claude, not merely users who chat with it.

The real-world implications extend beyond individual user preference. As enterprises increasingly deploy LLMs as components within larger automated pipelines — where an errant inference can cascade into costly downstream errors — the value of a model that asks rather than assumes becomes structural rather than stylistic. The author's architecture of 30-plus custom skills designed to enforce a specific, futureproofed methodology is an early example of the kind of sophisticated human-in-the-loop system that 4.7 appears purpose-built to support. If Anthropic sustains this behavioral direction across future releases, it signals a maturing segmentation of the AI model market: one tier optimized for frictionless casual interaction, another for disciplined, high-stakes agentic deployment where prompt precision is both expected and rewarded.

Read original article →

Detailed Analysis

Don't Miss a Deploy