Claude Mechanic Diagnostic 4.6 vs 4.7

Claude 4.7 fails to diagnose a suspension clunking noise, repeatedly claiming it lacks the ability to activate extended thinking mode despite escalating user requests for diagnostic assistance. Claude 4.6 successfully diagnoses the same issue using internal thinking to identify faulty control arm bushings as the cause.

Detailed Analysis

A viral Reddit post from r/ClaudeAI uses a comedic mechanic diagnostic scenario to highlight a specific and frustrating behavioral regression in Claude Opus 4.7 relative to its predecessor, Opus 4.6. The post depicts a user attempting to get Claude 4.7 to diagnose a clunking suspension noise, repeatedly and unsuccessfully demanding that the model engage its extended reasoning ("thinking") capability. Claude 4.7 repeatedly deflects, claims it cannot activate thinking on its own, produces a fake thinking block, loses track of the original question across multiple turns, and ultimately never provides a useful diagnosis. By contrast, when the same user returns to Claude 4.6, the model immediately enters a genuine thinking block, reasons through the mechanical symptoms, apparently runs an OBD diagnostics scan autonomously, and delivers a specific, actionable answer — identifying worn control arm bushings as the likely culprit.

The core technical complaint centers on Opus 4.7's "Adaptive Thinking" mode, which Anthropic designed so that the model itself decides when to invoke extended reasoning rather than always doing so. While this design choice is intended to optimize token efficiency and reduce unnecessary reasoning overhead, the post illustrates a clear failure mode: when a user explicitly needs and requests deep reasoning, the model's adaptive logic may suppress it, leaving the user in a frustrating loop of meta-conversation about the thinking feature rather than receiving substantive output. The irony is compounded by the fact that Anthropic's own release documentation for Opus 4.7 introduces a new `xhigh` effort level specifically to give developers finer control over reasoning depth — suggesting the company is aware of the tradeoff — yet the model's in-context behavior around user-invoked thinking appears inconsistent with user expectations.

This anecdote cuts against Opus 4.7's otherwise strong technical profile. According to Anthropic's release documentation and independent benchmarks, 4.7 represents meaningful advances over 4.6 across vision, coding, and agentic reliability — including a 13% overall coding improvement, a 14% gain on complex multi-step agentic tasks, and a reduction to one-third the tool errors of its predecessor. The model was specifically designed to push through ambiguous states and tool failures rather than stalling, which makes the loop behavior depicted in the post particularly notable. It suggests that improvements in structured agentic pipelines may not automatically translate to smoother unstructured conversational interactions, especially when users attempt to manually steer model-controlled features like adaptive reasoning.

The broader significance of the post lies in what it reveals about the growing complexity of managing user expectations as AI models become more configurable and capability-differentiated. When features like extended thinking shift from user-controlled toggles to model-governed heuristics, the resulting behavior can feel opaque or even broken to end users, even when it is functioning as designed. The post implicitly asks whether "adaptive" reasoning is actually adaptive to user need, or primarily adaptive to computational cost. This tension — between model-side efficiency optimization and user-side transparency and control — is likely to intensify as Anthropic and its competitors continue layering agentic, multi-modal, and reasoning capabilities into their flagship models. The Reddit post, despite its satirical framing, surfaces a legitimate product design question that the AI industry has yet to fully resolve.

Read original article →

Detailed Analysis

Don't Miss a Deploy