Opus 4.8 in default settings getting stuck thinking for a really long time

A user reported that Claude Opus 4.8 with default high effort settings was taking over 10 minutes to complete basic tasks, appearing stuck in thinking mode, while Claude Opus 4.7 completed similar tasks in under one minute. Performance remained problematic even after reducing effort levels and adjusting context window settings. The user expressed frustration that 4.8 represented a productivity downgrade compared to 4.7.

Detailed Analysis

A user posting to the r/ClaudeCode subreddit reports significant performance degradation with Claude Opus 4.8 under default high-effort settings, describing the model as getting "stuck" in extended thinking loops that consume more than ten minutes on tasks the poster characterizes as basic. The complaint centers on Claude Code, Anthropic's agentic coding environment, where the newer Opus 4.8 model appears to over-invest computational effort relative to task complexity. Attempts to mitigate the behavior by reducing the effort setting to medium and adjusting the context window to one million tokens did not resolve the problem. By contrast, reverting to Claude Opus 4.7 with extended thinking enabled yielded completion times under one minute for the same tasks, leading the user to conclude that 4.8 is a practical productivity regression even though it is presumably the more capable model.

The core issue the post surfaces is a mismatch between raw model capability and practical usability in agentic workflows. Extended thinking features, where models allocate additional internal reasoning steps before producing output, are designed to improve accuracy on complex, multi-step problems. However, when these features engage disproportionately on straightforward tasks, the result is a latency penalty that undermines the utility of the system in real-world development environments. The user's observation that 4.8 is "getting lost in the sauce" captures a genuine tension in large language model deployment: increasing a model's capacity for deep reasoning does not automatically improve its ability to calibrate when deep reasoning is warranted.

This complaint connects to a broader challenge Anthropic and other frontier AI labs face as they scale models with extended or chain-of-thought reasoning capabilities. Models like OpenAI's o-series and Anthropic's own extended-thinking variants have demonstrated that longer internal reasoning chains improve benchmark performance on difficult tasks, but the inference cost, in both time and compute, scales accordingly. The critical engineering problem becomes teaching models to match reasoning depth to task difficulty dynamically, rather than defaulting to maximal deliberation. User reports like this one suggest that calibration remains an unsolved practical problem even as raw capability metrics improve.
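
The idea of matching reasoning depth to task difficulty can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the Task fields, the scoring weights, the thresholds, and the token budgets are assumptions chosen for illustration. It does not describe how Claude Code or any Anthropic model actually decides how long to think.

```python
from dataclasses import dataclass

# Hypothetical thinking budgets (in tokens) per effort tier; illustrative values only.
BUDGETS = {"low": 1_000, "medium": 8_000, "high": 32_000}


@dataclass
class Task:
    """Hypothetical description of a coding task handed to an agent."""
    description: str
    files_touched: int
    requires_design: bool


def estimate_difficulty(task: Task) -> float:
    """Return a crude difficulty score in [0, 1] from hand-picked signals."""
    score = min(task.files_touched, 10) / 10 * 0.5       # breadth of the change
    score += 0.4 if task.requires_design else 0.0        # open-ended design work
    score += 0.1 if len(task.description.split()) > 50 else 0.0  # long spec
    return min(score, 1.0)


def pick_effort(task: Task) -> str:
    """Map the difficulty score to an effort tier instead of always using 'high'."""
    difficulty = estimate_difficulty(task)
    if difficulty < 0.3:
        return "low"
    if difficulty < 0.7:
        return "medium"
    return "high"


if __name__ == "__main__":
    rename = Task("rename a variable", files_touched=1, requires_design=False)
    refactor = Task("redesign the auth module", files_touched=10, requires_design=True)
    for task in (rename, refactor):
        tier = pick_effort(task)
        print(f"{task.description!r}: effort={tier}, budget={BUDGETS[tier]} tokens")
```

A hand-written heuristic like this is only a stand-in for the calibration problem discussed above. In practice the difficulty estimate would have to come from the model itself or from a learned router, which is the part the report suggests is still unsolved.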

From a product and competitive standpoint, the report highlights the risk of releasing models where capability improvements on benchmarks do not translate cleanly into improved user experience in production agentic settings. Claude Code competes directly with tools like GitHub Copilot Workspace, Cursor, and other AI coding assistants where speed and responsiveness are primary user expectations alongside accuracy. A successor model that is slower on routine tasks than its predecessor, regardless of its performance ceiling on complex problems, creates real friction for users who have built workflows around the previous generation. The fact that the user explicitly states 4.7 "was actually working for us" signals potential churn risk and suggests the gap between controlled evaluation environments and real-world deployment conditions remains a significant challenge for Anthropic's model release pipeline.


