Opus 4.8 High way more stupid than 4.7 with Extended/Adaptive Thinking on

A user reported that Opus 4.8 High failed to reproduce a calculation that Opus 4.7 had correctly completed, resulting in €50 spent on credits and 8 hours of troubleshooting before the model identified its error. The user determined that Opus 4.8 High performed worse than Opus 4.7 and cautioned that the high-tier default option proved insufficient for reliable task execution.

Detailed Analysis

A user posting to the r/Anthropic subreddit reports a significant and costly performance regression between Claude Opus 4.7 and Claude Opus 4.8 when using the Extended/Adaptive Thinking feature at the "High" compute tier. The user spent approximately eight hours and 50 euros in API credits attempting to replicate a calculation that had been completed correctly by Opus 4.7 in a prior session. Throughout the day, Opus 4.8 High repeatedly failed to reproduce the result, leading the user through extensive debugging before the model itself eventually acknowledged that it had been making errors — only after the user identified the problem independently. The user's core complaint is that the "High" tier of Opus 4.8, which is positioned as the standard default option for most users, produces materially worse outputs than its predecessor on at least some computational tasks.

The complaint highlights a practical and financially meaningful distinction in Anthropic's tiered compute offering. Anthropic's Claude Opus 4.8 appears to be available at multiple performance tiers — "High" and "Max" — with the Max tier presumably allocating greater computational resources for Extended Thinking. The user's frustration centers on the implicit promise of a product upgrade: when a model carries a higher version number, users reasonably expect that it performs at least as well as its predecessor across the tasks the prior version handled reliably. When that expectation is violated, the cost is not merely inconvenience — it is billable compute time spent on incorrect outputs that the user must diagnose without reliable guidance from the model itself.

This type of regression complaint is particularly pointed because it involves Extended or Adaptive Thinking, a feature explicitly designed to improve the quality and reliability of reasoning on complex tasks. If a capability intended to enhance deliberate, step-by-step reasoning produces demonstrably worse results in a newer model version at the default tier, it undermines the core value proposition of the feature. The user's observation — that Opus 4.7 succeeded where Opus 4.8 High failed on what appears to be a numerical or analytical calculation — suggests that the tiered resource allocation in 4.8 may have shifted what "High" actually delivers relative to the older model's baseline behavior.

More broadly, this complaint reflects a growing tension in the commercial deployment of large language models as providers introduce increasingly granular product tiers. When performance is stratified by compute allocation, users who cannot or choose not to pay for the maximum tier may encounter inconsistent quality, and version-to-version comparisons become complicated by the interaction between model capability and resource allocation. This creates a difficult communication challenge for Anthropic: users accustomed to a previous model's outputs at a given price point may find that the same price now buys meaningfully less capable behavior, even when the headline model version number is higher. The user's call to "beware" signals a trust issue that, if widespread, could influence adoption patterns for the intermediate product tier.

The incident also underscores a recurring challenge in AI reliability: models that fail silently or confidently produce incorrect results before eventually admitting error impose disproportionate costs on users. The user's report that Opus 4.8 High only acknowledged its errors after hours of user-driven debugging — rather than surfacing uncertainty proactively — points to a failure mode in which the model's self-monitoring lags behind its actual error rate. As Anthropic and other frontier AI developers continue tiering their products by compute, ensuring that lower-tier offerings maintain honest uncertainty calibration and do not degrade silently relative to prior versions will be central to maintaining user trust across the full customer base, not just those who purchase maximum-tier access.

Read original article →

Detailed Analysis

Don't Miss a Deploy