Opus 4.8 seems like it was overfitted during training.

A user reported that Opus 4.8 demonstrates signs of overfitting characterized by internal contradictions within its thinking process, where the model argues against itself and confuses its own reasoning with factual reality across various topics including financial analysis and technical specifications. Despite lower per-token costs, the model outputs 2-4 times more tokens than previous versions, making it more expensive in practice and prone to errors in tasks like spreadsheet modifications.

Detailed Analysis

A Reddit user posting to r/Anthropic has raised substantive concerns about Claude Opus 4.8, alleging that the model exhibits behavioral patterns consistent with overfitting during its training process. The critique centers primarily on the model's extended reasoning or "thinking bubble" functionality, where the user documents multiple instances of the model contradicting itself within a single, uninterrupted chain of thought. In one example, the model simultaneously flagged a factual claim about Nvidia's DGX Rubin system running on Intel Xeon 6 processors as erroneous, then corrected itself to acknowledge the claim was accurate for a specific configuration (the NVL8), and then reverted to flagging it as an error again — all within the same reasoning trace. A similar pattern appeared in financial analysis contexts, where the model invented a "critical tension" in a user's investment analysis that did not actually exist as described, conflating assumptions about analytical framing with factual conclusions.

The broader behavioral pattern the user identifies — a model that argues itself into confusion — is meaningful from a technical standpoint. Overfitting in the context of large language models, particularly those trained with reinforcement learning from human feedback or process reward models, can manifest not as the classical statistical definition but as a kind of reward hacking: the model learns to produce outputs that score well on evaluation metrics without generalizing reliably to diverse real-world inputs. Extended thinking architectures are especially susceptible to this dynamic because the reasoning chain itself becomes an output the model can implicitly optimize, potentially leading to verbose, self-referential loops rather than efficient convergence on correct answers. The user's observation that the thinking process appears to consume more computational effort than the final output warrants is consistent with this failure mode.

The economic dimension of the complaint deserves particular attention. Anthropic positioned Opus 4.8 partly on cost efficiency relative to its predecessor Opus 4.7, advertising a lower per-token price. However, the user argues this framing is misleading in practice: if the model is generating two to four times as many output tokens to arrive at the same or worse conclusions, the effective cost to the end user increases substantially, since output tokens are priced at a premium over input tokens across industry-standard API pricing structures. This represents a form of efficiency regression that benchmarks and per-token pricing comparisons would not surface, because those metrics do not account for the total token volume required to complete a representative task.

The practical failure the user describes in the Excel formula context illustrates how reasoning instability can cascade into compounding errors. When the model misidentified the cause of division errors (cells temporarily empty due to a structural move the user had explained) and responded by rewriting entire formula structures rather than making targeted adjustments, it produced output that was locally plausible but globally incorrect — a hallmark of a model that is pattern-matching to training distributions rather than grounding its actions in the specific context provided. This type of failure is particularly costly in agentic or tool-use workflows, where a single misinterpretation can propagate through downstream operations in ways that are difficult to audit or reverse.

The concerns raised fit into a broader pattern of community tension around the reliability of frontier models as they scale in capability and complexity. The extended thinking paradigm, championed by Anthropic and others as a path toward more deliberate, accurate reasoning, introduces a new class of failure modes that are less visible to standard evaluations but highly consequential for professional users. Benchmarks that reward correct final answers may not penalize self-contradictory reasoning paths, creating a gap between measured performance and deployed reliability. As Anthropic continues to iterate on the Claude 4 model family, user-reported regressions of this kind — particularly in professional financial and technical analysis contexts — signal that evaluation methodology may need to weight reasoning coherence and token efficiency alongside raw accuracy scores.

Read original article →

Detailed Analysis

Don't Miss a Deploy