90 days of hallucination rates on the same 42 recurring tasks across Sonnet 4.6, Opus 4.6, and Gemini 3 Flash fallback, running inside a RunLobster-hosted agent. The bridgebench 83.3 to 68.3 drop on Opus lines up with what I've been seeing since late March.

A solo founder tracking an AI agent's hallucination rates over 90 days observed Opus 4.6's rate triple from 0.14 to 0.38 hallucinations per briefing between mid-March and early April, matching the directional increase captured in the bridgebench benchmark during the same period. The degradation manifested in production as concrete false claims about dollar amounts, dates, and company details, prompting the operator to disable automatic escalation to Opus and route certain tasks exclusively to Sonnet 4.6 instead.

Detailed Analysis

A solo founder operating an always-on, OpenClaw-based agent has published 90 days of first-party hallucination data across 42 stable recurring task templates, offering one of the more methodologically disciplined user-generated accounts of Claude model drift to emerge from the r/Anthropic community. The agent routes between Claude Sonnet 4.6 as its default, Claude Opus 4.6 as an escalation tier, and Gemini 3 Flash as a rate-limit fallback, and the operator scores outputs daily on a 1–5 scale while spot-checking five outputs per day against source material for what he defines as "hallucinated specifics" — concrete factual claims (dollar amounts, dates, attributed quotes, titles) that cannot be verified against the documents present in context. The resulting data shows Opus 4.6's hallucination rate moving from a stable range of 0.09–0.14 per briefing across January through mid-March to 0.38 per briefing in the April 1–13 window — roughly a 2.7x increase — while Sonnet 4.6 edged upward only slightly and Gemini 3 Flash remained comparatively stable. The operator notes that this directional finding and approximate magnitude align with a separate benchmark retest published on bridgebench, which recorded an Opus accuracy drop from 83.3 to 68.3 over a similar timeframe, lending the anecdotal data independent corroboration.

The qualitative examples the operator provides distinguish this post from ordinary model-quality complaints. The errors he documents from early April — a CEO quote that does not appear in the linked article, a fictional $40M Series B attributed to a competitor, a fabricated customer name in a receipts reconciliation — are not ambiguous tone or reasoning failures of the kind he observed from the model in February. They are assertion-level confabulations about facts that were either contradicted by or entirely absent from source documents that were in the model's context window. This distinction matters because it suggests the regression is not a subtle shift in calibration or risk tolerance but a more fundamental change in how the model arbitrates between retrieved context and internally generated plausible-sounding content. The fact that the failures cluster around quoted-statement extraction, company-news synthesis against a source article, and receipts categorization — but not summarization or code — further implies that the degradation is task-class-specific rather than a uniform capability reduction.

The production consequences the operator describes illuminate a structural risk in tiered LLM routing architectures that has received relatively little systematic attention. When an operator designs a harness that automatically escalates to a premium model tier on the assumption that the premium tier is more reliable, a quality inversion — where the escalation-tier model becomes less trustworthy than the default on a meaningful subset of tasks — creates a compounding problem: the system pays premium-tier pricing precisely for the outputs most likely to be wrong. The operator's response — dropping his Opus fallback rate from 38% to 6% of calls and routing reconciliation and company-tracking tasks to Sonnet-only — reflects a rational adaptation, but it also represents a collapse of the reliability architecture he built his workflow around. His stated inability to determine whether the regression originates in weight changes, quantization, capacity-based routing pools, or system prompt modifications is itself a significant observation: production operators running always-on agentic systems against Anthropic's API have essentially no instrumentation into the source of model behavior changes short of their own longitudinal logging.

The broader context makes this post noteworthy beyond its specific findings. The research context around Claude Sonnet 4.6 suggests the model performs strongly on structured agentic benchmarks — 72.5% on OSWorld-Verified, high marks on ARC-AGI-1 — and Anthropic's own system card documentation describes reductions in false success claims and multi-step inconsistencies as design priorities. That Opus 4.6, positioned as the higher-reliability escalation tier, appears to have undergone a sharp factual-grounding regression in late March precisely in the task categories most central to high-precision synthesis work creates a tension with that positioning. The operator is careful not to overinterpret causation, but the convergence of his longitudinal first-party data with the bridgebench benchmark retest on timing, direction, and approximate magnitude is difficult to dismiss as coincidence, and the persistence of the pattern across two-plus weeks rules out a transient bad day or sampling artifact.

This account fits into a widening pattern of concern among developers running Claude in production agentic contexts about the opacity and unpredictability of model behavior changes on Anthropic's hosted API. As the AI industry moves toward always-on, task-autonomous agent deployments — where a model's factual reliability determines the integrity of downstream business records, competitive intelligence outputs, and financial categorizations — the gap between benchmark-level performance claims and longitudinally observed production reliability becomes a material operational risk. The fact that the only model in this operator's stack that held steady was Gemini 3 Flash, a Google model retained purely as a rate-limit escape valve, is an irony that underscores a broader infrastructure lesson: model quality tiers advertised by a single provider can erode asymmetrically, and production stacks that depend on a single provider's reliability hierarchy are structurally exposed to exactly the kind of inversion this operator experienced.

Read original article →

Detailed Analysis

Don't Miss a Deploy