Detailed Analysis
Users of Claude Code have reported a notable behavioral shift beginning around version 2.1.128, in which per-session in-context token consumption accelerates significantly faster than in prior releases. Community members describe conversations reaching 50,000, 100,000, and even 200,000 tokens with what they perceive as disproportionate speed relative to the actual content exchanged. Crucially, this phenomenon appears isolated to raw context window burn rather than usage cap consumption or rate limiting, meaning the practical impact falls on conversation longevity — sessions becoming context-exhausted sooner — rather than on API billing metrics that users typically monitor. Even lightweight interactions involving minimal reasoning steps, no tool calls, no file reads, and short response outputs are reportedly consuming 2,000 to 5,000 tokens per exchange, a figure that strikes users as anomalously high.
The affected users have conducted basic diagnostic steps — including fresh session initialization and disabling all plugins — without finding relief, pointing toward a change at the infrastructure or harness level rather than any user-configurable factor. Speculation within the community centers on a few plausible architectural explanations. One hypothesis is that Claude Code began injecting additional metadata, formatting structures, or system-level context into each message turn. Another is that extended thinking or reasoning traces, which Claude models can generate internally, may now be persisted into the active context rather than being discarded after generation. If the latter is the case, it would represent a significant shift in how Claude Code manages its internal scratchpad, trading context efficiency for potentially improved coherence across multi-turn sessions.
This issue matters because context window size is a finite and operationally significant resource in agentic coding workflows. Claude Code sessions frequently involve extended task sequences — debugging loops, multi-file refactors, iterative code generation — that naturally accumulate substantial conversational history. If token burn per turn has structurally increased, users running long autonomous sessions will find themselves truncating or restarting context more frequently, disrupting task continuity and potentially degrading output quality as earlier relevant information is lost. Unlike rate limits, which reset on a schedule, context exhaustion mid-task requires active intervention from the user and can break complex agentic chains entirely.
The reported change connects to a broader tension in large language model deployment between capability enrichment and resource efficiency. As Anthropic has expanded Claude's extended thinking features, tool use frameworks, and agentic scaffolding, each enhancement carries the risk of increasing the implicit overhead baked into each inference call. System prompts grow larger as more capabilities are declared; reasoning traces add token mass; richer message formatting increases the per-turn footprint. Version-to-version changes in these underlying structures are often opaque to end users, who observe the downstream effects — faster context saturation — without access to the internals that explain them. This opacity is a recurring friction point in the developer community around AI coding tools, where users expect the kind of changelog granularity common in conventional software releases.
Anthropic has not, as of the discussion captured in this post, publicly acknowledged or explained the change, leaving users to reverse-engineer observed behavior without ground truth. This pattern reflects a broader challenge for AI companies shipping agentic products: the interaction between capability updates and resource consumption is complex, and changes that improve reasoning quality or contextual coherence may simultaneously impose invisible costs on context efficiency. As Claude Code matures and its user base grows more sophisticated, pressure for greater transparency around per-version token overhead, system prompt sizes, and reasoning trace handling is likely to intensify, particularly among power users managing long-horizon autonomous coding tasks.
Read original article →