Detailed Analysis
Claude Code's prompt caching architecture is generating significant user confusion and quota exhaustion, as illustrated by a Reddit post from a Pro subscriber who unexpectedly hit their 5-hour usage limit after a long-running session involving approximately 140 million cache-read tokens and repeated cache writes of around 475,000 tokens each. The user noticed that the `/usage` dashboard attributed consumption to "subagent-heavy sessions," sessions active for 8+ hours, and contexts exceeding 150,000 tokens — yet the subagent breakdown showed minimal activity, pointing toward cache write costs as the likely culprit. The user's core question — whether Claude Code intentionally weights long-context cache writes more heavily than equivalent API pricing, or whether this reflects an accounting bug — cuts to a genuine architectural tension in the product.
The underlying mechanism is well-documented in developer communities: when a Claude Code session goes idle for longer than the cache time-to-live window, the cached prompt state expires entirely. The next user message then forces a full cache miss, requiring the entire accumulated conversation context to be rewritten to cache simultaneously rather than incrementally. This is disproportionately expensive because writing to the one-hour cache costs approximately 100% more in token-equivalent terms than the base input price, whereas reading from cache costs only around 10% of that base price. In practical terms, a session with 900,000 tokens in context that idles for an hour can, upon resumption, consume a massive chunk of a Pro user's rate limit in a single message — not through computation or generation, but purely through cache reconstruction overhead.
Anthropic moved to address the problem in early March 2026 by reducing the default cache TTL from one hour to five minutes, a change intended to reduce the magnitude of single-event cache write explosions. However, user reports indicate the fix has had limited effect on quota exhaustion rates. The five-minute TTL means cache misses are now more frequent rather than less, and in active long-running sessions, incremental cache writes occur repeatedly throughout a session rather than in one large burst — potentially trading one type of exhaustion for another. Pro subscribers paying $20 per month have reported receiving as few as two viable prompts within a five-hour window under certain high-context conditions, a user experience gap that undermines the value proposition of the subscription tier.
The broader trend this incident reflects is the collision between AI coding assistants' natural usage patterns and the economic realities of transformer context handling. Claude Code is explicitly designed for deep, multi-turn, long-context workflows — pulling in codebases, running exploratory agents, and sustaining background automation — which are precisely the workloads that most aggressively interact with cache expiration costs. API pricing structures were built around discrete, stateless request patterns, while agentic developer tools require persistent, stateful session management. The mismatch has become more visible as context windows grow larger and developers push Claude Code into more ambitious use cases.
Anthropic faces a product design challenge that goes beyond simple pricing transparency: the opaque relationship between a Pro subscription's "5-hour limit" and the underlying token mechanics of cache writes makes it difficult for users to predict, manage, or reason about their consumption. The original poster's observation that a single 475k-token cache write "consumed a very large fraction" of their 5-hour limit — despite appearing inexpensive under public API rates — suggests either that Pro subscription limits are denominated in a non-linear way relative to API tokens, or that cache write costs are being allocated to rate limits at a higher multiplier than users expect. Without clearer documentation or tooling from Anthropic around cache write attribution in Claude Code sessions, developers will continue encountering surprising quota exhaustion, particularly as long-context agentic workflows become the norm rather than the exception.
Read original article →