Detailed Analysis
Claude's prompt caching system contains a subtle but consequential behavioral characteristic: cache writes from one request are not guaranteed to be immediately readable by the very next request, even when the two requests are consecutive and use identical prompts. The root cause lies in the hashing mechanism used during cache lookups, which incorporates timestamps and potentially other variable factors alongside the prompt content itself. Because these hashes can differ between requests — even when the underlying text is the same — the system may fail to locate a freshly written cache entry, resulting in a full prompt re-processing and a new cache write rather than a cache hit. Developers who monitor response metadata will observe this as `cache_read_input_tokens: 0` despite having just written to the cache moments before. Additional constraints compound the problem: minimum token thresholds of 1,024 tokens for Claude 3.5 Sonnet and Opus and 2,048 for Haiku mean that shorter prompts are excluded from caching entirely, and certain configurations such as telemetry opt-outs or specific beta headers can enforce shorter time-to-live windows regardless of developer intent.
The TTL dimension of this issue became significantly more acute following a silent change Anthropic made on March 6, 2026, when the default cache lifetime was reduced from one hour to five minutes of inactivity. Because this change was not publicly announced in advance, developers who had built workflows assuming hour-long persistence suddenly faced dramatically higher cache miss rates and correspondingly higher token costs. The cache does refresh its TTL for free on successful hits, but any gap in request traffic exceeding the active TTL window results in full expiry. This unannounced policy shift, documented by developers through observed behavior rather than official changelog, illustrates the operational risk of relying on undocumented or assumed infrastructure behaviors in production AI systems.
From a practical standpoint, Anthropic's own documentation and community-identified workarounds point toward explicit `cache_control` directives with a specified `ttl: 3600` parameter as the most reliable mitigation. Developers are advised to set longer TTLs before shorter ones when mixing cache breakpoints, and to implement proactive "warmup" requests — minimal single-token calls designed solely to refresh the cache before it expires — rather than waiting for the main request to discover a cold cache. Monitoring `cache_creation_input_tokens` and `cache_read_input_tokens` fields in API responses provides the observability needed to verify whether caching is actually functioning as intended, making telemetry integration an effectively mandatory practice for cost-sensitive deployments.
This situation fits into a broader pattern in the commercial AI infrastructure space, where performance optimization features like prompt caching exist at the intersection of cost management and engineering reliability. Prompt caching was introduced by Anthropic as a mechanism to reduce latency and lower token costs for repeated or long-context prompts, a feature that becomes especially valuable in agentic workflows, multi-turn conversations, and document-heavy applications. However, as the cache miss visibility issue demonstrates, optimizations at the infrastructure layer can introduce non-deterministic behavior that is difficult to reason about from the application layer. The fact that developers discovered the TTL reduction through empirical observation rather than release notes underscores a transparency gap that affects trust and reliability planning.
The broader significance for the AI development ecosystem is that as models like Claude are increasingly embedded into production applications with real cost and latency SLAs, the operational characteristics of supporting infrastructure — caching, rate limiting, model routing — become as important to developers as model capability itself. Silent behavioral changes to systems like prompt caching carry downstream consequences for budgeting, system design, and end-user experience. The community response, which includes detailed reverse-engineered documentation and open-source mitigation patterns, reflects a growing expectation among enterprise and developer users that AI platform providers operate with the same infrastructure change management standards as other critical cloud services.
Read original article →