
Give Me 10 Mins and I'll Save You Millions of Claude Tokens

YouTube · Nate Herk | AI Automation · May 21, 2026
Prompt caching bills cached tokens at 10% of the normal input rate, and the author reports savings of 91 million tokens in one day and over 300 million in a week. The cache stays active for one hour on Claude subscriptions but only five minutes by default on the API, after which all content must be re-cached. Caching spans three layers (system instructions, project context, and conversation messages), and each layer breaks the cache at a different point when it is modified or when the time-to-live window expires.

Detailed Analysis

Prompt caching in Claude and Claude Code is an often overlooked mechanism for reducing token consumption and operating costs for developers and power users. The video, framed as a practical tutorial, shows that cached tokens are billed at only 10% of the standard input token rate, so a session that reads 91 million tokens from cache costs about as much as processing 9 million uncached tokens. Over a week, the author claims savings exceeding 300 million tokens through this mechanism alone. This caching happens automatically in Claude Code and the Claude interfaces, so in most circumstances users do not need to configure anything.
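The discount arithmetic can be checked directly. The following is a minimal sketch, assuming the 10% cached-read rate quoted above and treating all 91 million tokens as cache reads; the function name is illustrative and does not come from the source.

```python
def effective_input_tokens(cached_tokens: int, read_rate: float = 0.10) -> float:
    """Full-price-equivalent token count for tokens served from cache.

    read_rate is the fraction of the normal input price charged for a
    cached read (10% according to the summary above).
    """
    return cached_tokens * read_rate


daily_cached = 91_000_000  # tokens read from cache in one day, as reported
equivalent = effective_input_tokens(daily_cached)
avoided = daily_cached - equivalent

print(f"billed like ~{equivalent / 1e6:.1f}M uncached tokens")
print(f"~{avoided / 1e6:.1f}M token-equivalents avoided")
```

Under these assumptions the gap between tokens processed and tokens effectively paid for is what the savings figures describe; the exact accounting used in the video may differ.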

Prompt caching operates across three distinct layers: a system layer containing global instructions, tool definitions, and output style parameters; a project layer housing files such as CLAUDE.md and project-specific memory; and a conversation layer that grows with each user turn. When a session begins, all content must be written to the cache for the first time (a "cache create" event), but subsequent turns within the caching window reuse that stored context as cheaper "cache reads." The time-to-live (TTL) window is one hour for Claude Code running in the terminal or as an extension, while API and sub-agent usage defaults to a much shorter five-minute TTL, though this can be extended. If a session sits idle past the TTL, or the system prompt is modified mid-session, the entire context must be re-cached from scratch. That reset grows more expensive the deeper into a conversation it happens, because more context has accumulated.
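The create, read, and expire cycle described above can be illustrated with a toy model. This is a sketch under stated assumptions, not how Anthropic implements caching: the class and method names are hypothetical, cache writes are assumed to be billed at the full input rate (real pricing may add a write premium), and each turn is assumed to refresh the TTL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheSim:
    ttl_seconds: float                 # 3600 for a one-hour window, 300 for five minutes
    read_rate: float = 0.10            # cached reads at 10% of the input rate
    write_rate: float = 1.0            # assumption: writes billed at the full input rate
    cached_tokens: int = 0
    last_used: Optional[float] = None

    def invalidate(self) -> None:
        """Model a prefix change such as an edited system prompt."""
        self.cached_tokens = 0

    def turn(self, now: float, context_tokens: int) -> float:
        """Return the full-price-equivalent cost of one turn.

        context_tokens is the total prompt size for this turn
        (system + project + conversation so far).
        """
        expired = self.last_used is None or (now - self.last_used) > self.ttl_seconds
        reused = 0 if expired else min(self.cached_tokens, context_tokens)
        new = context_tokens - reused
        cost = reused * self.read_rate + new * self.write_rate
        # The whole prompt is now cached and the TTL clock restarts.
        self.cached_tokens = context_tokens
        self.last_used = now
        return cost


sim = CacheSim(ttl_seconds=300)
costs = [
    sim.turn(now=0, context_tokens=50_000),    # first turn: cache create
    sim.turn(now=60, context_tokens=52_000),   # within TTL: cache read plus 2k new tokens
    sim.turn(now=600, context_tokens=54_000),  # idle past TTL: everything re-cached
]
print(costs)
```

The third turn shows why an expired window hurts more late in a session: the full accumulated context is paid for again at the write rate rather than the read rate.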

Anthropic's internal practices show how seriously the company treats this optimization. The video cites Thariq from Anthropic, who noted that the company runs monitoring alerts on prompt cache hit rates and declares "SEVs" (severity events) internally when hit rates fall below acceptable thresholds. Prompt caching is therefore not merely a cost-saving feature for end users but part of Anthropic's infrastructure strategy: high cache hit rates reduce Anthropic's own serving costs, make Claude Code feel more responsive, and let subscription session limits stretch further, which benefits both the company and its users.
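A hit-rate check of the kind described could look roughly like the sketch below. The source does not say how Anthropic defines its hit rate or what threshold triggers an alert, so both the formula (cache reads divided by all input tokens) and the 0.80 threshold are assumptions. The dictionary keys mirror the usage fields reported by the Anthropic Messages API; the sample numbers are made up for illustration.

```python
def cache_hit_rate(usages):
    """Share of input tokens served from cache across a list of usage records."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + created + uncached
    return read / total if total else 0.0


ALERT_THRESHOLD = 0.80  # hypothetical value, not from the source

# Illustrative usage records for three consecutive turns of one session.
session = [
    {"input_tokens": 200, "cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0},
    {"input_tokens": 300, "cache_creation_input_tokens": 2_000, "cache_read_input_tokens": 50_000},
    {"input_tokens": 250, "cache_creation_input_tokens": 1_500, "cache_read_input_tokens": 52_000},
]

rate = cache_hit_rate(session)
should_alert = rate < ALERT_THRESHOLD
print(f"hit rate {rate:.2%}, alert: {should_alert}")
```

Short sessions score low on this metric because the initial cache create dominates; the rate climbs as more turns reuse the same prefix.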

The tutorial reflects a growing reality in AI development: as large language model usage scales, token economics become a first-order engineering and financial concern rather than an afterthought. With cached reads billed at a tenth of the normal rate, the gap between naive and optimized usage can approach an order of magnitude in input cost, particularly for teams running long agentic coding sessions with tools like Claude Code that involve rich system prompts and extensive project context. The emergence of community-built dashboards and tutorials focused specifically on token tracking suggests that a community of sophisticated Claude power users is forming around infrastructure-level optimization, much as cloud computing communities formed around cost management tools for AWS and similar platforms. Anthropic's developer-facing communications about caching internals, such as the article referenced by the author, make the company relatively open about its system architecture compared to some competitors, which may accelerate this community-driven optimization culture.
