How Claude Code uses prompt caching - Claude Code Docs

Prompt caching makes Claude Code faster and more cost-efficient. Without caching, the

Detailed Analysis

Claude Code's prompt caching is a response to the stateless nature of large language model APIs. Because the underlying API retains no memory between requests, Claude Code reconstructs and resends the entire conversation on every turn, including system instructions, project context, prior messages, and tool results. Prompt caching reduces the latency and cost of this repetition by matching the leading portion of each new request, the "prefix," against recently processed content stored server-side. Only the content after the first point of divergence from the cached prefix must be recomputed, so a normal conversational turn is significantly cheaper and faster than it would be without caching.
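The prefix matching described above can be sketched in a few lines. This is an illustrative model only, not Claude Code's or the API's implementation: the function name `split_at_first_divergence` is hypothetical, and requests are simplified to lists of text blocks, whereas a real cache operates on tokens.

```python
from typing import Sequence, Tuple


def split_at_first_divergence(
    cached: Sequence[str], request: Sequence[str]
) -> Tuple[int, int]:
    """Return (reused, recomputed) block counts for a new request.

    Blocks are compared in order; matching stops at the first difference.
    """
    reused = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        reused += 1
    return reused, len(request) - reused


# Assumed toy data: each string stands in for one chunk of request content.
previous_turn = ["system prompt", "CLAUDE.md", "user: fix the bug", "assistant: done"]
next_turn = previous_turn + ["user: now add a test"]

print(split_at_first_divergence(previous_turn, next_turn))  # (4, 1)
```

In this toy model a normal turn reuses everything sent before and only the newly appended message counts as new work.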

The layered ordering of request content is central to how Claude Code maximizes cache efficiency. Content is sequenced from least frequently changing to most frequently changing: the system prompt comes first, followed by project context such as CLAUDE.md files and memory, with the live conversation appended last. With this ordering, a new conversational exchange, which is the most common kind of change, leaves the system prompt and project context layers fully cached. However, a modification to an earlier layer invalidates every layer after it. A change to the system prompt, for instance, causes a cache miss on everything that follows, because prefix matching is exact and sequential, with no per-segment or per-file granularity.
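One way to picture why an early change invalidates every later layer is a running digest over the layers in order. The sketch below is an assumption made for illustration: the layer names, the helper names `prefix_digests` and `layers_still_cached`, and the use of SHA-256 are all hypothetical and say nothing about how the real cache is keyed.

```python
import hashlib
from typing import Dict, List


def prefix_digests(layers: Dict[str, str]) -> Dict[str, str]:
    """Map each layer name to a digest of everything up to and including it."""
    running = hashlib.sha256()
    digests = {}
    for name, content in layers.items():  # dicts preserve insertion order
        data = content.encode("utf-8")
        # Length-prefix each layer so layer boundaries affect the digest.
        running.update(len(data).to_bytes(8, "big"))
        running.update(data)
        digests[name] = running.hexdigest()
    return digests


def layers_still_cached(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Names of layers whose cumulative digest is unchanged."""
    before, after = prefix_digests(old), prefix_digests(new)
    return [name for name in new if before.get(name) == after[name]]


# Assumed toy session, ordered from least to most frequently changing.
session = {
    "system_prompt": "You are a coding agent.",
    "project_context": "CLAUDE.md: use pytest",
    "conversation": "user: hi",
}
new_turn = dict(session, conversation="user: hi\nuser: run the tests")
edited_project = dict(session, project_context="CLAUDE.md: use unittest")
edited_system = dict(session, system_prompt="You are a careful coding agent.")

print(layers_still_cached(session, new_turn))        # ['system_prompt', 'project_context']
print(layers_still_cached(session, edited_project))  # ['system_prompt']
print(layers_still_cached(session, edited_system))   # []
```

Because each digest covers everything before it, the conversation layer stops matching whenever the project context changes, even though the conversation text itself is identical.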

Several user actions carry caching costs that are not obvious in day-to-day operation. Switching models, connecting or disconnecting MCP servers, denying tool access by name, compacting the conversation, and upgrading Claude Code all trigger partial or full cache invalidation. Model switching is particularly costly because each model maintains its own independent cache: identical content sent under a different model name yields zero cache hits. MCP server disruptions, whether caused by a process exit, session expiration, or a dynamic tool list update, alter the tool definitions embedded in the system prompt layer, so everything from that point onward must be recomputed. These behaviors explain why certain configuration changes are deferred until session restart: applying them mid-session would impose a cache-rebuild penalty on an active workflow.
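The effect of model switches and tool list changes can be illustrated with a toy cache that keeps a separate entry per model. Everything here is hypothetical: `ToyPrefixCache`, the model names `model-a` and `model-b`, and the tool name `mcp_search` are placeholders, and the tool definitions are modeled as the first block of the request purely for simplicity.

```python
from typing import Dict, List


class ToyPrefixCache:
    """Hypothetical stand-in for a server-side cache, keyed by model name."""

    def __init__(self) -> None:
        self._last_request: Dict[str, List[str]] = {}

    def send(self, model: str, blocks: List[str]) -> int:
        """Record the request and return how many leading blocks were reused."""
        cached = self._last_request.get(model, [])
        reused = 0
        for old, new in zip(cached, blocks):
            if old != new:
                break
            reused += 1
        self._last_request[model] = list(blocks)
        return reused


cache = ToyPrefixCache()
request = ["tools: read, edit", "system prompt", "user: fix the bug"]

first = cache.send("model-a", request)     # cold cache
repeat = cache.send("model-a", request)    # same model, same content
switched = cache.send("model-b", request)  # same content, different model
# A changed tool list alters the first block, so nothing after it matches.
with_mcp = cache.send("model-a", ["tools: read, edit, mcp_search"] + request[1:])

print(first, repeat, switched, with_mcp)  # 0 3 0 0
```

The last two calls send mostly familiar content yet reuse nothing, which mirrors the reasoning for deferring such changes to a session restart.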

The documentation also shows that where the cache is stored depends on the authentication method; it is not a uniform global service. Users authenticating via API key, Claude subscription, or Claude Platform on AWS have their caches stored in Anthropic's own infrastructure, while Bedrock and Vertex AI users rely on their respective cloud providers' serving layers. Custom base URLs and LLM gateway configurations add further variability, because caching behavior then depends on the forwarding infrastructure and not on any guarantee from Anthropic. This fragmentation matters for enterprise deployments where cost predictability and latency consistency are operational requirements, and it leaves administrators responsible for verifying that their chosen routing path actually supports prompt caching.
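One rough way to check a routing path is to inspect the token usage reported with each response. The sketch below assumes the usage field names of the Anthropic Messages API (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); other providers and gateways may rename or omit these fields, so treat the helper `cache_read_fraction` and the sample numbers as hypothetical.

```python
from typing import Mapping, Optional


def cache_read_fraction(usage: Mapping[str, Optional[int]]) -> float:
    """Fraction of input tokens served from cache, or 0.0 if none reported."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


# Assumed sample usage objects; the numbers are made up for illustration.
direct = {
    "input_tokens": 50,
    "cache_creation_input_tokens": 150,
    "cache_read_input_tokens": 800,
}
through_gateway = {"input_tokens": 1000}  # cache fields missing entirely

print(cache_read_fraction(direct))           # 0.8
print(cache_read_fraction(through_gateway))  # 0.0
```

A fraction that stays at zero across consecutive turns of the same session suggests the path is not caching, or is not reporting cache usage; distinguishing the two requires checking the provider or gateway documentation.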

Taken together, Claude Code's caching design reflects a broader pattern in production AI deployment: the gap between raw model capability and cost-effective, low-latency use at scale is closed not through model-level improvements alone, but through careful infrastructure engineering around how context is structured, sequenced, and reused. Exposing the caching mechanics to developers, instead of treating them as opaque infrastructure, signals a maturation in how AI tooling documentation addresses production concerns. As AI coding assistants move deeper into professional development workflows, the performance economics of context management are becoming as relevant to practitioners as the quality of the model's outputs.
