
Stream responses in real-time - Claude Code Docs

Claude Docs · May 5, 2026
The Agent SDK streams Claude's responses in real time when include_partial_messages is set to true, which makes it yield StreamEvent messages containing raw API events as text and tool calls are generated. Developers can extract incremental text chunks from content_block_delta events whose delta type is text_delta, and track tool execution through the corresponding block start, delta, and stop events, to build responsive chat interfaces. Streaming has two known limitations: when extended thinking is explicitly enabled, no StreamEvent messages are emitted and only complete messages arrive for each turn, and structured output is never streamed incrementally but appears only in the final result.

Detailed Analysis

Anthropic's Claude Agent SDK documentation outlines a streaming capability that allows developers to receive incremental text and tool call outputs in real-time, rather than waiting for complete responses to be generated. By default, the SDK yields full `AssistantMessage` objects only after Claude finishes generating each response. Enabling the `include_partial_messages` flag in Python or `includePartialMessages` in TypeScript causes the SDK to additionally emit `StreamEvent` messages — wrappers around raw Claude API events — as they arrive. Developers must handle these events through a sequence of nested type checks: first identifying a `StreamEvent`, then examining the inner event for `content_block_delta` types, and finally extracting `text_delta` values to obtain actual text chunks. This architecture makes real-time, token-by-token output rendering feasible within agent-driven applications.
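The nested checks described above can be sketched in Python as follows. This is a minimal illustration, not the SDK's own code: it operates on plain dictionaries shaped like raw Claude API streaming events, and it assumes the inner event of each `StreamEvent` is available in that form. The helper name `iter_text_chunks` and the `sample_events` list are hypothetical and exist only for this example.

```python
from typing import Any, Iterable, Iterator


def iter_text_chunks(events: Iterable[dict[str, Any]]) -> Iterator[str]:
    """Yield the text carried by text_delta events and skip everything else."""
    for event in events:
        # Check 1: only content_block_delta events carry incremental content.
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        # Check 2: only text_delta payloads contain displayable text.
        if delta.get("type") == "text_delta":
            yield delta.get("text", "")


# Hypothetical event sequence, hand-written to mimic one streamed text block.
sample_events = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": ", world"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_stop"},
]

streamed_text = "".join(iter_text_chunks(sample_events))
```

In a real application the chunks would typically be written to the UI as they arrive rather than joined at the end; the first check (identifying a `StreamEvent` among the SDK's messages) would happen before the events reach a helper like this one.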

The streaming message flow follows a well-defined sequence that mirrors the underlying Claude API's event structure. Each assistant message begins with a `message_start` event, followed by a `content_block_start`, one or more `content_block_delta` events, and a `content_block_stop` for each block of content (text or tool invocation), and concludes with `message_delta` and `message_stop` events before the full `AssistantMessage` is delivered. Tool calls follow the same pattern: a `content_block_start` signals the beginning of a tool call, `input_json_delta` events carry the JSON input in fragments, and `content_block_stop` marks completion. This granular event model lets developers build responsive chat interfaces that show tool execution status indicators alongside live text output.
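One way to track tool calls across that event sequence is sketched below. It is an illustrative example under stated assumptions: events are plain dictionaries in the raw Claude API streaming shape, and the `ToolCallTracker` class, its `in_tool` flag, and the sample tool name and input are hypothetical names invented for this sketch rather than part of the SDK.

```python
import json
from typing import Any


class ToolCallTracker:
    """Accumulate streamed tool-call input and expose an 'in tool' status flag."""

    def __init__(self) -> None:
        # Content block index -> tool name and JSON fragments received so far.
        self.active: dict[int, dict[str, Any]] = {}
        # Finished calls as (tool name, parsed input) pairs.
        self.completed: list[tuple[str, dict[str, Any]]] = []

    @property
    def in_tool(self) -> bool:
        """True while at least one tool-call block is still streaming."""
        return bool(self.active)

    def handle(self, event: dict[str, Any]) -> None:
        etype = event.get("type")
        index = event.get("index")
        if etype == "content_block_start":
            block = event.get("content_block", {})
            if block.get("type") == "tool_use":
                self.active[index] = {"name": block.get("name", ""), "parts": []}
        elif etype == "content_block_delta":
            delta = event.get("delta", {})
            if delta.get("type") == "input_json_delta" and index in self.active:
                self.active[index]["parts"].append(delta.get("partial_json", ""))
        elif etype == "content_block_stop" and index in self.active:
            call = self.active.pop(index)
            raw = "".join(call["parts"])
            # The fragments only form valid JSON once the block is complete.
            self.completed.append((call["name"], json.loads(raw) if raw else {}))


# Hypothetical events for a single tool call whose input arrives in two fragments.
sample_tool_events = [
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_example", "name": "Read", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"file_pa'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'th": "/tmp/a.txt"}'}},
    {"type": "content_block_stop", "index": 1},
]

tracker = ToolCallTracker()
in_tool_states = []
for tool_event in sample_tool_events:
    tracker.handle(tool_event)
    in_tool_states.append(tracker.in_tool)
```

Parsing is deferred until `content_block_stop` because an individual `input_json_delta` fragment is generally not valid JSON on its own. A UI could read `in_tool` after each event to decide whether to show a status indicator.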

The documentation also notes two limitations on streaming. When extended thinking is explicitly configured through `max_thinking_tokens` (Python) or `maxThinkingTokens` (TypeScript), the SDK emits no `StreamEvent` messages and delivers only complete messages per turn. Similarly, structured output is not streamed incrementally; the JSON result appears only in the final `ResultMessage.structured_output` field. The documentation does not give a rationale for either carve-out. For structured outputs, a plausible explanation is that validation against a schema can only be confirmed on a complete object. The thinking restriction appears to be specific to the SDK rather than inherent to streaming, since the underlying Claude API can stream thinking content as `thinking_delta` events.

The introduction of granular streaming within the Claude Agent SDK reflects a broader industry shift toward treating AI models not as stateless query-response systems, but as persistent, multi-turn agents embedded within interactive software. The ability to stream tool call inputs as they are generated — not just text — is particularly significant for agentic workflows, where an AI may invoke file reads, web searches, or code execution steps sequentially or in parallel. Surfacing this intermediate state to end-users reduces perceived latency and builds trust in the agent's progress. Anthropic's explicit documentation of a streaming UI pattern, complete with in-tool status flags and real-time text rendering, signals that the company is actively designing its SDK for production-grade, user-facing deployment rather than purely for backend or batch processing contexts.
