Show HN: I ran every Claude agent turn through the Batch API

An engineer routed every turn of a Claude agent loop through Anthropic's Batch API to evaluate the cost savings. The 50% discount is real, but each turn took 90–120 seconds to complete, so a five-turn interaction took approximately ten minutes. Although unsuitable for interactive single-agent scenarios, the Batch API could prove valuable as a hidden optimization layer for fleet-level deployments involving multiple parallel agents, background tasks, and CI jobs. The experiment also suggests that routing systems should measure batch latency for each model size rather than assume cheaper models complete batches faster.

Detailed Analysis

Anthropic's Batch API, which offers a 50% cost reduction over standard synchronous endpoints, became the subject of a hands-on engineering experiment in which a developer routed every turn of a Claude agent loop through the asynchronous batch processing system. The experiment, implemented as a minimal single-file Python REPL with a basic tool loop and local shell access, was motivated by the apparent financial appeal of applying batch pricing to agent workloads such as evaluations, background research tasks, and CI pipelines. The core finding was unambiguous: a single-entry batch consistently took between 90 and 120 seconds to complete, turning a five-turn tool interaction into a roughly ten-minute ordeal. At that latency, even a trivial operation, such as the model deciding to invoke a shell command like `ls`, is impractical at interactive speeds.
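The ten-minute figure follows directly from the per-turn latency. A minimal sketch of that arithmetic, assuming the turns run strictly one after another and that the reported 90 to 120 second range applies to every turn (the function name is illustrative, not from the original post):

```python
def loop_wall_clock_minutes(turns: int, per_turn_seconds: float) -> float:
    """Wall-clock time, in minutes, for a strictly sequential agent loop."""
    return turns * per_turn_seconds / 60


# Range reported in the experiment: 90 to 120 s per single-entry batch.
best_case = loop_wall_clock_minutes(5, 90)
worst_case = loop_wall_clock_minutes(5, 120)
print(f"five turns: {best_case:.1f} to {worst_case:.1f} minutes")
```

Because each turn depends on the previous tool result, the turns cannot overlap, so the per-turn latency multiplies rather than amortizes.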

The experiment's value lies less in the negative result than in the architectural insight it surfaces. The developer concludes that the individual agent turn is the wrong unit of analysis for batch economics. The correct framing is fleet-level parallelism: scenarios where dozens or hundreds of agents run concurrently, where background subagents operate outside user-facing response windows, or where shared prompt prefixes across many harnesses can benefit from prompt caching. In those configurations, the per-request latency penalty becomes acceptable or even irrelevant, while the cumulative cost savings across thousands of requests become substantial. Anthropic's Batch API was explicitly designed for high-volume, latency-tolerant workloads, and the experiment essentially confirms that the design contract is real: violating the latency-tolerance assumption produces predictably poor results.
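To make the fleet-level framing concrete, here is a rough sketch of how the 50% discount accumulates over many requests. The per-request price and the request count are hypothetical placeholders chosen for illustration; only the discount rate comes from the text above.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50  # the 50% reduction cited above


@dataclass
class FleetEstimate:
    sync_cost: float   # total spend if every request goes to the sync endpoint
    batch_cost: float  # total spend if every request goes through batch
    savings: float     # difference between the two


def estimate_fleet(requests: int, sync_cost_per_request: float) -> FleetEstimate:
    """Compare total spend for a fleet of latency-tolerant requests."""
    sync_cost = requests * sync_cost_per_request
    batch_cost = sync_cost * (1 - BATCH_DISCOUNT)
    return FleetEstimate(sync_cost, batch_cost, sync_cost - batch_cost)


# Hypothetical numbers: 10,000 background requests at an assumed $0.02 each.
estimate = estimate_fleet(requests=10_000, sync_cost_per_request=0.02)
print(f"sync ${estimate.sync_cost:.2f}, batch ${estimate.batch_cost:.2f}, "
      f"saved ${estimate.savings:.2f}")
```

The savings scale linearly with request volume, which is why the discount matters for a fleet and barely registers for one interactive session.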

A secondary observation from the experiment carries meaningful implications for infrastructure design: Haiku, Anthropic's smallest and cheapest model, did not consistently produce faster batch completions than Sonnet or Opus in the developer's informal testing. This challenges a common intuition that cheaper, faster models are natural candidates for batch queues. If batch processing time is partly determined by queue dynamics and server-side scheduling rather than by inference time alone, then routing logic that assumes "cheap model equals fast batch" may be systematically misconfigured. The developer explicitly recommends that any routing layer measure actual batch latency per model empirically rather than relying on pricing tiers as a proxy for speed.
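One way a routing layer could act on that recommendation is to record observed batch completion times per model and compare medians instead of pricing tiers. The sketch below is an assumption about how such bookkeeping might look; the class, the model labels, and the sample values are all made up for illustration and are not measurements from the experiment.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Optional


class BatchLatencyTracker:
    """Keeps observed batch completion times (seconds) for each model."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, model: str, seconds: float) -> None:
        self._samples[model].append(seconds)

    def median_latency(self, model: str) -> Optional[float]:
        # .get() avoids creating an empty entry for unknown models.
        samples = self._samples.get(model)
        return median(samples) if samples else None

    def fastest(self) -> Optional[str]:
        """Model with the lowest median observed latency, if any were measured."""
        medians = {m: median(s) for m, s in self._samples.items() if s}
        return min(medians, key=medians.get) if medians else None


tracker = BatchLatencyTracker()
# Placeholder samples only; real values must come from your own batches.
for seconds in (112.0, 98.0, 105.0):
    tracker.record("model-small", seconds)
for seconds in (95.0, 101.0, 118.0):
    tracker.record("model-large", seconds)

print(tracker.median_latency("model-small"), tracker.median_latency("model-large"))
print(tracker.fastest())
```

Medians are used here because a few slow batches would skew a mean; a production router would likely also want sample counts and a recency window.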

The broader architectural proposal is a transparent proxy that sits below existing agent harnesses and routes each request to a synchronous or asynchronous endpoint based on its latency tolerance. The idea reflects a maturing pattern in production AI infrastructure. Rather than requiring individual agent frameworks to be rewritten for batch awareness, the proxy approach preserves compatibility with tools that expect standard API shapes while enabling infrastructure-level optimization. This mirrors how HTTP caches and load balancers operate in web infrastructure: invisible to application logic, but consequential for cost and throughput at scale. The experiment connects to wider trends in agentic AI deployment, where the economics of running many concurrent Claude agents increasingly demand this kind of infrastructure-layer thinking rather than per-call optimization. That applies across platforms including AWS Bedrock and Google Vertex AI, both of which offer batch processing for Claude models.
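The routing decision at the heart of such a proxy can be sketched in a few lines. This is a hypothetical illustration, not the developer's implementation: the `RequestMeta` fields, the `choose_route` function, and the idea of passing in an expected batch latency are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SYNC = "sync"
    BATCH = "batch"


@dataclass(frozen=True)
class RequestMeta:
    interactive: bool        # True when a user is waiting on this turn
    deadline_seconds: float  # how long the caller can tolerate waiting


def choose_route(meta: RequestMeta, expected_batch_seconds: float) -> Route:
    """Send a request to batch only when its latency tolerance allows it."""
    if meta.interactive:
        return Route.SYNC
    if meta.deadline_seconds < expected_batch_seconds:
        return Route.SYNC
    return Route.BATCH


# 120 s is the upper end of the latency range reported in the experiment.
print(choose_route(RequestMeta(interactive=True, deadline_seconds=3600), 120))
print(choose_route(RequestMeta(interactive=False, deadline_seconds=30), 120))
print(choose_route(RequestMeta(interactive=False, deadline_seconds=3600), 120))
```

In practice the expected batch latency would come from measurement per model rather than a fixed constant, and the harness above the proxy would not need to know which route was taken.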
