Detailed Analysis
A Reddit user's benchmarking of Claude Code across Opus 4.6 and Opus 4.7 has surfaced a significant token usage discrepancy, with Opus 4.7 consuming between 5x and 8x more tokens than its predecessor when performing a codebase documentation task on a small Express/SQLite API repository of approximately 12 files and 500 lines of code. The user's hypothesis centers on what they describe as serialized tool execution: whereas Opus 4.6 batched multiple independent file reads into a small number of model requests (as few as 3), Opus 4.7 appears to issue one Read tool call per model request, generating 16–20 round trips for the same task. Because each model request re-reads the large cached Claude Code system context, cache-read tokens accumulate dramatically — from roughly 50,566 tokens across 3 requests with Opus 4.6 to 432,557 tokens across 16 requests with Opus 4.7. The user's JSONL transcript analysis shows a repetitive assistant→tool_result→assistant→tool_result loop that stands in contrast to the more parallelized behavior observed in 4.6.
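The reported totals can be turned into per-request averages with a few lines of arithmetic. The sketch below uses only the four figures quoted above; the dictionary keys and the helper name are illustrative choices, not anything taken from the user's analysis.

```python
# Figures as reported in the post; the per-request averages are derived here.
REPORTED = {
    "opus-4.6": {"requests": 3, "cache_read_tokens": 50_566},
    "opus-4.7": {"requests": 16, "cache_read_tokens": 432_557},
}


def per_request_average(run: dict) -> float:
    """Average cache-read tokens per model request for one run."""
    return run["cache_read_tokens"] / run["requests"]


averages = {name: per_request_average(run) for name, run in REPORTED.items()}
total_ratio = (
    REPORTED["opus-4.7"]["cache_read_tokens"]
    / REPORTED["opus-4.6"]["cache_read_tokens"]
)

for name, avg in averages.items():
    print(f"{name}: {avg:,.0f} cache-read tokens per request")
print(f"total cache-read ratio (4.7 / 4.6): {total_ratio:.2f}x")
```

On these numbers the average works out to roughly 16,900 cache-read tokens per request for Opus 4.6 and roughly 27,000 for Opus 4.7, so both the number of requests and the cost of each request went up.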
However, the research context introduces an important complication: Anthropic's documentation and developer community discussions point to a tokenizer change in Opus 4.7 as a primary driver of elevated token counts, rather than confirming the serialization hypothesis as the definitive cause. Opus 4.7's new tokenizer encodes the same source content using 1.0x to 1.35x as many tokens as 4.6 did, with the largest impact on structured data like JSON (up to 35% more tokens) and source code (15–25% more), and minimal impact on plain English prose. This tokenizer shift means that even if the model's tool-call behavior were identical, raw token counts would be somewhat higher for most content. That said, the magnitude of the reported discrepancy, roughly 8.5x as many cache-read tokens (432,557 versus 50,566), cannot be explained by a tokenizer increase of at most 35%, which suggests that behavioral changes in how Opus 4.7 plans and sequences tool use within Claude Code's agent loop are also a genuine contributing factor, even if the precise mechanism is still under investigation.
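A back-of-the-envelope decomposition makes the gap between the two explanations concrete. The sketch below assumes the total cache-read growth factors into (number of requests) times (tokens per request), and it applies the 1.35x worst-case tokenizer figure to the whole cached context, which is a deliberately generous assumption rather than a measured value.

```python
# Reported totals from the post.
requests_46, cache_read_46 = 3, 50_566
requests_47, cache_read_47 = 16, 432_557

# Upper bound on tokenizer inflation cited in the text (structured data such as JSON).
# Applying it to the entire cached context is an assumption made for this sketch.
MAX_TOKENIZER_FACTOR = 1.35

observed_ratio = cache_read_47 / cache_read_46           # total growth in cache reads
round_trip_factor = requests_47 / requests_46            # growth from extra requests
per_request_factor = observed_ratio / round_trip_factor  # growth in tokens per request
unexplained = per_request_factor / MAX_TOKENIZER_FACTOR  # left after the tokenizer bound

print(f"observed growth:        {observed_ratio:.2f}x")
print(f"from extra round trips: {round_trip_factor:.2f}x")
print(f"per-request growth:     {per_request_factor:.2f}x")
print(f"beyond tokenizer bound: {unexplained:.2f}x")
```

Under these assumptions the extra round trips account for most of the growth (about 5.3x), while the tokenizer can account for at most 1.35x of the remaining per-request increase of about 1.6x. The leftover factor of roughly 1.2x is unattributed here; one plausible but unconfirmed source is conversation context that grows with each additional turn.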
The compounding effect the user identifies is particularly important for understanding the cost and performance implications. Each serialized tool call does not merely add one file's worth of tokens; it adds a full re-read of the Claude Code system prompt and tool context, which the user estimates at 20,000–30,000 cached tokens per round trip. Across 15–20 tool requests, this produces hundreds of thousands of cache-read tokens for a trivially small repository. The user also raises a qualitative concern: between each file read, Opus 4.7 appears to generate extended reasoning about which file to read next and its progress, producing output tokens that neither solve the underlying problem nor feed useful information back into the task. This inter-step "thinking overhead" degrades not only efficiency but potentially answer quality, as the model's effective context window becomes increasingly saturated with redundant metacognition rather than file contents relevant to the documentation task.
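The compounding effect can be illustrated with a toy model. The sketch below assumes that every request re-reads a fixed base context plus all file contents returned so far, and that one final request writes the documentation. The 25,000-token base is the midpoint of the user's 20,000–30,000 estimate; the 500 tokens per file and the batch size of 6 are hypothetical values chosen for illustration, and the model ignores cache writes, output tokens, and inter-step reasoning.

```python
def cache_read_total(
    base_context: int, file_tokens: list[int], batch_size: int
) -> tuple[int, int]:
    """Return (requests, total cache-read tokens) for reading files in batches.

    Simplified model: each request re-reads the fixed base context plus all
    file contents returned by earlier requests. One extra request at the end
    stands in for writing the final answer.
    """
    requests = 0
    total = 0
    accumulated = 0  # file tokens already present in the conversation
    for start in range(0, len(file_tokens), batch_size):
        requests += 1
        total += base_context + accumulated  # request that issues this batch
        accumulated += sum(file_tokens[start:start + batch_size])
    requests += 1
    total += base_context + accumulated  # final request with every file loaded
    return requests, total


files = [500] * 12  # hypothetical: 12 files of about 500 tokens each

serialized = cache_read_total(25_000, files, batch_size=1)
batched = cache_read_total(25_000, files, batch_size=6)

print(f"serialized: {serialized[0]} requests, {serialized[1]:,} cache-read tokens")
print(f"batched:    {batched[0]} requests, {batched[1]:,} cache-read tokens")
```

Even in this simplified model the file contents are a small share of the total: almost all of the difference between the two runs comes from re-reading the base context on every additional round trip.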
The broader significance of this issue sits at the intersection of model capability and agentic deployment design. Claude Code's value proposition depends heavily on efficient multi-step tool use, and regressions in tool-batching behavior directly undermine the economics of using frontier models in coding workflows. Anthropic's documentation acknowledges that token efficiency in Opus 4.7 varies substantially by workload and that prompting strategies carried over unchanged from 4.6 often require recalibration; it recommends using the `effort` parameter, task budgets, and conciseness-oriented prompt adjustments. Some benchmarks even suggest Opus 4.7 can be 50% cheaper than 4.6 in optimized setups, which implies that the efficiency gap is addressable but that the burden of optimization falls on developers rather than being handled transparently by the model or the Claude Code framework itself.
This episode reflects a recurring challenge in the deployment of increasingly capable large language models in agentic systems: more powerful models do not automatically translate to more efficient agents. The shift from Opus 4.6 to 4.7 introduces changes in tokenization, reasoning style, and tool-use behavior that interact in ways users cannot anticipate without careful benchmarking. The community-driven discovery of this regression — surfaced through JSONL transcript analysis rather than official documentation — underscores the growing importance of observability tooling in AI agent deployments and raises questions about how Anthropic communicates behavioral changes between model versions to developers who rely on consistent agentic performance as a foundation for production systems.
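The kind of transcript analysis described here can be approximated with a short script. The sketch below is a minimal example, not the user's actual tooling: it assumes each JSONL line is a record with a `type` field and, for assistant turns, a `message.usage.cache_read_input_tokens` count. That record shape is an assumption and should be checked against real transcripts before relying on the results.

```python
import json
from typing import Iterable


def summarize_transcript(lines: Iterable[str]) -> dict:
    """Count assistant requests and sum cache-read tokens in a JSONL transcript.

    Assumed (unverified) record shape:
    {"type": "assistant", "message": {"usage": {"cache_read_input_tokens": int}}}
    """
    requests = 0
    cache_read = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("type") != "assistant":
            continue  # skip user and tool_result records
        usage = (record.get("message") or {}).get("usage") or {}
        requests += 1
        cache_read += int(usage.get("cache_read_input_tokens", 0))
    return {"requests": requests, "cache_read_tokens": cache_read}


# Synthetic sample records for demonstration only; not real transcript data.
sample = [
    json.dumps({"type": "assistant", "message": {"usage": {"cache_read_input_tokens": 20000}}}),
    json.dumps({"type": "user", "message": {"content": "tool_result"}}),
    json.dumps({"type": "assistant", "message": {"usage": {"cache_read_input_tokens": 21000}}}),
]

summary = summarize_transcript(sample)
print(summary)
```

To run it against a file, pass an open file handle instead of the sample list, for example `summarize_transcript(open(path, encoding="utf-8"))`. If a transcript splits one model response across several records, the request count would need deduplication before comparing runs.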