How Bad MCP Design Costs Your Agent 5× More Tokens

A benchmark comparing two MCP servers with identical functionality found that one consumed nearly five times as many input tokens as the other while achieving the same 90% success rate. The less efficient server returned incomplete query results that forced extra agent calls, dumped unfiltered raw API data into the context window, and exposed 47 tools where the other exposed 14, all of which drove up token usage and agent steps. Effective MCP design means returning the context that subsequent actions will need, minimizing overlapping tools, and presenting API responses in an LLM-friendly form rather than passing raw JSON through.

Detailed Analysis

Model Context Protocol (MCP) tool design quality has a measurable and significant impact on LLM agent performance, as demonstrated by a controlled experiment comparing two MCP servers built for identical to-do list functionality. The author constructed MCP-A independently and then compared it against MCP-B, the application's officially released MCP server, running both against the same backend API and the same account data. Using an evaluation framework called MCP-Eval with the MiniMax-M2.7 model across 40 standardized test prompts, the results were striking: both servers achieved the same 90% pass rate, yet MCP-B consumed nearly five times more input tokens (3.17 million versus 637,000) and required 29% more agent steps to complete equivalent tasks. The performance gap was not a product of capability differences but entirely a consequence of design decisions.
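The headline ratio can be checked directly from the two reported totals. The short sketch below only restates the article's figures; the per-prompt averages are a derived illustration that assumes the totals cover all 40 prompts, which the article implies but does not break down.

```python
# Totals reported in the article (input tokens over the whole run).
MCP_B_INPUT_TOKENS = 3_170_000
MCP_A_INPUT_TOKENS = 637_000
PROMPT_COUNT = 40  # standardized test prompts in the run


def token_ratio(heavy: int, light: int) -> float:
    """Return how many times more input tokens the heavier server used."""
    return heavy / light


ratio = token_ratio(MCP_B_INPUT_TOKENS, MCP_A_INPUT_TOKENS)

# Assumption: both totals span all 40 prompts, so these are simple averages.
avg_b = MCP_B_INPUT_TOKENS / PROMPT_COUNT
avg_a = MCP_A_INPUT_TOKENS / PROMPT_COUNT

print(f"MCP-B used {ratio:.2f}x the input tokens of MCP-A")
print(f"Average per prompt: MCP-B {avg_b:,.0f} vs MCP-A {avg_a:,.0f}")
```

The ratio comes out just under five, which matches the "nearly five times" wording.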

Three specific anti-patterns drove MCP-B's inefficiency. First, its query tools returned incomplete data: the search tool omitted the `project_id` field that subsequent CRUD operations required, forcing the agent to make an additional tool call in every task that depended on it. Second, MCP-B passed raw API responses directly into the context window without filtering, including fields like `sortOrder`, `etag`, `focusSummaries`, `columnName`, and numerous null values that carry no semantic value for task completion. A single `create_task` response ballooned to 600+ characters of effectively inert data. Because agent sessions accumulate context across multiple loops, this padding compounds with each step, widening the token gap progressively over longer sessions. Third, MCP-B exposed 47 tools compared to MCP-A's 14, enlarging the model's decision space and increasing the probability of incorrect tool selection, which in turn generated retry loops and additional output tokens.
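The first two anti-patterns can be addressed in the same place: the point where a tool turns an API response into model-facing text. The following is a minimal sketch, not either server's actual code. The noise fields and `project_id` come from the article; the payload shape and the names `compact_task`, `render_task`, `id`, `title`, `due`, and `reminder` are assumptions made for illustration.

```python
import json

# Fields the article names as carrying no value for task completion.
NOISE_FIELDS = {"sortOrder", "etag", "focusSummaries", "columnName"}


def compact_task(raw: dict) -> dict:
    """Drop the known noise fields and any null values; keep the rest."""
    return {
        key: value
        for key, value in raw.items()
        if key not in NOISE_FIELDS and value is not None
    }


def render_task(raw: dict) -> str:
    """Render a task as one short line, always including project_id so a
    follow-up update or delete call does not need an extra lookup."""
    task = compact_task(raw)
    parts = [
        f'Task "{task["title"]}"',
        f'id={task["id"]}',
        f'project_id={task["project_id"]}',
    ]
    if "due" in task:
        parts.append(f'due={task["due"]}')
    return " | ".join(parts)


# Hypothetical raw payload, invented for this example.
raw_response = {
    "id": "t-101",
    "project_id": "p-7",
    "title": "Write release notes",
    "due": None,
    "sortOrder": -1099511627776,
    "etag": "a1b2c3",
    "focusSummaries": [],
    "columnName": None,
    "reminder": None,
}

print(render_task(raw_response))
print(len(json.dumps(raw_response)), "chars raw vs", len(render_task(raw_response)), "chars rendered")
```

Filtering with a blocklist is the simplest version of this idea. An allowlist of fields the agent is known to need would be stricter and would not let newly added API fields leak into the context.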

The findings reinforce a set of design principles that distinguish good MCP architecture from merely functional MCP architecture. Tools should be designed with downstream agent actions in mind — returning not just the data directly requested but the contextual identifiers and attributes the agent will predictably need next. Tools should also be orthogonal and composable, using parameterization to consolidate overlapping functionality rather than creating separate tools for each query variant. Critically, MCP servers should function as a semantic translation layer between raw API responses and LLM-readable output, stripping irrelevant fields and presenting structured, human-readable text rather than raw JSON. These are not merely stylistic preferences; they have quantifiable cost and performance implications at scale.
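Consolidating overlapping tools through parameterization might look like the sketch below. It is an illustration under assumptions rather than MCP-A's real interface: the `Task` shape, the `list_tasks` name, its parameters, and the variant tool names mentioned in the docstring are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    id: str
    project_id: str
    title: str
    due: Optional[date] = None
    completed: bool = False


def list_tasks(
    tasks: List[Task],
    *,
    project_id: Optional[str] = None,
    due_before: Optional[date] = None,
    completed: Optional[bool] = None,
    query: Optional[str] = None,
) -> List[Task]:
    """One parameterized query tool standing in for several variants
    (for example, hypothetical list_by_project, list_overdue, search_tasks).
    Every filter is optional, and the filters combine with AND."""
    results = []
    for task in tasks:
        if project_id is not None and task.project_id != project_id:
            continue
        if completed is not None and task.completed != completed:
            continue
        # Tasks without a due date never match a due_before filter.
        if due_before is not None and (task.due is None or task.due >= due_before):
            continue
        if query is not None and query.lower() not in task.title.lower():
            continue
        results.append(task)
    return results


# Hypothetical data for the usage example.
TASKS = [
    Task("t-1", "p-1", "Pay rent", date(2024, 1, 5)),
    Task("t-2", "p-1", "Book flights", date(2024, 3, 1), completed=True),
    Task("t-3", "p-2", "Pay invoice"),
]

open_in_p1 = list_tasks(TASKS, project_id="p-1", completed=False)
matching_pay = list_tasks(TASKS, query="pay")
print([t.id for t in open_in_p1], [t.id for t in matching_pay])
```

One tool with four optional filters covers combinations that would otherwise each need their own tool, which keeps the tool list shorter and the model's choice narrower.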

This experiment sits within a broader conversation about the engineering discipline required to build effective agentic systems. As MCP becomes a dominant standard for exposing external tools to LLMs — with adoption accelerating across enterprise software and developer tooling — the quality variance between implementations is becoming a meaningful operational concern. Token costs, latency, and context window exhaustion are real constraints in production deployments, and poorly designed tool interfaces directly translate to higher infrastructure costs and degraded reliability. The fact that an officially released MCP server from an application vendor performed significantly worse than an independently built alternative suggests that tool design best practices are not yet widely internalized, even among developers shipping production-grade integrations.

The broader implication is that MCP quality will increasingly differentiate AI-powered workflows as agentic applications mature. Organizations and developers building on top of third-party MCP servers have limited visibility into these design choices, yet bear their performance consequences. This points toward a need for standardized MCP evaluation tooling, design guidelines, and potentially quality benchmarks that go beyond functional correctness — assessing efficiency, context footprint, and decision-space complexity as first-class metrics alongside task pass rates.
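A scorecard that treats efficiency as a first-class metric could be as small as the sketch below. It is not the API of MCP-Eval or any other real framework; `RunResult`, `Scorecard`, `score`, and the sample numbers are hypothetical, chosen only to show pass rate reported next to token and step costs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunResult:
    passed: bool
    input_tokens: int
    steps: int


@dataclass(frozen=True)
class Scorecard:
    pass_rate: float
    input_tokens_per_pass: float  # total input tokens divided by passed tasks
    mean_steps: float
    tool_count: int  # rough proxy for decision-space complexity


def score(results: List[RunResult], tool_count: int) -> Scorecard:
    """Summarize a run so efficiency is reported next to correctness."""
    if not results:
        raise ValueError("results must not be empty")
    passes = sum(1 for r in results if r.passed)
    total_tokens = sum(r.input_tokens for r in results)
    return Scorecard(
        pass_rate=passes / len(results),
        input_tokens_per_pass=total_tokens / passes if passes else float("inf"),
        mean_steps=sum(r.steps for r in results) / len(results),
        tool_count=tool_count,
    )


# Hypothetical results for four prompts; not data from the article.
sample = [
    RunResult(True, 1000, 4),
    RunResult(True, 3000, 6),
    RunResult(False, 2000, 8),
    RunResult(True, 2000, 6),
]
card = score(sample, tool_count=14)
print(card)
```

Two servers with the same `pass_rate` can then be told apart by `input_tokens_per_pass` and `mean_steps`, which is the comparison the experiment above turned on.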
