Built an MCP proxy that killed my context bloat AND my RAM usage — here's how

A developer created an MCP proxy gateway to solve the inefficiency of running multiple AI coding agents that each spawned duplicate MCP server processes, consuming approximately 4 GB of RAM and generating 50,000 context tokens at startup. The gateway consolidates all servers into a single HTTP daemon, reducing MCP processes from 35 to 10, RAM usage to 1.3 GB, and startup tokens to 375 through schema deferral and response shielding, while requiring only a single configuration entry.

Detailed Analysis

A developer identified as HarshalRathore published an open-source MCP (Model Context Protocol) gateway designed to solve two compounding inefficiencies that emerge when running multiple AI coding agents (specifically pi, VS Code, and opencode) alongside a shared fleet of MCP servers. The core problem was process duplication: each agent session independently spawned its own complete set of MCP server processes, resulting in approximately 35 concurrent npm exec processes consuming roughly 4 GB of RAM. Compounding the resource waste, every new agent session loaded full tool schemas from all connected servers into context at startup, burning approximately 50,000 tokens before any user prompt was entered. The proxy addresses both issues simultaneously, reducing the process count to roughly 10, RAM consumption to approximately 1.3 GB, and startup context cost to around 375 tokens, a reduction of more than 99% in schema-loading overhead.

The architectural approach relies on three interlocking mechanisms. First, schema deferral replaces upfront schema loading with a keyword-based discovery model: the gateway exposes just six meta-tools to the agent (search, describe, invoke, and supporting utilities) rather than the full schema surface of every upstream server, and loads detailed schemas on demand only when a specific tool is actually needed. Second, response shielding applies automatic truncation and pagination to large payloads, preventing individual tool responses from flooding context with data the agent didn't explicitly request. Third, and most structurally significant, a shared HTTP daemon mode runs all upstream MCP servers behind a single systemd user service, so every agent, regardless of IDE or runtime, connects to it remotely instead of bootstrapping its own process fleet. The result is that a single ~/.pi/agent/mcp.json or .vscode/mcp.json entry replaces what previously required a dozen or more individual server configurations.
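The first two mechanisms can be illustrated with a minimal Python sketch. This is not the gateway's code: the class name DeferredCatalog, the PAGE_CHARS limit, the two sample tools, and the naive substring search are all assumptions made for illustration, and only three of the six meta-tools are shown.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable

PAGE_CHARS = 200  # hypothetical shielding limit; the real gateway's limit is not stated


@dataclass
class Tool:
    name: str
    description: str
    schema: dict[str, Any]
    handler: Callable[..., Any]


class DeferredCatalog:
    """Holds every upstream tool but exposes only small meta-tools."""

    def __init__(self, tools: list[Tool]) -> None:
        self._tools = {t.name: t for t in tools}

    def search(self, query: str) -> list[str]:
        # Naive keyword match: return tool names only, never full schemas.
        words = query.lower().split()
        return sorted(
            name for name, t in self._tools.items()
            if any(w in f"{name} {t.description}".lower() for w in words)
        )

    def describe(self, name: str) -> dict[str, Any]:
        # The full schema enters context only when explicitly requested.
        return self._tools[name].schema

    def invoke(self, name: str, args: dict[str, Any], page: int = 0) -> dict[str, Any]:
        # Response shielding: serialize the result, then return one bounded page.
        text = json.dumps(self._tools[name].handler(**args))
        start = page * PAGE_CHARS
        total_pages = max(1, -(-len(text) // PAGE_CHARS))  # ceiling division
        return {"page": page, "total_pages": total_pages,
                "text": text[start:start + PAGE_CHARS]}


# Hypothetical upstream tools, for illustration only.
catalog = DeferredCatalog([
    Tool("read_file", "Read a file from disk",
         {"type": "object", "properties": {"path": {"type": "string"}}},
         lambda path: {"path": path, "content": "x" * 1000}),
    Tool("list_issues", "List issues in a tracker",
         {"type": "object", "properties": {}},
         lambda: {"issues": []}),
])
```

In this sketch the agent would call catalog.search("file") to get candidate names, catalog.describe("read_file") to pull one schema, and catalog.invoke("read_file", {"path": "a.txt"}) to receive only the first page of a large result, asking for further pages by number if it needs them.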

The project reflects a broader and increasingly urgent problem in the MCP ecosystem: as tool counts scale, the naive approach of loading all available schemas into context at session initialization creates a fixed and substantial token tax that is paid regardless of which tools are actually used in a given session. The Model Context Protocol, which Anthropic introduced and open-sourced as a standard for connecting AI agents to external data sources and tools, was designed to reduce fragmentation in custom integration development. However, it did not inherently solve the question of how agents should manage schema visibility at scale — that responsibility has fallen to implementers. HarshalRathore's gateway represents one practical answer: treat the tool catalog as a searchable index rather than an always-loaded registry, borrowing the BM25 retrieval model (via MiniSearch) to surface relevant tools at query time rather than context-initialization time.
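To make the "searchable index" idea concrete, here is a rough sketch of BM25 ranking over tool descriptions. It is a plain-Python stand-in, not MiniSearch and not the gateway's implementation; the K1 and B values are common defaults assumed here, and the tool names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults, assumed for this sketch


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Rank tool names by the BM25 score of their descriptions against the query."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: number of descriptions containing each term.
    df = Counter(term for t in tokens.values() for term in set(t))

    scores: dict[str, float] = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(toks) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        if score > 0:
            scores[name] = score
    # Highest-scoring tools first; tools with no matching term are omitted.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical tool catalog, for illustration only.
TOOLS = {
    "github_create_issue": "Create a new issue in a GitHub repository",
    "fs_read_file": "Read the contents of a file from the local filesystem",
    "db_run_query": "Run a SQL query against the configured database",
}

ranked = bm25_rank("read file contents", TOOLS)
```

Only the names returned by a query like this would need to reach the agent's context; the schemas of unmatched tools never load.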

The RAM and process reduction is significant in resource-constrained development environments — particularly those running local inference or local tool servers on consumer hardware — but the more consequential optimization is the token one. In agentic workflows, context window consumption at session start directly constrains how much reasoning, history, and task content can fit within a single session. A 50,000-token schema overhead in a 200,000-token context window represents 25% of available capacity consumed before work begins; reducing that to 375 tokens is functionally equivalent to giving the agent a substantially larger working memory. As AI coding workflows grow more complex and tool ecosystems expand to include dozens of servers, this kind of lazy-loading pattern for tool schemas is likely to become a standard architectural expectation rather than an optional optimization.
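The arithmetic behind those figures can be checked in a few lines of Python. The three numbers come from the article; the 200,000-token window is the illustrative size used above rather than a property of any particular model, and the helper name share is made up for this sketch.

```python
CONTEXT_WINDOW = 200_000   # tokens, the illustrative window size used above
EAGER_SCHEMAS = 50_000     # startup tokens reported without the gateway
DEFERRED_SCHEMAS = 375     # startup tokens reported with the gateway


def share(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window consumed by `tokens`."""
    return tokens / window


eager = share(EAGER_SCHEMAS)                        # share of window before
deferred = share(DEFERRED_SCHEMAS)                  # share of window after
reduction = 1 - DEFERRED_SCHEMAS / EAGER_SCHEMAS    # relative drop in overhead
freed = EAGER_SCHEMAS - DEFERRED_SCHEMAS            # tokens returned to the session

print(f"eager: {eager:.1%}, deferred: {deferred:.2%}, "
      f"reduction: {reduction:.2%}, freed: {freed} tokens")
```

Under these assumptions the schema overhead falls from a quarter of the window to well under one percent, returning 49,625 tokens to the session.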

The project also surfaces a latent architectural tension in the current MCP landscape: the protocol assumes a relatively flat, always-visible tool surface, while real-world deployments are trending toward larger, more heterogeneous tool sets that benefit from hierarchical or retrieval-based discovery. Solutions like this gateway are early indicators that the ecosystem may need to develop standardized conventions — or protocol-level extensions — for tool discoverability at scale, rather than relying on individual developers to build bespoke proxying layers. Whether through community convergence around patterns like this one or through future iterations of the MCP specification itself, the problem of context-efficient tool management is becoming a first-order concern in production agentic deployments.
