
I built Tarn — API tests Claude Code can write, run, and debug end-to-end (open source, MCP server included)

Reddit · nazarkk · April 26, 2026
Tarn is an open-source API testing tool designed to help Claude Code and other AI agents write, run, and debug tests end-to-end by providing structured JSON output with stable failure categories instead of human-readable error messages. Built as a CLI-first tool with an accompanying MCP server, Tarn uses YAML-formatted test files and returns detailed failure information including offending requests and responses, enabling agents to reliably identify and fix issues rather than guessing from unstructured stderr. The project was developed collaboratively through paired sessions with Claude Code, with the final format and tool design refined based on what the model could generate without errors.

Detailed Analysis

Tarn, an open-source CLI-first API testing tool developed by Nazar Kalytiuk in collaboration with Claude Code, addresses a structural deficiency in how AI coding agents interpret test failure output. The tool was born from a concrete frustration: agents like Claude Code, Cursor, and Windsurf could not reliably distinguish between categorically different failure types (a 404, a connection refusal, and an assertion mismatch on a changed response body) because all three surfaced as human-readable stderr prose that the model had to guess at. Tarn's solution is to replace that prose with structured JSON output carrying a stable `failure_category` field, an `error_code`, the full offending request and response, and a set of remediation hints. Failure data thus becomes something an agent can branch on programmatically rather than infer from text, which makes automated debugging loops more reliable.
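To make "branching on a category" concrete, here is a minimal Python sketch of how an agent-side script might consume such a report. Only the `failure_category` and `error_code` field names come from the post; the `failures` and `step` keys, the category values, and the action strings are hypothetical and chosen purely for illustration, so Tarn's real output may look different.

```python
import json

# Hypothetical report shape. Only `failure_category` and `error_code` are named
# in the post; the "failures" and "step" keys and every value are assumptions.
SAMPLE_REPORT = """
{
  "failures": [
    {"step": "create user", "failure_category": "connection_error",
     "error_code": "connection_refused"},
    {"step": "get user", "failure_category": "assertion_failed",
     "error_code": "body_mismatch"}
  ]
}
"""

# Assumed category names mapped to the next action an agent might take.
ACTIONS = {
    "connection_error": "check that the service is running",
    "assertion_failed": "compare the expected and actual response",
}

FALLBACK = "inspect the request and response manually"


def plan_actions(report_json: str) -> list[tuple[str, str]]:
    """Return (step, action) pairs by looking up each failure's category."""
    report = json.loads(report_json)
    planned = []
    for failure in report.get("failures", []):
        category = failure.get("failure_category", "")
        # A dictionary lookup replaces any guessing from free-form stderr text.
        planned.append((failure.get("step", "?"), ACTIONS.get(category, FALLBACK)))
    return planned


if __name__ == "__main__":
    for step, action in plan_actions(SAMPLE_REPORT):
        print(f"{step}: {action}")
```

The point of the sketch is the lookup itself: once the category is a stable string, the decision about what to try next is a table lookup with an explicit fallback, not an interpretation of prose.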

The tool's companion MCP server, `tarn-mcp`, is the centerpiece of its agentic integration strategy. By exposing the full write-run-debug loop as discrete MCP tools — including `tarn_run`, `tarn_validate`, `tarn_fix_plan`, `tarn_inspect`, and `tarn_rerun_failed` — Tarn allows Claude Code to drive API test workflows entirely through structured tool calls rather than shell output parsing. This is a meaningful architectural distinction: tool-call interfaces give agents deterministic, typed responses and eliminate the ambiguity that arises when parsing freeform terminal output. The `.tarn.yaml` test format itself was deliberately chosen over a custom DSL because Claude Code already has strong priors on YAML syntax, reducing the probability of malformed generation and the need for corrective prompt engineering around format edge cases.
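The write-run-debug loop described above can be sketched in Python as a sequence of tool calls. The tool names are the ones listed in the post; everything else is an assumption made for illustration, including the `call_tool` callable, the `path` argument, and the `valid` and `passed` response fields. A fake in-memory server stands in for `tarn-mcp` so the sketch runs on its own.

```python
from typing import Callable

# Hypothetical transport: takes a tool name and arguments, returns a dict.
ToolCall = Callable[[str, dict], dict]


def debug_loop(call_tool: ToolCall, path: str, max_rounds: int = 3) -> bool:
    """Return True once the tests pass, False if validation fails or rounds run out."""
    # Assumed response shape: {"valid": bool}
    if not call_tool("tarn_validate", {"path": path}).get("valid", False):
        return False
    # Assumed response shape: {"passed": bool}
    result = call_tool("tarn_run", {"path": path})
    for _ in range(max_rounds):
        if result.get("passed", False):
            return True
        # A real agent would edit the test file or the service here,
        # guided by whatever the fix plan contains.
        call_tool("tarn_fix_plan", {"path": path})
        result = call_tool("tarn_rerun_failed", {"path": path})
    return result.get("passed", False)


def make_fake_server(failures_before_pass: int) -> ToolCall:
    """Stand-in for tarn-mcp: fails a fixed number of runs, then passes."""
    state = {"remaining": failures_before_pass}
    log: list[str] = []

    def call_tool(name: str, args: dict) -> dict:
        log.append(name)
        if name == "tarn_validate":
            return {"valid": True}
        if name in ("tarn_run", "tarn_rerun_failed"):
            passed = state["remaining"] == 0
            if not passed:
                state["remaining"] -= 1
            return {"passed": passed}
        if name == "tarn_fix_plan":
            return {"steps": []}
        raise ValueError(f"unknown tool: {name}")

    call_tool.log = log  # exposed so callers can inspect the call sequence
    return call_tool


if __name__ == "__main__":
    server = make_fake_server(failures_before_pass=1)
    print(debug_loop(server, "example.tarn.yaml"), server.log)
```

Because every step returns a dictionary, the loop never parses terminal text; the control flow depends only on typed fields, which is the architectural distinction the post draws.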

The development process described by Kalytiuk offers a revealing case study in human-AI co-design. The failure taxonomy schema, the assertion DSL, and the MCP tool surface were all iterated through paired Claude Code sessions — a workflow where the developer proposed goals, evaluated model-generated options, and pushed back on designs that felt incorrect, converging on decisions grounded in what the model could generate reliably. The project's `CLAUDE.md` file, described as institutional memory written for an LLM, encodes hard-won constraints like never suppressing Clippy warnings and always verifying install command URLs — behavioral guardrails that emerged from observed model failure modes during development. This kind of explicit, codified agent instruction set represents an emerging practice in LLM-assisted software projects, where the human's primary contribution shifts toward defining the boundaries of acceptable model behavior rather than writing implementation directly.

Tarn enters a space already occupied by tools like Hurl and Bruno but carves a deliberately narrow niche. It explicitly declines to pursue Hurl's XPath support, a full filter DSL, OpenAPI-first generation, or a GUI, positioning itself instead as infrastructure specifically optimized for the agentic test loop. This reflects a broader pattern in the AI tooling ecosystem: purpose-built instruments designed not for human ergonomics but for the operational characteristics of LLM agents, which benefit from machine-readable outputs, deterministic branching structures, and minimal ambiguity in failure signaling. The project's MIT license and single static binary distribution lower adoption friction for both individual developers and CI/CD pipelines, suggesting Kalytiuk is prioritizing ecosystem reach over monetization in the near term.

The release of Tarn coincides with a wider industry movement toward MCP-native tooling as a standard integration layer for AI agents. Platforms like TestSprite have demonstrated that MCP-integrated testing tools can substantially improve automated test pass rates — citing improvements from 42% to 93% in some benchmarks — by giving agents structured feedback loops rather than opaque failure signals. Tarn's approach generalizes this principle to the open-source, self-hosted context, making structured agentic test debugging accessible without requiring proprietary cloud infrastructure. As Claude Code and similar agentic coding tools continue to mature, the demand for machine-legible tooling interfaces is likely to accelerate, and Tarn's design philosophy — structure over narration, taxonomy over booleans, tool calls over shell parsing — positions it as an early example of what that category of infrastructure may look like at scale.
