Harness design for long-running application development - Anthropic

Detailed Analysis

Anthropic's engineering team has published detailed documentation on a three-agent harness architecture designed to enable Claude models to autonomously build complete frontend and full-stack applications over extended, multi-hour development sessions. The system divides responsibilities across three specialized agents: a planner that generates structured JSON feature specifications and task breakdowns, a generator that implements code iteratively, and an evaluator that scores outputs against criteria including design quality, originality, craft, and overall functionality. Typical runs involve between 5 and 15 iteration cycles and can extend up to four hours, a timeframe that has historically exposed critical weaknesses in simpler agent configurations. Earlier two-agent setups and approaches using Claude's Agent SDK with context compaction failed on long-horizon tasks due to overambitious one-shot attempts and incomplete state handoffs between sessions.

The architectural choices reflect hard-won lessons about how large language models degrade over extended agentic workflows. A particularly notable technique is the use of full context resets rather than compaction to combat what the team describes as "context anxiety" — a behavioral pattern in which models become increasingly cautious and conservative as they approach their context window limits. Rather than summarizing prior context, the harness uses structured initializer scripts to restart each session from a verified, functional application state, with basic server and browser tests run via Puppeteer MCP to resolve any pre-existing bugs before new work begins. The evaluator agent similarly leverages Playwright MCP to navigate live, rendered pages rather than relying solely on static code review, enabling more realistic critique. Crucially, the evaluator is explicitly calibrated to penalize generic, formulaic outputs — what the team characterizes as "AI slop" — in order to incentivize originality and higher craft standards.

The publication also introduces Managed Agents as an extension of this framework, positioned as a "meta-harness" service that decouples Claude's planning capabilities from its execution environment. By separating the cognitive layer from the sandboxed operational layer, this abstraction enables more infrastructure-agnostic scaling of long-horizon agentic work and avoids the problem of baked-in assumptions about model capabilities becoming stale as Claude's underlying abilities evolve. The commit-by-commit progress enforcement and distributed processing model that characterizes the three-agent harness directly addresses a core failure mode of earlier systems: the tendency for agents to attempt too much in a single pass, producing incomplete or broken deliverables with no recoverable intermediate state.

This work sits at the intersection of several converging trends in applied AI development: the move from single-turn inference toward sustained agentic workflows, the industrialization of multi-agent orchestration patterns, and the growing recognition that reliable AI-assisted software development requires explicit infrastructure design rather than prompt engineering alone. The harness approach acknowledges a fundamental tension in deploying frontier models for complex, real-world tasks — that model capability alone is insufficient without scaffolding that manages context, enforces incremental progress, and introduces external grounding through live environment interaction. The emphasis on human calibration of the evaluator also reflects an industry-wide acknowledgment that fully autonomous quality assessment remains an unsolved problem, particularly for subjective dimensions like design originality.

Looking forward, Anthropic's own documentation suggests that as model capabilities advance — particularly in areas like visual perception, where Claude currently struggles with browser modals and certain rendering details — the complexity of required harness infrastructure may decrease, with simpler scaffolds sufficient for tasks that currently demand elaborate multi-agent coordination. Conversely, the team notes that more capable models may simply be directed toward progressively harder tasks, keeping the demand for sophisticated harness design constant. This framing positions harness architecture not as a temporary workaround for model limitations but as a durable engineering discipline in its own right — one that will scale in sophistication alongside the models it orchestrates.

Read original article →

Detailed Analysis

Don't Miss a Deploy