Harness design for long-running application development

**Harness design for long-running application development**: Prithvi Rajasekaran reports that context resets (fully clearing the context window and passing state through a structured handoff) outperform compaction for long-running agentic tasks, because models such as Claude Sonnet 4.5 still exhibit "context anxiety" that compaction alone can't solve. Separating generator and evaluator agents addresses another persistent problem: models reliably over-praise their own work, but external evaluators can be calibrated to provide concrete, skeptical feedback that drives iteration. A three-agent architecture (planner, generator, evaluator) with explicit grading criteria produces high-quality designs and full-stack applications across multi-hour autonomous sessions.

Detailed Analysis

Anthropic's Labs team has published a technical account of multi-agent harness engineering aimed at enabling Claude to autonomously build complete, high-quality applications over extended coding sessions. Authored by Prithvi Rajasekaran, the piece documents efforts to overcome two persistent failure modes in long-running agentic coding: context degradation as the context window fills, and self-evaluation bias in which models reliably overrate their own outputs. The solution Rajasekaran arrived at is a three-agent architecture composed of a planner, a generator, and an evaluator, each with a distinct role, operating across multi-hour autonomous sessions without human intervention.

The context degradation problem is addressed through deliberate "context resets" rather than compaction. Compaction—summarizing earlier conversation history in place so a single agent can continue on a shortened transcript—preserves continuity but does not eliminate what the article terms "context anxiety," a documented tendency in some models to prematurely wrap up work as they approach a perceived context limit. Claude Sonnet 4.5 exhibited this behavior strongly enough that compaction alone proved insufficient. Context resets, by contrast, clear the window entirely and instantiate a fresh agent, using a structured handoff artifact to transfer state and next steps. The trade-off is real: this approach introduces orchestration complexity, additional token overhead, and latency. Nevertheless, the clean-slate properties of a reset proved essential for sustaining coherent performance on complex, lengthy tasks.
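To make the reset pattern concrete, here is a minimal sketch of a harness loop in which each session starts from a fresh context seeded only by a handoff artifact. Everything in it is an illustrative assumption rather than the article's implementation: the `Handoff` fields, the `run_session` stand-in (which does no model call), and the per-session step budget are all hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Handoff:
    """Hypothetical handoff artifact: the only state that survives a reset."""
    goal: str
    completed: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Serialize to a structured block that seeds the fresh agent's context.
        return json.dumps(asdict(self), indent=2)


def run_session(context: list[str], handoff: Handoff, budget: int) -> Handoff:
    """Stand-in for one agent session; a real harness would call a model here.

    Works through next_steps until this session's step budget is spent,
    then returns an updated handoff rather than a summarized transcript.
    """
    done = list(handoff.completed)
    todo = list(handoff.next_steps)
    while todo and budget > 0:
        step = todo.pop(0)
        context.append(f"worked on: {step}")  # context grows within a session
        done.append(step)
        budget -= 1
    return Handoff(goal=handoff.goal, completed=done, next_steps=todo)


def run_with_resets(handoff: Handoff, budget_per_session: int = 2) -> tuple[Handoff, int]:
    """Repeat sessions until no steps remain, resetting context each time."""
    if budget_per_session < 1:
        raise ValueError("budget_per_session must be at least 1")
    sessions = 0
    while handoff.next_steps:
        # Context reset: a brand-new context containing only the handoff.
        context = [handoff.to_prompt()]
        handoff = run_session(context, handoff, budget_per_session)
        sessions += 1
    return handoff, sessions


if __name__ == "__main__":
    start = Handoff(goal="build app", next_steps=["schema", "api", "ui", "tests", "deploy"])
    final, sessions = run_with_resets(start, budget_per_session=2)
    print(sessions, final.completed)
```

The contrast with compaction is in what carries over. Nothing from the previous transcript is kept, so the handoff has to be complete enough for a new agent to resume on its own. That requirement is where the extra orchestration and token overhead come from.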

The self-evaluation problem is addressed by separating the agent doing the work from the agent judging it, drawing explicit inspiration from Generative Adversarial Networks. The core insight is that tuning a standalone evaluator to be skeptical is substantially more tractable than making a generator self-critical. Even a separated evaluator starts with a natural tendency toward leniency regarding LLM-generated outputs, but that disposition can be systematically corrected through prompt engineering in a way that is far harder to achieve through introspective self-critique. This dynamic is especially pronounced on subjective tasks like frontend design, where no binary correctness check exists. To make aesthetic quality gradable, Rajasekaran developed concrete rubric criteria—including design coherence and originality—that translate inherently subjective judgments into structured, repeatable grading dimensions. Both the generator and evaluator receive these criteria, giving the feedback loop a shared vocabulary to iterate against.
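The generator and evaluator loop described above can be sketched roughly as follows. The two rubric criteria are the ones named in the text. The 1-5 scale, the pass threshold, the round limit, and the `toy_generate` and `toy_evaluate` stand-ins are assumptions added so the loop runs without a model, and are not taken from the article.

```python
from dataclasses import dataclass
from typing import Callable

# Criteria named in the text; the scoring scale and threshold are assumptions.
RUBRIC = ("design coherence", "originality")


@dataclass
class Review:
    scores: dict[str, int]  # criterion -> score on an assumed 1-5 scale
    feedback: str           # concrete critique handed back to the generator


def iterate(generate: Callable[[str, str], str],
            evaluate: Callable[[str], Review],
            brief: str,
            threshold: int = 4,
            max_rounds: int = 5) -> tuple[str, int]:
    """Run generate -> evaluate until every rubric criterion meets the threshold."""
    feedback = ""
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(brief, feedback)
        review = evaluate(draft)  # a separate agent judges; no self-grading
        if all(review.scores[c] >= threshold for c in RUBRIC):
            return draft, round_no
        feedback = review.feedback
    return draft, max_rounds


# Toy stand-ins (hypothetical) so the loop is runnable end to end.
def toy_generate(brief: str, feedback: str) -> str:
    return brief + (" | revised: " + feedback if feedback else "")


def toy_evaluate(draft: str) -> Review:
    # Skeptical by default: an unrevised draft scores below the threshold.
    score = min(5, 2 + 2 * draft.count("revised"))
    return Review(scores={c: score for c in RUBRIC},
                  feedback="tighten the layout grid")


if __name__ == "__main__":
    draft, rounds = iterate(toy_generate, toy_evaluate, "landing page")
    print(rounds, draft)
```

Keeping the evaluator behind its own callable is what allows its skepticism to be tuned independently of the generator, for example by changing only the evaluator's prompt or threshold.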

The broader significance of this work lies in what it reveals about the current frontier of agentic AI engineering. The techniques described—multi-agent decomposition, structured handoff artifacts, adversarial evaluation loops, and context management through resets rather than compaction—represent a maturation of the field beyond single-agent prompting. The article notes that the broader developer community has independently converged on related patterns, such as continuous iteration cycles using hooks and scripts, suggesting these are not idiosyncratic solutions but emergent best practices for long-horizon autonomy. The explicit acknowledgment of persistent failure modes, including context anxiety and self-flattery, is notable: it frames the engineering challenge not as a matter of capability limits alone but as a systems design problem amenable to architectural solutions.

This work connects directly to a wider industry conversation about reliable agentic behavior—one that has intensified as AI labs push models into longer task horizons with less human oversight. The distinction Anthropic draws between compaction and context resets is a meaningful contribution to that conversation, highlighting that context management is not a monolithic problem and that different failure modes call for different interventions. Similarly, the use of adversarial evaluation structures to compensate for model-level leniency suggests a design philosophy in which known model biases are treated as engineering constraints to route around, rather than flaws to wait for training to fix. As autonomous coding and application generation become increasingly central use cases, these architectural patterns are likely to become foundational to how practitioners build reliable long-running AI systems.

