Effective harnesses for long-running agents

Anthropic developed a two-part harness pattern for agents working across multiple context windows: an initializer agent scaffolds the environment with a comprehensive feature list (in JSON) and progress log, while subsequent coding agents make incremental progress on single features and leave clean git commits. The key to success is combining structured artifacts—a feature file marked as passing/failing, git history with descriptive messages, and progress notes—that let new sessions quickly understand project state, alongside explicit prompting for end-to-end testing with browser automation tools. This approach prevents common failures like agents trying to one-shot entire applications or prematurely declaring projects complete.

Detailed Analysis

Anthropic has published a technical framework addressing one of the most persistent challenges in agentic AI systems: enabling Claude-powered agents to maintain coherent, productive progress across multiple context windows over extended periods. The article describes how even frontier models like Claude Opus 4.5, when deployed in a simple loop with only compaction as a memory management tool, consistently fail to complete complex, multi-session tasks such as building a production-quality web application. Two distinct failure modes were identified: agents attempting to implement too much at once, leaving subsequent sessions with half-built, undocumented features; and agents prematurely declaring a project complete upon seeing partial progress. These failures underscore a fundamental architectural gap between what context management tools like compaction can provide and what sustained, goal-directed work across sessions actually requires.

To address these shortcomings, Anthropic's engineering team developed a two-agent harness built around a clear division of labor. An initializer agent handles the first session, setting up the environment by generating a shell initialization script, a structured progress log (`claude-progress.txt`), and an initial git commit that documents the baseline state of the codebase. Every subsequent coding agent starts from this artifact trail, so it can reconstruct a working understanding of prior progress without relying on in-context memory. Critically, the initializer agent also produces a comprehensive feature requirements file (over 200 discrete, testable features in the claude.ai clone example), with each feature carrying a boolean `passes` field. This file serves as both a roadmap and a self-auditing mechanism: an agent cannot credibly declare the project complete while features are still marked as failing, and it becomes structurally difficult to quietly omit or overwrite requirements. The choice of JSON over Markdown for this file was deliberate: in Anthropic's testing, models were less prone to inadvertently altering JSON structures.
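To make the feature file concrete, here is a minimal Python sketch of how such a file could be written and updated. Only the boolean `passes` field comes from the article; the `id` and `description` fields, the file name `feature_list.json`, and all function names are hypothetical choices made for illustration, not Anthropic's actual schema.

```python
import json
from pathlib import Path


def write_feature_list(path, descriptions):
    """Initializer step: record every feature as failing until it is verified.

    The `id` and `description` keys are assumed for this sketch.
    """
    features = [
        {"id": i, "description": text, "passes": False}
        for i, text in enumerate(descriptions, start=1)
    ]
    Path(path).write_text(json.dumps(features, indent=2))
    return features


def next_failing_feature(path):
    """Coding-agent step: return the first feature not yet passing, or None."""
    features = json.loads(Path(path).read_text())
    return next((f for f in features if not f["passes"]), None)


def mark_passing(path, feature_id):
    """Flip only the `passes` flag; descriptions and ordering stay untouched."""
    features = json.loads(Path(path).read_text())
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = True
            break
    else:
        raise KeyError(f"no feature with id {feature_id}")
    Path(path).write_text(json.dumps(features, indent=2))
```

In this sketch a session would call `next_failing_feature("feature_list.json")`, work on that single feature, and call `mark_passing` only after end-to-end testing succeeds. The project counts as complete only when `next_failing_feature` returns `None`.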

The framework reflects a broader insight drawn from human software engineering practices — specifically, the discipline of clean handoffs between shifts of engineers. The requirement that each coding session end with a "clean state" analogous to a mergeable pull request — no major bugs, well-documented code, no unresolved partial implementations — mirrors standard engineering norms around code hygiene and reviewability. This framing is significant because it imposes human workflow constraints on AI agent behavior, leveraging existing best practices rather than inventing wholly novel AI-specific solutions. The git history combined with the progress log functions as a persistent external memory system, compensating for the stateless nature of each new context window.
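The external-memory idea can be sketched as a small helper that a new session runs before doing any work. This is an illustrative sketch only: the file name `claude-progress.txt` comes from the article, while the function names, the briefing format, and the use of `git status --porcelain` as a clean-state check are assumptions rather than part of Anthropic's harness.

```python
import subprocess
from pathlib import Path


def recent_commits(repo_dir, limit=10):
    """Return the latest commits, newest first, one `git log --oneline` line each."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "log", "--oneline", "-n", str(limit)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def working_tree_is_clean(repo_dir):
    """Assumed proxy for a clean handoff: no uncommitted or untracked changes."""
    result = subprocess.run(
        ["git", "-C", str(repo_dir), "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == ""


def build_briefing(progress_text, commits, max_note_lines=20):
    """Combine progress notes and commit history into one text block for a new session."""
    notes = progress_text.strip().splitlines()[-max_note_lines:]
    lines = ["Recent progress notes:"]
    lines += [f"  {note}" for note in notes] or ["  (none yet)"]
    lines.append("Recent commits:")
    lines += [f"  {commit}" for commit in commits] or ["  (no commits)"]
    return "\n".join(lines)


def load_briefing(repo_dir):
    """Read the progress log and git history from disk and build the briefing."""
    log_path = Path(repo_dir) / "claude-progress.txt"
    progress_text = log_path.read_text() if log_path.exists() else ""
    return build_briefing(progress_text, recent_commits(repo_dir))
```

A harness built this way might pass the result of `load_briefing(".")` to the agent at the start of each session and refuse to end a session until `working_tree_is_clean(".")` returns `True`.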

This work sits at the intersection of two major trends in AI development: the push toward increasingly autonomous, long-horizon AI agents and the growing recognition that raw model capability is insufficient without robust scaffolding and workflow design. As AI systems are increasingly tasked with multi-day, multi-step projects in software engineering, research, and data analysis, the challenge of maintaining coherent task state across sessions becomes a critical systems engineering problem rather than merely a prompting challenge. Anthropic's approach — decomposing the problem into environment initialization and incremental execution, anchored by structured, machine-readable artifacts — offers a replicable pattern that other developers building on the Claude Agent SDK can adopt. The accompanying quickstart code signals an intent to standardize these practices, suggesting Anthropic views long-running agent reliability as a foundational capability rather than a niche use case.

