How I ran a 9-hour autonomous /goal session with Claude Code and what it taught me about AI agents

I just wrapped up a 9 h 27 min session where Claude Code chained 4 self-paced /goal commands and produced 45 commits, 14 259 lines of code/docs, 4.16 million rows of data ingested from public registries, and one fairly long retex. Here's what happened, how I

Detailed Analysis

Claude Code's `/goal` command let a developer run a 9-hour 27-minute autonomous coding session that produced 45 commits and over 14,000 lines of code and documentation. The session repaired horos55, a Go data orchestration system that pulls from public registries including data.gouv.fr, INSEE, EBA, GLEIF, and GeoNames. The `/goal` mechanism sets a session-scoped stop condition that an LLM evaluates by reading the session transcript: Claude cannot end its turn until the specified success criteria are verifiably met. In this case, the developer defined a strict 4,000-character contract requiring exactly 14 fixes, 3 acknowledgments of stale sources, and 1 abandonment across 22 previously audited failing adapters. The session self-organized into five successive passes, spawning on average one subagent per pass, and lifted the fix rate from 29% to 100% through targeted failure analysis at each stage.
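The counting side of such a contract can be sketched in Go. This is only a deterministic illustration: in the session described, the stop condition was evaluated by an LLM reading the transcript, not by code like this. The `Outcome`, `Contract`, and `Check` names, the `fix` and `abandon` labels, and the `adapter-NN` placeholders are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Outcome is a hypothetical per-adapter classification label.
type Outcome string

const (
	Fixed    Outcome = "fix"
	AckStale Outcome = "ack_stale"
	Abandon  Outcome = "abandon"
)

// Contract holds the exact count required for each outcome.
type Contract struct {
	Want map[Outcome]int
}

// Check tallies the ledger and returns one message per unmet requirement.
// An empty result means the contract is satisfied.
func (c Contract) Check(ledger map[string]Outcome) []string {
	// Sort adapter names so the messages come out in a stable order.
	names := make([]string, 0, len(ledger))
	for name := range ledger {
		names = append(names, name)
	}
	sort.Strings(names)

	got := map[Outcome]int{}
	var problems []string
	for _, name := range names {
		o := ledger[name]
		if _, ok := c.Want[o]; !ok {
			// Anything outside the agreed taxonomy is rejected, not counted.
			problems = append(problems, fmt.Sprintf("%s: outcome %q is outside the taxonomy", name, o))
			continue
		}
		got[o]++
	}
	for _, o := range []Outcome{Fixed, AckStale, Abandon} {
		if got[o] != c.Want[o] {
			problems = append(problems, fmt.Sprintf("%s: have %d, want %d", o, got[o], c.Want[o]))
		}
	}
	return problems
}

func main() {
	// Counts taken from the contract described in the text.
	contract := Contract{Want: map[Outcome]int{Fixed: 14, AckStale: 3, Abandon: 1}}

	// Placeholder adapters; the real adapter names are not given in the source.
	ledger := map[string]Outcome{}
	for i := 1; i <= 18; i++ {
		name := fmt.Sprintf("adapter-%02d", i)
		switch {
		case i <= 14:
			ledger[name] = Fixed
		case i <= 17:
			ledger[name] = AckStale
		default:
			ledger[name] = Abandon
		}
	}

	if problems := contract.Check(ledger); len(problems) == 0 {
		fmt.Println("contract met")
	} else {
		for _, p := range problems {
			fmt.Println("not met:", p)
		}
	}
}
```

An exact-count check like this rejects any result that falls outside the three agreed labels, so a closed taxonomy has no place for a case it did not anticipate.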

The most technically significant finding from the session concerns the gap between audit-time verification and runtime correctness. An earlier session had Claude perform a "deep audit" using WebFetch to verify HTTP 200 responses on candidate URLs and recommend fixes — an approach that proved partially misleading. Roughly 30% of those recommendations introduced new failures because column headers had silently drifted, purported fix URLs returned HTML info pages rather than actual data files, and one audit had only checked a JSON metadata endpoint without following the downstream ZIP download chain. The developer's conclusion — that the real test requires downloading a sample, parsing it, and mapping columns — highlights a systematic limitation in how LLM-driven audits can conflate availability with correctness. The second and third passes retroactively applied this more expensive but accurate verification method, recovering adapters the initial audit had misclassified as fixed.

The session also showed sound agentic behavior in edge cases. One subagent explicitly refused to ship an incomplete integration for WHO ATC data, which required UMLS authentication and a complex RRF parser, and instead emitted a documented `ack_stale` result. Declining to produce an artifact that compiles but does not work matters in practice: it prevented a runtime failure that would have needed separate debugging. The persistent SQLite ledger served as a live source of truth, retested every minute with actual row counts, and provided the grounding that made such honest classifications possible. Without verifiable external state, the stop hook's taxonomy would have been much harder to enforce.

The session revealed meaningful friction points in the current design of autonomous multi-pass agentic workflows. The stop hook's strict taxonomy of fix/stale/abandon did not anticipate a fourth real-world category — adapters blocked by authentication walls, geoblocking, or paid licensing — leading to four stop-hook rejections and a fifth pass that required creative sourcing (GitHub mirrors, regional CSV aggregators, equivalent regulatory datasets) to satisfy the literal numeric contract. The developer acknowledges bending the definition of "same dataset" to close the gap, which surfaces a broader design tension: stop conditions precise enough to prevent false completions are also rigid enough to create adversarial pressure to redefine success criteria. Additionally, commit interleaving from a parallel sub-project running in the same repository introduced four unrelated commits, pointing to a practical risk in running concurrent `/goal` sessions against a shared codebase.

The session reflects a broader trend in AI development toward agentic systems that operate over extended time horizons with minimal human intervention, replacing single-pass code generation with iterative, self-correcting loops. The architecture here — structured goal contracts, subagent specialization, persistent external state, and taxonomic success criteria — represents a working instantiation of patterns that AI labs including Anthropic have theorized in their agent design frameworks. The iterative audit strategy outperforming exhaustive single-pass audits (three short passes beating one long one) aligns with emerging evidence that agentic reliability improves more from feedback-loop architecture than from raw model capability. The 83% documentation-to-code ratio also surfaces a practical cost of verbose agentic workflows: audit artifacts that serve future sessions require deliberate verbosity gating to remain useful rather than theatrical.


