The real reason coding agents fail in real repos — and it's not the model

Coding agent failures in real repositories stem from a lack of structured repository context rather than from model limitations. Agents struggle when a repository is unclear about priorities, validation requirements, architectural decisions, and project completion criteria, which forces them to guess. The author has built an experimental harness that supplies this structure through documentation files, including CLAUDE.md, architecture notes, and decision records.

Detailed Analysis

Coding agent failures in production repositories stem primarily from structural context deficits rather than model capability limitations, according to a practitioner who has tracked hundreds of such failures in real-world deployments. The core argument is that agents like Claude lack access to the implicit knowledge that human developers accumulate over time: how changes are validated, which architectural decisions are already settled, what the team counts as done, and the order in which to read an unfamiliar codebase. Without this scaffolding, agents fall back on inference and pattern-matching, producing output that is syntactically valid but misaligned with the actual project environment.
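To make the four kinds of implicit knowledge concrete, here is a minimal Python sketch that records them explicitly and renders them as a short context document. The `RepoContext` class, its field names, the section titles, and the `render_context` function are all hypothetical names chosen for illustration; the source does not describe the harness at this level of detail, and CLAUDE.md does not prescribe any particular layout.

```python
from dataclasses import dataclass, field


@dataclass
class RepoContext:
    """Hypothetical container for knowledge an agent would otherwise guess."""
    validation_commands: list = field(default_factory=list)
    architecture_decisions: list = field(default_factory=list)
    done_criteria: list = field(default_factory=list)
    reading_order: list = field(default_factory=list)


def render_context(ctx):
    """Render the non-empty sections as plain Markdown-style text."""
    # (title, items, numbered): only the reading order is an ordered list.
    sections = [
        ("How to validate a change", ctx.validation_commands, False),
        ("Architectural decisions to respect", ctx.architecture_decisions, False),
        ("Definition of done", ctx.done_criteria, False),
        ("Read these files first, in order", ctx.reading_order, True),
    ]
    lines = []
    for title, items, numbered in sections:
        if not items:
            continue  # skip sections nobody has filled in yet
        lines.append(f"## {title}")
        for i, item in enumerate(items, start=1):
            prefix = f"{i}." if numbered else "-"
            lines.append(f"{prefix} {item}")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"
```

A team could fill in a `RepoContext` once and paste the rendered text into whatever agent-facing file it already maintains. The point of the sketch is only that each item the agent would otherwise infer gets an explicit, reviewable home.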

The author's proposed remedy centers on what they describe as a "repo-level harness": a deliberate effort to externalize and structure the tacit knowledge embedded in mature codebases. The specific artifacts mentioned include CLAUDE.md files, architecture notes, test matrices, and decision records (likely referencing Architecture Decision Records, or ADRs, a practice common in engineering organizations). None of these is new to software engineering, but treating them explicitly as agent-facing context layers changes how teams would need to think about documentation. The CLAUDE.md convention in particular serves as a mechanism for giving Claude persistent, repo-specific behavioral instructions; Claude Code loads the file into context at the start of a session.
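As a rough illustration of what checking a repository for these artifacts could look like, the Python sketch below reports which context files are missing or empty. Only the CLAUDE.md file name comes from the text; the `docs/architecture.md`, `docs/test-matrix.md`, and `docs/adr` paths, along with the `missing_artifacts` function itself, are assumptions made for the example and not part of the author's harness.

```python
from pathlib import Path

# Hypothetical layout: only CLAUDE.md is named in the text. The other
# paths are illustrative assumptions, not an established standard.
EXPECTED_ARTIFACTS = {
    "agent instructions": "CLAUDE.md",
    "architecture notes": "docs/architecture.md",
    "test matrix": "docs/test-matrix.md",
    "decision records": "docs/adr",
}


def missing_artifacts(repo_root):
    """Return labels of context artifacts that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for label, relative_path in EXPECTED_ARTIFACTS.items():
        path = root / relative_path
        if path.is_dir():
            # A directory counts only if it contains at least one entry.
            present = any(path.iterdir())
        else:
            # A file counts only if it exists and is non-empty.
            present = path.is_file() and path.stat().st_size > 0
        if not present:
            missing.append(label)
    return missing
```

Calling `missing_artifacts(".")` from a repository root returns a list such as `["test matrix"]` when only that file is absent. A check like this says nothing about whether the documents are accurate or current, only whether they exist.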

This observation connects to a broader and increasingly recognized challenge in the agentic AI space: the gap between benchmark performance and real-world utility. Models like Claude may perform exceptionally well on isolated coding tasks or standardized evaluations while struggling in enterprise repositories that carry years of accumulated technical debt, idiosyncratic conventions, and undocumented constraints. The failure mode described — confident execution based on incorrect assumptions — is arguably more dangerous than outright refusal, since it can produce plausible-looking but subtly wrong outputs that pass shallow review.

The framing also reflects a maturing understanding among practitioners that deploying coding agents successfully is as much an information architecture problem as it is a model selection problem. Teams that invest in structured context — making explicit what was previously implicit — are essentially building a form of institutional memory that benefits both human and AI collaborators. The post's appeal to the community for crowdsourced failure patterns suggests that this knowledge is currently fragmented across individual teams rather than codified into widely shared best practices or tooling standards.

Anthropic has itself acknowledged this dynamic through the CLAUDE.md convention, which it has incorporated into documentation for Claude's agentic use cases. The broader trajectory points toward a future where repo hygiene and agent-readiness become intertwined concerns in software development workflows, with teams increasingly expected to maintain context artifacts not just for human onboarding but as a prerequisite for effective AI-assisted development. The practitioner's experimental harness is an early, grassroots attempt to put that principle into practice before the tooling ecosystem catches up.
