I built a self-evolving agentic loop that ran 104 iterations autonomously to find questions that break every LLM — here's the architecture

A researcher built a self-evolving agentic loop system designed to autonomously find questions that cause all large language models to fail while humans succeed. The system ran 104 iterations using a bash script that spawned fresh instances per iteration, with an agent that accumulated 1,549 lines of learned lessons and a multi-agent verification architecture. Question #103 achieved 0% consensus among five parallel AI agents, revealing a riddle that all tested LLMs answered incorrectly while any human would solve correctly.

Detailed Analysis

A developer has publicly documented the construction of a self-evolving agentic loop that autonomously ran 104 iterations using Claude Code to identify questions that reliably produce incorrect answers from large language models, effectively an automated adversarial red-teaming system. The system's goal was to discover what the author calls the next "strawberry problem": deceptively simple prompts that humans answer easily but that consistently confound state-of-the-art AI. The architecture centers on a bash orchestration script (`ralph.sh`) that spawns a fresh Claude Code instance for each iteration and applies a binary stopping condition tied to a consensus threshold. Five verification agents, isolated from one another, answer each candidate question independently; when consensus among their answers falls below 10%, the loop halts. Question #103 met that condition with 0% consensus across all five agents: every agent got wrong a riddle described as trivially solvable by humans.
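The stopping condition described above can be sketched in a few lines. This is a minimal illustration, not the author's code: it assumes "consensus" means the fraction of agent answers that match the expected human answer, and the names `consensus`, `should_halt`, and `normalize` are hypothetical.

```python
from typing import Sequence

# The article states the loop halts below 10% consensus.
CONSENSUS_THRESHOLD = 0.10


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(answer.lower().split())


def consensus(answers: Sequence[str], expected: str) -> float:
    """Fraction of agent answers matching the expected answer.

    ASSUMPTION: the source does not define how consensus is scored;
    exact match after normalization is used here for illustration.
    """
    if not answers:
        raise ValueError("need at least one answer")
    target = normalize(expected)
    hits = sum(1 for a in answers if normalize(a) == target)
    return hits / len(answers)


def should_halt(answers: Sequence[str], expected: str,
                threshold: float = CONSENSUS_THRESHOLD) -> bool:
    """Binary stopping condition: halt when consensus falls below the threshold."""
    return consensus(answers, expected) < threshold
```

With five agents the score can only take the values 0%, 20%, 40%, 60%, 80%, or 100%, so under this definition a threshold of "below 10%" is satisfied only when no agent matches at all.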

The most architecturally notable feature of the system is its self-modification mechanism. The researcher agent's instruction file grows with each iteration, appending failed attempts as explicit "lessons learned." By iteration 104, this file had accumulated 1,549 lines of embedded procedural memory, a form of persistent, self-authored context that guided the agent's search strategy over time. Crucially, the system independently pivoted its methodology, moving away from surface-level character-counting tricks toward what the author characterizes as "cognitive exploits," a category of questions targeting reasoning architecture rather than tokenization quirks. This strategic shift was not explicitly programmed; it arose from the accumulation of failure signals encoded back into the agent's working context.
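A rough sketch of how such a growing instruction file could be maintained follows. The file layout, the lesson wording, and the function name `append_lesson` are all assumptions made for illustration; the source only says that failed attempts are appended as lessons.

```python
from pathlib import Path


def append_lesson(path: Path, iteration: int, question: str, score: float) -> int:
    """Append one failed attempt as a lesson and return the new line count.

    ASSUMPTION: one lesson per line in a plain-text instruction file.
    """
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    # Make sure the new lesson starts on its own line.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    # Flatten the question so each lesson occupies exactly one line.
    one_line = " ".join(question.split())
    lesson = (
        f'- Iteration {iteration}: "{one_line}" '
        f"scored {score:.0%} consensus; avoid this pattern.\n"
    )
    updated = existing + lesson
    path.write_text(updated, encoding="utf-8")
    return len(updated.splitlines())
```

Because each fresh instance starts without memory of earlier runs, a file like this is the only channel through which earlier failures can shape later attempts.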

This project sits at the intersection of two significant research threads in modern AI development. The first is agentic evaluation design: the challenge of building systems that can reliably assess AI capabilities and failure modes without constant human supervision. Anthropic's own published guidance on building effective agents emphasizes simple, composable patterns, and several patterns in that spirit (evaluator-optimizer loops, multi-agent verification, and resumable state machines) appear directly in this architecture. The second thread is adversarial benchmark generation, a problem that has grown acute as standard benchmarks become saturated. The "strawberry problem" reference is deliberate: several widely used LLMs were famously caught miscounting the letter "r" in "strawberry," a failure that revealed gaps between apparent capability and basic symbolic reasoning. Systematizing the search for such gaps, rather than discovering them anecdotally, represents a meaningful methodological advance.

The broader significance of this project lies in what it demonstrates about the accessibility of agentic infrastructure. The entire system was built by an individual developer using publicly available tooling, a YAML-based state machine for crash recovery, and Claude Code as the core agent runtime. The multi-agent verification layer, in which five isolated agents independently answer questions and a separate verifier scores consensus, mirrors patterns from formal red-teaming pipelines typically deployed by well-resourced AI safety teams. That such an architecture can now be assembled and run to 104 autonomous iterations by a solo developer signals a meaningful democratization of adversarial AI evaluation. It also raises pointed questions about the nature of LLM reasoning failures: if a simple riddle can achieve 0% consensus across five frontier-model agents while remaining trivially clear to humans, the failure looks systematic rather than random, pointing toward structural blind spots in how these models process certain categories of inference.
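To make the crash-recovery idea concrete, here is a minimal sketch of a resumable state machine. It is not the author's implementation: it stores state as JSON from the Python standard library in place of the YAML the article mentions, and the phase names, the state fields, and the functions `load_state`, `save_state`, and `advance` are hypothetical.

```python
import json
import os
from pathlib import Path

# ASSUMPTION: three phases per iteration; the source does not name them.
PHASES = ("generate", "verify", "record")


def load_state(path: Path) -> dict:
    """Resume from the saved state, or start fresh if none exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"iteration": 0, "phase": PHASES[0]}


def save_state(path: Path, state: dict) -> None:
    """Write to a temporary file, then rename, so a crash never leaves a half-written state file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    os.replace(tmp, path)


def advance(state: dict) -> dict:
    """Move to the next phase, or to the first phase of the next iteration."""
    idx = PHASES.index(state["phase"])
    if idx + 1 < len(PHASES):
        return {**state, "phase": PHASES[idx + 1]}
    return {"iteration": state["iteration"] + 1, "phase": PHASES[0]}
```

Saving after every transition means a restarted orchestrator can call `load_state` and continue from the last completed phase instead of repeating the whole iteration.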
