I replicated Anthropic’s Generator-Evaluator harness to build a website through 12 adversarial AI iterations - here’s the result and what I learned

A developer replicated Anthropic's Generator-Evaluator architecture to build a marketing website for their Mnemo project through 12 iterative cycles, where a Planner initiated the process and a Generator-Evaluator pair refined the output by communicating through files and live browser testing. The system evolved from generic AI design patterns to a distinctive "Terminal Noir" aesthetic with progressive refinements, demonstrating that the architectural constraints and feedback loops determine whether the result is distinctive or generic.

Detailed Analysis

Anthropic's recently published multi-agent architecture, a Generator-Evaluator harness inspired by Generative Adversarial Networks (GANs), has been independently replicated by a developer who used it to produce a fully functional marketing website through twelve automated adversarial iterations, with zero lines of manually written code. The developer implemented the architecture using Kiro CLI, structuring the pipeline as a one-time Planner feeding a looping Generator-Evaluator pair that communicated exclusively through shared files: a spec document and an evaluation report. Each agent invocation started with a clean context, a deliberate design choice that eliminated compounding errors and what the developer termed "context anxiety." The live output, a marketing site for an AI memory tool called Mnemo, is publicly accessible and documents a progression from generic AI output to a distinctive "Terminal Noir" aesthetic featuring IBM Plex Mono typography, amber-on-black color schemes, grain textures, and scanline effects.
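The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the developer's Kiro CLI setup: the file names spec.md and evaluation.md, the build_NN.txt snapshots, and the run_harness function are hypothetical, and each agent is modeled as a plain callable that takes a prompt and returns text.

```python
from pathlib import Path
from typing import Callable

# Hypothetical file names: the source only says the agents share
# a spec document and an evaluation report.
SPEC_FILE = "spec.md"
REPORT_FILE = "evaluation.md"

# Stand-in for one fresh agent invocation: prompt in, text out.
Agent = Callable[[str], str]


def run_harness(
    workdir: Path,
    brief: str,
    planner: Agent,
    generator: Agent,
    evaluator: Agent,
    iterations: int = 12,
) -> list[str]:
    """Run the Planner once, then loop Generator and Evaluator via shared files."""
    workdir.mkdir(parents=True, exist_ok=True)
    spec_path = workdir / SPEC_FILE
    report_path = workdir / REPORT_FILE

    # The Planner runs exactly once and writes the spec.
    spec_path.write_text(planner(brief), encoding="utf-8")

    history: list[str] = []
    for i in range(1, iterations + 1):
        # Every prompt is rebuilt from files alone, so no conversational
        # state carries over from one invocation to the next.
        spec = spec_path.read_text(encoding="utf-8")
        feedback = report_path.read_text(encoding="utf-8") if report_path.exists() else ""

        output = generator(f"SPEC:\n{spec}\n\nLAST REPORT:\n{feedback}")
        (workdir / f"build_{i:02d}.txt").write_text(output, encoding="utf-8")

        report = evaluator(f"SPEC:\n{spec}\n\nBUILD:\n{output}")
        report_path.write_text(report, encoding="utf-8")
        history.append(report)

    # No early exit: every iteration runs regardless of intermediate quality.
    return history
```

Because the only channel between agents is the pair of files, any single invocation can be replayed or inspected in isolation, which is one plausible reason a clean context per call limits compounding errors.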

The technical implementation reveals several design decisions that distinguish this approach from standard single-shot or retry-based AI generation. The Evaluator agent used Playwright MCP to interact directly with the live deployed site (navigating pages, clicking elements, and resizing viewports) rather than performing static code review. This behavioral testing layer surfaced visual and accessibility bugs that code-only analysis would systematically miss. The harness also incorporated frontend design heuristics that explicitly penalized common AI generation defaults such as Inter font usage, purple gradients, and generic card layouts, which made aesthetics part of the adversarial feedback loop. The result was a system that applied sustained creative pressure rather than optimizing for functional adequacy, with all twelve iterations running to completion regardless of intermediate quality.
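One way to picture the penalty heuristics is as a small rule table applied to the generated stylesheet or markup. The sketch below is an assumption-heavy illustration: the rule names, regular expressions, weights, and the score_defaults function are all hypothetical, and the source does not say how the actual harness expressed these heuristics. Generic card layouts are left out because they call for visual judgment rather than pattern matching.

```python
import re

# Illustrative rules only: names, patterns, and weights are assumptions,
# not the harness's real heuristics.
PENALTIES = {
    # Inter declared anywhere in a font-family value.
    "inter_font": (
        re.compile(r"font-family\s*:[^;]*\bInter\b", re.IGNORECASE),
        2,
    ),
    # A CSS gradient naming a purple-ish color, or a Tailwind gradient stop
    # such as from-purple-500 or to-violet-600.
    "purple_gradient": (
        re.compile(
            r"(?:linear|radial|conic)-gradient\([^)]*\b(?:purple|violet|indigo)\b"
            r"|\b(?:from|via|to)-(?:purple|violet|indigo)-\d{2,3}\b",
            re.IGNORECASE,
        ),
        2,
    ),
}


def score_defaults(source: str) -> tuple[int, list[str]]:
    """Return the total penalty and the names of the rules that fired."""
    hits = [name for name, (pattern, _) in PENALTIES.items() if pattern.search(source)]
    total = sum(PENALTIES[name][1] for name in hits)
    return total, hits
```

An Evaluator could fold a score like this into its report so that the Generator sees an explicit cost for falling back on familiar defaults, alongside whatever the browser-based checks find.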

The progression documented across twelve iterations illustrates a qualitative difference between multi-agent iterative systems and conventional prompt-and-response generation. Iteration one produced output the developer described as functional but forgettable, a predictable baseline for single-shot AI generation. By iteration four, the Generator had executed a creative pivot to the Terminal Noir aesthetic, a shift the developer notes would be unlikely to emerge from single-pass generation. Iterations five through twelve then concentrated on refinement: accessibility improvements, responsive layout corrections, and reduced-motion support. The entire process ran over three hours and twenty minutes, producing a result the developer characterizes as genuinely distinctive, a standard that is difficult to reach through conventional AI-assisted development workflows.

The experiment speaks directly to a growing debate in AI development over the relative importance of model capability versus system architecture. The developer's central conclusion, that the harness rather than the model determines whether output constitutes "AI slop" or something distinctive, aligns with the broader industry movement toward agentic frameworks and multi-model orchestration. Anthropic's original harness design publication is itself part of a wider effort to formalize patterns for long-running, high-autonomy AI applications, and independent replications like this one serve as practical validation of those patterns outside controlled research environments. The use of adversarial structure, borrowed explicitly from GAN methodology, suggests that techniques developed for generative model training may translate productively into inference-time workflow design.

This replication also highlights the increasing accessibility and composability of agentic AI tooling. The developer combined Kiro CLI, Playwright MCP, Next.js, Tailwind, Framer Motion, and TypeScript into a functioning research-quality harness without apparent organizational resources, publishing both the live result and full documentation on GitHub. As Anthropic and other frontier labs continue publishing architectural patterns alongside model releases, this kind of rapid community replication and iteration is likely to accelerate, effectively distributing the development of agentic best practices across a broad practitioner base rather than concentrating it within research institutions. The experiment represents an early instance of what may become a standard workflow: using published multi-agent architectural primitives as scaffolding for domain-specific applications, with the practitioner's role shifting from code author to harness designer.
