Detailed Analysis
StrongDM's three-person engineering team (Justin McCarthy, Jay Taylor, and Navan Chauhan) has built what observers are calling a "Dark Factory" or "Level 5" AI development environment, in which production-grade security software is written, reviewed, and shipped entirely by AI agents, with no human writing or reviewing a single line of code. The team's "Software Factory" methodology, launched in July 2025, enforces three inviolable rules: no human-written code, no human code reviews, and a minimum of $1,000 in daily token spend per engineer to ensure sufficiently deep reliance on AI. CTO Justin McCarthy frames any instance of a human performing a coding task as a "failure of imagination." This is a deliberate cultural posture: the question "Why am I doing this?" is a standing prompt to delegate to an agent by default. By October 2025, three months into the experiment, the team was demoing working prototypes to outside observers including developer Simon Willison, who labeled the approach a qualitative leap beyond the "spicy autocomplete" and human-AI pairing that characterize most teams' AI usage.
The technical architecture underpinning StrongDM's factory is specifically engineered to prevent the failure modes that arise when agents can see their own evaluation criteria. Rather than conventional test suites, the team uses external "scenarios": end-to-end user stories stored outside the codebase, where the coding agents cannot see them. The scenarios simulate real-world behavior and prevent agents from gaming validation through techniques like fake assertions. Agents use models including Anthropic's Opus 4.5 and, following the October 2025 demonstrations, GPT 5.2, and operate within a "Digital Twin Universe" that clones live services into sandboxed environments. Swarms of simulated test agents run these scenarios autonomously, iterating until convergence without human intervention. The infrastructure costs are substantial, approximately $20,000 per engineer per month in inference alone (roughly what the $1,000 daily minimum adds up to over about 20 working days), but the team treats this as a reasonable price for the velocity and headcount compression it enables.
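The holdout idea can be illustrated with a minimal Python sketch. This is not StrongDM's implementation: every name here (Scenario, run_until_converged, toy_agent, the three scenario names) is hypothetical, and the feedback channel (the agent learns only which scenarios failed, never how they are checked) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for "the software the agent produced":
# a map from capability name to whether it currently works.
Build = Dict[str, bool]


@dataclass(frozen=True)
class Scenario:
    """An end-to-end user story, stored where the coding agent cannot read it."""
    name: str
    check: Callable[[Build], bool]


def failing(build: Build, scenarios: List[Scenario]) -> List[str]:
    """Return the names of the scenarios the build does not satisfy."""
    return [s.name for s in scenarios if not s.check(build)]


def run_until_converged(
    agent_step: Callable[[Build, List[str]], Build],
    scenarios: List[Scenario],
    threshold: float = 1.0,
    max_rounds: int = 10,
) -> Tuple[Build, int, float]:
    """Alternate agent edits and scenario runs until enough scenarios pass."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    build: Build = {}
    feedback: List[str] = []
    score = 0.0
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Assumption: the agent sees only failure names, not the check code.
        build = agent_step(build, feedback)
        feedback = failing(build, scenarios)
        score = (len(scenarios) - len(feedback)) / len(scenarios)
        if score >= threshold:
            break
    return build, rounds, score


def make_scenario(name: str) -> Scenario:
    # Toy check: the scenario passes once the named capability works.
    return Scenario(name, lambda build: build.get(name, False))


SCENARIOS = [make_scenario(n) for n in ("login", "grant_access", "revoke_access")]


def toy_agent(build: Build, feedback: List[str]) -> Build:
    """Stand-in for a coding agent: fixes one reported failure per round."""
    new_build = dict(build)
    if feedback:
        new_build[feedback[0]] = True
    return new_build


if __name__ == "__main__":
    final_build, rounds, score = run_until_converged(toy_agent, SCENARIOS)
    print(rounds, score, final_build)
```

With the toy agent, the loop converges in four rounds: the first round produces an empty build, and each later round fixes one failing scenario. Because the checks live outside anything the agent can edit, an agent cannot pass by rewriting an assertion; it can only change the build.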
The article situates StrongDM's experiment within a broader, industry-wide shift occurring simultaneously at both Anthropic and OpenAI. At Anthropic, Claude Code, the company's agentic coding tool, was itself built predominantly by Claude Code: 90% of its codebase was generated by the model, and that figure is converging toward 100%. Boris Cherny, an engineer on the project, has described his role as having shifted entirely away from writing code toward specification, direction, and judgment. Anthropic has estimated that the entire company is approaching a state of fully AI-generated code as of early 2026. OpenAI, meanwhile, reports that Codex 5.3 was the first frontier model to be meaningfully self-referential in its own construction: prior model versions analyzed training logs and flagged failing tests during the build process, contributing to a 25% speed improvement and a 93% reduction in wasted tokens. These are not isolated anecdotes but parallel data points suggesting a structural inflection point in how advanced software is produced.
What distinguishes StrongDM's case from the broader trend is not the use of AI but the completeness and institutional commitment of its implementation. Most engineering organizations using AI tools remain at what Willison categorizes as Levels 2 or 3, where AI assists humans who still author and review code. StrongDM has eliminated the human from the code-production loop entirely, which raises meaningful questions about accountability, auditability, and trust. Those questions are beginning to draw academic scrutiny, including analysis from Stanford Law. The team's answer to the reliability problem is to replace human review with scenario-based behavioral validation run by agents themselves. That is a bet that agent reliability, sufficiently tested against criteria the agents cannot see, is a more scalable trust mechanism than human oversight. Whether that bet holds as the systems grow in complexity and stakes remains the central open question.
The convergence of StrongDM's factory model, Anthropic's self-building Claude Code, and OpenAI's self-improving Codex points toward a near-term future in which the bottleneck in software production is no longer execution but specification: the ability to define precisely what correct behavior looks like before an agent ever writes a line. The engineering role is not disappearing but is undergoing a fundamental reorientation, from craftsperson to architect and from implementer to evaluator. StrongDM's team of three, shipping production security infrastructure at high velocity for roughly $20,000 per engineer per month in inference, may represent the earliest stable prototype of what engineering organizations will broadly resemble within a few years: not a curiosity, but a leading indicator.