I trust Sonnet as my daily driver now — better code, one-third the tokens. Here's how.

A developer built a structured workflow protocol called FRAGUA that uses Claude Sonnet for primary coding work and reserves Opus for critical design and code review phases. Over a one-week test, the author reports roughly 3x the shipped output and a reduction in token consumption of about 60%. The takeaway is that process structure and a clean separation of concerns, rather than model selection alone, determine efficiency in AI-assisted development.

Detailed Analysis

A developer working with Cloudflare Workers and TypeScript published findings from a week-long experiment in which switching from Claude Opus to Claude Sonnet as the primary coding model — while restructuring the surrounding workflow — produced roughly three times the shipped output at one-third the token cost. The author had long defaulted to Opus for complex tasks, treating Sonnet as unreliable, but a significant budget spike triggered by the release of Opus 4.7 forced a re-examination of the underlying workflow rather than the model choice itself. The central conclusion is that Sonnet's previous underperformance was not a capability deficit but a context problem: when design, exploration, implementation, and debugging were tangled in a single long conversation thread, Sonnet stumbled while Opus compensated through sheer reasoning power. Separating those concerns removed the condition that made Opus necessary in the first place.

The workflow the author developed, called FRAGUA (Spanish for "forge"), is a four-phase protocol structured around two named subagent roles — CRITICON and MANAYER. CRITICON is an Opus instance with a single mandate: find flaws, returning a verdict of SHIPPABLE or NEEDS REVISION with findings tiered by severity. It runs first on the written plan, iterating across two to three rounds on the same named instance so that context accumulates rather than resets, and then again on the finished implementation to catch race conditions, resource leaks, and edge cases. MANAYER separates execution into isolated roles — a coder agent working from the CRITICON-approved spec with a clean context window, and a reviewer agent auditing the output — ensuring no compounding conversation history contaminates either pass. The critical insight is sequencing: by the time Sonnet receives a task as the coder, Opus has already validated the architecture across multiple rounds. Sonnet is not being asked to reason about design; it is executing a precise, pre-validated specification, which is where its speed and efficiency advantages are cleanest.
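The sequencing described above can be sketched in TypeScript, the language of the author's stack. This is one possible reading of the protocol, not the author's implementation: every type and function name (`Critic`, `critiqueLoop`, `runFragua`, `makeStubCritic`) is hypothetical, the severity tier names are assumed, and model calls are stubbed as plain synchronous functions so the sketch runs on its own. Real model calls would be asynchronous.

```typescript
// Sketch only: every name here is hypothetical, and model calls are
// stubbed as synchronous functions so the example runs on its own.
type Verdict = "SHIPPABLE" | "NEEDS REVISION";

interface Finding {
  severity: "high" | "medium" | "low"; // assumed tier names
  note: string;
}

interface Critique {
  verdict: Verdict;
  findings: Finding[];
}

// One critic instance is reused across rounds so its context accumulates.
interface Critic {
  review(artifact: string): Critique;
}

type Reviser = (artifact: string, findings: Finding[]) => string;

interface LoopResult {
  artifact: string;
  rounds: number;
  shippable: boolean;
}

// Resubmit the artifact to the same critic until it passes or rounds run out.
function critiqueLoop(
  critic: Critic,
  artifact: string,
  revise: Reviser,
  maxRounds = 3,
): LoopResult {
  let current = artifact;
  for (let round = 1; round <= maxRounds; round++) {
    const { verdict, findings } = critic.review(current);
    if (verdict === "SHIPPABLE") {
      return { artifact: current, rounds: round, shippable: true };
    }
    if (round < maxRounds) current = revise(current, findings);
  }
  return { artifact: current, rounds: maxRounds, shippable: false };
}

interface Agents {
  criticon: Critic; // Opus in the article
  coder: (spec: string) => string; // Sonnet in the article
  reviewer: (spec: string, code: string) => Finding[];
  revisePlan: Reviser;
  reviseCode: Reviser;
}

function runFragua(plan: string, agents: Agents): LoopResult {
  // Design critique on the written plan.
  const design = critiqueLoop(agents.criticon, plan, agents.revisePlan);
  if (!design.shippable) return design;
  // Isolated execution: coder and reviewer see the approved spec, not the chat.
  const code = agents.coder(design.artifact);
  const audit = agents.reviewer(design.artifact, code);
  const audited = audit.length > 0 ? agents.reviseCode(code, audit) : code;
  // Implementation critique on the finished code.
  return critiqueLoop(agents.criticon, audited, agents.reviseCode);
}

// Stub critic for demonstration: passes anything containing "[fixed]".
function makeStubCritic(): Critic & { seen: string[] } {
  const seen: string[] = [];
  return {
    seen,
    review(artifact: string): Critique {
      seen.push(artifact);
      return artifact.indexOf("[fixed]") !== -1
        ? { verdict: "SHIPPABLE", findings: [] }
        : { verdict: "NEEDS REVISION", findings: [{ severity: "high", note: "stub finding" }] };
    },
  };
}

const critic = makeStubCritic();
const result = runFragua("plan", {
  criticon: critic,
  coder: (spec) => `code for ${spec}`,
  reviewer: () => [],
  revisePlan: (text) => `${text} [fixed]`,
  reviseCode: (text) => `${text} [fixed]`,
});
```

In this stubbed run the plan is rejected once, revised, and approved before the coder is invoked, and the same critic object reviews both the plan and the code. The point of the structure is that the coder function takes the approved spec as its only argument, which mirrors the clean context window the article describes.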

The economics the author describes are non-trivial. CRITICON sessions consume roughly 30,000 to 50,000 Opus tokens per phase — not a negligible cost — but the author argues those tokens are structurally cheaper than the Opus tokens spent re-explaining context during debugging spirals. Two identified examples from the measurement week — a foreign-key ordering bug that would have triggered a five-round debugging session and an API assumption that would have required rebuilding a module — together represent rework the author estimates at three hours, against CRITICON sessions that cost the equivalent of one hour of unfocused Opus usage. The broader principle articulated is that the most expensive token in AI-assisted development is the one spent re-establishing context to fix something that should have been caught upstream. FRAGUA's value proposition is front-loading the expensive model at the critique layer, where a single round of Opus review can eliminate an entire downstream build.
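The trade-off above reduces to simple arithmetic, sketched here with the article's own figures. The function names are hypothetical, and the sketch assumes two critique phases per task (plan and implementation) and that the 30,000 to 50,000 figure covers a whole phase. The article does not state either detail, so treat the output as illustrative rather than as a measured result.

```typescript
// Illustrative arithmetic only: figures are the article's one-week
// estimates, and the function names are hypothetical.
interface TokenRange {
  min: number;
  max: number;
}

// Opus tokens spent on critique for one task, given tokens per phase.
function critiqueBudget(perPhase: TokenRange, phases: number): TokenRange {
  return { min: perPhase.min * phases, max: perPhase.max * phases };
}

// Hours of rework avoided per hour-equivalent spent on critique.
function payoffRatio(reworkHoursAvoided: number, critiqueHoursEquivalent: number): number {
  if (critiqueHoursEquivalent <= 0) {
    throw new RangeError("critique cost must be positive");
  }
  return reworkHoursAvoided / critiqueHoursEquivalent;
}

// Assumed: two critique phases at 30,000 to 50,000 Opus tokens each.
const budget = critiqueBudget({ min: 30000, max: 50000 }, 2);
// Three hours of estimated rework against one hour-equivalent of critique.
const ratio = payoffRatio(3, 1);
```

Under those assumptions the critique layer costs 60,000 to 100,000 Opus tokens per task and returns about three hours of avoided rework for each hour-equivalent spent, based on the two examples from the measurement week.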

This account sits within a well-documented trend in the Anthropic model ecosystem. Claude 3.5 Sonnet demonstrated that a mid-tier model could outperform the prior generation's flagship — solving 64% of internal agentic coding problems compared to 38% for Claude 3 Opus — and subsequent Sonnet iterations have continued pushing that benchmark upward. Claude 3.7 Sonnet introduced extended thinking modes and hybrid reasoning that made it competitive with far heavier models on enterprise automation tasks; Claude 4.5 and 4.6 added reliability improvements in multi-step agentic workflows and instruction-following. The pattern across these releases is consistent: Anthropic has been compressing capability downward through the model tiers while expanding speed and cost efficiency. What FRAGUA operationalizes is the logical endpoint of that compression — using the architecture of the workflow to match model capability to task requirements at each phase, rather than using the highest-capability model as a universal compensator for process deficits.

The author is candid about the limits of the data — one developer, one week, one stack — and acknowledges the prior art landscape, including Ralph Loop's autonomous retry mechanism, the GSD spec-driven workflow, and hamelsmu's single-pass cross-model review. What FRAGUA reportedly adds that existing frameworks do not combine is the pairing of design critique with isolated execution and implementation critique, and specifically the use of a persistent named Opus instance across iterative critique rounds so findings compound rather than reset. Whether this represents a genuinely novel contribution or a synthesis of existing ideas is a question the developer explicitly leaves open. What the experiment does illustrate more broadly is that the dominant cost driver in AI-assisted development may not be model pricing per token, but process design — and that workflows built to compensate for model limitations with raw capability may be producing the opposite of the efficiency they appear to offer.
