tested 9 models with and without agent skills. Haiku 4.5 with a skill beat baseline Opus 4.7.

Research testing 9 models across 880 evaluations found that Haiku 4.5 paired with agent skills achieved 84.3% accuracy, outperforming the baseline Opus 4.7 at 80.5% while costing significantly less at $0.12 per run versus $0.61. Skills provided greater performance gains to weaker models, with Haiku gaining 23.1 points compared to Opus's 14-point improvement, while the marginal cost of adding a skill to Haiku was minimal at 1.5 cents per run. The findings suggest that Haiku with appropriate agent skills offers sufficient performance and speed for routine coding tasks at a fraction of the cost of larger models.

Detailed Analysis

Researchers at Tessl conducted an 880-evaluation benchmark across 11 coding-focused agent skills, 8 models, and 5 scenarios to measure how structured "skills" — discrete, context-injected capability modules — affect AI model performance relative to unaided baselines. The headline finding is striking: Claude Haiku 4.5, Anthropic's smallest and least expensive model in the current lineup, achieved an 84.3% score when augmented with a relevant skill, surpassing Claude Opus 4.7's unaided baseline of 80.5%. Haiku 4.5 without any skill scored only 61.2%, meaning the skill injection alone accounted for a 23.1 percentage point lift. The benchmark covered coding-centric tasks such as commit message generation, code review, and refactoring, and the results held directionally across other vendors including OpenAI's Codex variants and Cursor's Composer-2.

The cost dimension of these findings is as consequential as the performance data. A Haiku 4.5 run augmented with a skill cost approximately $0.12, compared to $0.61 for a baseline Opus 4.7 run — a roughly 5x cost differential in favor of the smaller model. The marginal cost of adding a skill to Haiku was only $0.015, while the same skill added $0.39 per Opus run, suggesting that skill-augmented inference scales disproportionately in cost on larger models. For production environments running thousands or millions of agentic calls, this cost asymmetry has substantial implications for infrastructure budgeting and model selection strategy.

A notable structural finding from the research is that weaker models benefited more from skill augmentation than stronger ones. Haiku gained 23.1 points while Opus 4.7 gained only 14, suggesting that smaller models have more untapped headroom that structured context can unlock. This aligns with a broader pattern in AI engineering: stronger frontier models already internalize much of the domain knowledge that a skill explicitly supplies, so the marginal informational value of a skill is lower for them. Conversely, smaller models operating without such scaffolding are more severely constrained by their parametric knowledge limits, and external skill injection functions almost as a form of retrieval-augmented generation for behavioral guidance rather than factual recall.

These findings connect to a significant trend in applied AI development: the decomposition of monolithic model capability into layered, composable systems. Rather than treating model selection as a binary choice between raw capability tiers, practitioners are increasingly finding that system-level design — how context is structured, what domain knowledge is injected, and how tasks are scoped — can compensate for or even exceed raw model scale. The Tessl benchmark provides empirical grounding for what many engineers have suspected anecdotally: that a well-engineered smaller model often outperforms a poorly contextualized large one on bounded, well-defined tasks. This has direct implications for how teams architect multi-agent pipelines, where routing tasks to appropriately scoped models with purpose-built skills may deliver both better performance and significantly lower operational costs than defaulting to the most capable available model across the board.

The research does carry important caveats that temper generalization. The 11 skills tested are exclusively coding-focused, and the authors themselves characterize the findings as "directional" rather than definitive. Tasks requiring deep reasoning, novel synthesis, or cross-domain generalization may not see the same lift pattern. Nevertheless, for the large class of routine software development tasks that constitute the majority of day-to-day developer workflows, the Tessl data makes a substantive case that Haiku-class models equipped with targeted skills represent a credible default tier — a meaningful shift from the prevailing practice of reaching for Opus-tier models as a general-purpose fallback.

Read original article →

Detailed Analysis

Don't Miss a Deploy