I built a tool that measures whether a Claude Code skill actually improves output quality, and tested it on Caveman

A developer built SkillBenchmark to measure objectively whether Claude Code skills improve output quality: it runs each task multiple times with and without the skill, then has a judge LLM score the outputs blind against a rubric. When tested on Caveman, a popular skill that claims to cut output tokens by 65%, the benchmark found no statistically confirmed quality improvement on any of the three tested tasks, while the skill increased token costs.

Detailed Analysis

A developer named Ties Petersen has released SkillBenchmark, an open-source tool designed to empirically evaluate whether SKILL.md instruction files used in Claude Code actually improve the quality of AI-generated outputs. SKILL.md files are small instruction files that users drop into their Claude Code projects, in theory tuning the model's behavior for specific tasks such as writing commit messages, reviewing code, or generating documentation. Hundreds of such skills circulate online and are widely adopted, yet there has been no systematic way to validate their effectiveness, leaving users to rely on anecdotal impressions rather than quantitative evidence.

SkillBenchmark addresses this gap through a structured experimental design. The tool runs a given task N times under two conditions, with and without the skill injected as a system prompt, and routes all outputs to a judge LLM that scores them blind against a predefined rubric. Because the judge sees neither the original task prompt nor which condition produced each output, the scoring process is designed to minimize evaluator bias. Results are reported with a confidence interval (CI) for each condition and a CI for the difference between them (the delta CI), allowing users to distinguish genuine quality differences from statistical noise. This methodology borrows from established A/B testing and blind evaluation practices common in empirical machine learning research.
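The blinding and delta-CI steps described above can be sketched in a few lines of Python. This is a minimal illustration, not SkillBenchmark's actual code: the function names (`blind_outputs`, `delta_ci`, `is_confirmed`) are hypothetical, the percentile bootstrap is an assumed stand-in for whatever interval method the tool really uses, and the scores are made up.

```python
import random
import statistics


def blind_outputs(with_skill, without_skill, seed=0):
    """Shuffle outputs from both conditions and hide which is which.

    Returns (items, key): items is a list of (opaque_id, text) pairs that are
    safe to show a judge; key maps opaque_id -> condition for un-blinding.
    """
    rng = random.Random(seed)
    labelled = [("with_skill", text) for text in with_skill]
    labelled += [("without_skill", text) for text in without_skill]
    rng.shuffle(labelled)
    items, key = [], {}
    for i, (condition, text) in enumerate(labelled):
        opaque_id = f"sample-{i}"  # carries no hint about the condition
        items.append((opaque_id, text))
        key[opaque_id] = condition
    return items, key


def delta_ci(with_scores, without_scores, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(with_scores) - mean(without_scores).

    Assumption: the bootstrap is used here only because it needs no
    external libraries; the tool's own method may differ.
    """
    rng = random.Random(seed)
    deltas = sorted(
        statistics.mean(rng.choices(with_scores, k=len(with_scores)))
        - statistics.mean(rng.choices(without_scores, k=len(without_scores)))
        for _ in range(n_resamples)
    )
    lo = deltas[int(alpha / 2 * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_confirmed(ci):
    """Treat a difference as confirmed only if the delta CI excludes zero."""
    lo, hi = ci
    return lo > 0 or hi < 0


# Made-up rubric scores for illustration only; not the published results.
with_scores = [7.0, 8.0, 6.5, 7.5, 7.0]
without_scores = [7.0, 7.5, 7.0, 6.5, 8.0]
ci = delta_ci(with_scores, without_scores)
print(ci, "confirmed" if is_confirmed(ci) else "not confirmed")
```

In this sketch the judge would receive only the `(opaque_id, text)` pairs, and the `key` mapping would be consulted after scoring to group scores by condition. With only five runs per condition any interval will be wide, which is one reason a small real effect can go unconfirmed.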

Petersen applied SkillBenchmark to Caveman, one of the more prominent SKILL.md files, which claims to reduce LLM output tokens by approximately 65% while preserving technical accuracy. Across three tasks, with five runs each and three judges, the results showed no statistically confirmed quality improvement on any task: all confidence intervals overlapped. More notably, Caveman increased token costs on every run because of the overhead of injecting the skill as a system prompt, which undercuts its core efficiency claim. These results do not show that Caveman or similar skills are universally ineffective, but they do illustrate that claimed benefits can fail to materialize under controlled measurement.
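A rough piece of arithmetic shows how a skill that shortens outputs can still raise total token use: the skill text is paid for as extra input on every run, so the savings on output have to exceed that fixed overhead. The sketch below uses hypothetical numbers (a 600-token skill prompt and a 400-token baseline answer) and hypothetical function names; none of these figures come from the benchmark, and only the 65% reduction is Caveman's stated claim.

```python
def net_token_change(skill_prompt_tokens, output_without, output_with):
    """Change in total tokens per run from enabling a skill.

    Positive means the run used more tokens with the skill than without,
    because the skill prompt is added as input on every run.
    """
    return skill_prompt_tokens + output_with - output_without


def break_even_output_tokens(skill_prompt_tokens, reduction):
    """Baseline output length above which the skill saves tokens overall."""
    return skill_prompt_tokens / reduction


# Hypothetical figures for illustration only.
skill_prompt_tokens = 600
baseline_output = 400
claimed_reduction = 0.65  # the reduction Caveman claims for output tokens

with_output = round(baseline_output * (1 - claimed_reduction))
print(net_token_change(skill_prompt_tokens, baseline_output, with_output))  # 340
print(break_even_output_tokens(skill_prompt_tokens, claimed_reduction))
```

Under these assumed numbers the run costs 340 more tokens with the skill even if the full 65% output reduction is achieved, and the skill would only break even on answers longer than roughly 923 tokens. Providers often price input and output tokens differently, so a cost-based break-even would shift accordingly.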

The broader significance of this work lies in what it reveals about the culture surrounding AI prompt engineering. The SKILL.md ecosystem has grown organically and rapidly, driven by user enthusiasm and informal knowledge-sharing rather than reproducible validation. This mirrors broader patterns in the early development of AI tooling, where heuristics and community consensus often substitute for rigorous evaluation. SkillBenchmark represents a small but meaningful corrective step, introducing the kind of empirical discipline that the field increasingly requires as AI coding assistants become embedded in professional software development workflows.

The release also touches on a structural challenge in evaluating prompt-based AI behavior: the judge-LLM approach, while clever, introduces its own potential confounds, including sensitivity to rubric framing and the capabilities of the judge model itself. As Claude Code and similar agentic coding tools continue to expand their ecosystems of community-contributed configurations and skills, tools like SkillBenchmark may catalyze a more evidence-based standard for what gets published and adopted. The repository is publicly available and ships with the Caveman benchmark as a ready-to-run example, lowering the barrier for other developers to validate skills before incorporating them into production workflows.
