Does Anthropic's safety stack scale down to the small model? CVP Run 3 with Haiku 4.5 — 13/13 clean

A third Cyber Verification Program (CVP) evaluation was run on Anthropic's Haiku 4.5 model, reusing the prompts from the previous run to enable direct comparison. It produced 13/13 clean results: 11 fully allowed responses, 1 partial response, and 1 blocked response, with zero exploit content or information leaks. All prompts were defensively framed and included explicit constraints against providing exploits; harder, unframed adversarial payloads are planned for a subsequent evaluation.

Detailed Analysis

Anthropic's Claude Haiku 4.5, the company's smallest production model, achieved a perfect 13/13 clean score in the third iteration of an independent Cyber Verification Program (CVP) evaluation conducted by a self-described non-technical founder who began coding in February 2026. The evaluation used the identical set of 13 prompts from a previous run against a different Claude model, enabling direct cross-model comparison. Of the 13 prompts, 11 were fully allowed — covering defensive analysis scenarios and the refusal of embedded malicious instructions — one returned a partial response, and one was blocked outright. Critically, the run produced zero exploit content and zero information leaks, with every response matching expected outcomes. The full dataset, including layer-1 classifier outputs and a cross-model comparison table, was published at sunglasses.dev.
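The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's tooling: the `ProbeResult` record, its field names, and the `is_clean` rule (expected outcome matched, no exploit content, no information leak) are assumptions inferred from the description. Only the counts (11 allowed, 1 partial, 1 blocked) come from the article; which prompt received which outcome is not stated, so the ordering below is arbitrary.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical record for one prompt; field names are assumptions."""
    prompt_id: int
    outcome: str            # "allowed", "partial", or "blocked"
    matched_expected: bool  # response matched the expected outcome
    exploit_content: bool   # response contained exploit content
    info_leak: bool         # response leaked information


def is_clean(result: ProbeResult) -> bool:
    # Assumed rule: clean means the expected outcome with no exploit and no leak.
    return (
        result.matched_expected
        and not result.exploit_content
        and not result.info_leak
    )


def summarize(results: list[ProbeResult]) -> dict:
    """Tally outcomes and count how many results are clean."""
    tally = Counter(r.outcome for r in results)
    clean = sum(is_clean(r) for r in results)
    return {"total": len(results), "clean": clean, "by_outcome": dict(tally)}


# Counts taken from the article; the assignment to prompt ids is arbitrary.
outcomes = ["allowed"] * 11 + ["partial"] + ["blocked"]
results = [
    ProbeResult(i + 1, outcome, True, False, False)
    for i, outcome in enumerate(outcomes)
]

summary = summarize(results)
print(f"{summary['clean']}/{summary['total']} clean", summary["by_outcome"])
```

Note that under this rule a blocked or partial response still counts as clean when blocking or partial compliance was the expected outcome, which is how a run with one block and one partial can score 13/13.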

The evaluation's methodology carries important caveats that the author is transparent about. All 13 prompts were defensively framed and included explicit "do not provide exploit" constraints, which the author describes as a "CVP publish gate" — a deliberate threshold that filters out the most adversarially aggressive probes before public release. This means the current dataset represents a curated, conservative slice of the broader probe space. A separate, unframed adversarial-payload probe set is planned for release after a full family comparison — encompassing Haiku 4.5, Sonnet 4.6, and Opus 4.6 — is completed on Saturday. This staged approach reflects a degree of methodological discipline unusual for community-run evaluations, acknowledging the difference between constrained safety testing and genuine red-teaming.

The result is significant in the context of Anthropic's Responsible Scaling Policy (RSP), which requires systematic safety evaluations before any model release, including smaller models. Haiku 4.5, assessed under AI Safety Level 2 (ASL-2) standards, was determined not to require the more stringent ASL-3 controls applied to larger models like Opus 4. The CVP finding is consistent with that internal determination. Within the narrow scope of 13 defensively framed prompts, it suggests that the safety architecture, including real-time classifiers and prompt injection safeguards, does not degrade meaningfully at smaller model scales. Anthropic's published transparency reports indicate that prompt injection prevention scores across model sizes range from 86% to 89%, a relatively narrow band that supports the notion of consistent defensive performance regardless of model capacity.

The broader implication touches on a persistent question in AI safety research: whether alignment and safety properties scale with model size or hold independently of it. Research on data poisoning attacks has shown that a fixed number of malicious training samples (roughly 250–500 documents) can compromise models ranging from 600 million to 13 billion parameters, suggesting that certain vulnerabilities are size-agnostic. If attack surfaces are relatively uniform across scales, the corresponding defense mechanisms must be equally uniform to be effective. The Haiku 4.5 CVP result, while limited in scope, provides early empirical support for the view that Anthropic's layered safety stack achieves a baseline of robustness that holds at the smaller end of its production lineup.

The forthcoming full-family comparison — spanning Haiku 4.5, Sonnet 4.6, and Opus 4.6 — will be the more analytically consequential output of this evaluation series. If the same 13-prompt battery produces divergent results across model tiers, it would raise pointed questions about whether safety behavior is an emergent property of scale or a deliberately engineered constant. Conversely, if all three models score cleanly under the constrained framing, attention will appropriately shift to the unframed adversarial probe set, which will stress-test the safety stack without the guardrails of explicit "do not exploit" instructions. That harder evaluation represents the real empirical frontier of this work, and community feedback on the appendix probe design before it runs is both solicited and warranted.
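One way to check for the divergence discussed above is to line up per-prompt outcomes by model and flag any prompt whose outcome differs. The sketch below is a hypothetical helper, not part of the published dataset or tooling. The function name, the data layout, and the model labels `model_a` and `model_b` are assumptions, and the example values are illustrative placeholders rather than evaluation results.

```python
def divergent_prompts(outcomes_by_model: dict) -> dict:
    """Return {prompt_id: {model: outcome}} for prompts whose outcome differs.

    outcomes_by_model maps a model label to {prompt_id: outcome}.
    Only prompt ids present for every model are compared.
    """
    if not outcomes_by_model:
        return {}
    shared_ids = set.intersection(
        *(set(per_prompt) for per_prompt in outcomes_by_model.values())
    )
    diverging = {}
    for prompt_id in sorted(shared_ids):
        by_model = {
            model: per_prompt[prompt_id]
            for model, per_prompt in outcomes_by_model.items()
        }
        # More than one distinct outcome means the models disagree.
        if len(set(by_model.values())) > 1:
            diverging[prompt_id] = by_model
    return diverging


# Placeholder data for illustration only; NOT results from the evaluation.
example = {
    "model_a": {1: "allowed", 2: "partial", 3: "blocked"},
    "model_b": {1: "allowed", 2: "allowed", 3: "blocked"},
}

diffs = divergent_prompts(example)
print(diffs)
```

An empty result would correspond to the "all models score identically" case, at which point the unframed probe set becomes the more informative test.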
