Does Anthropic's safety stack scale down to the small model? CVP Run 3 with Haiku 4.5 — 13/13 clean

A researcher ran a third CVP (Cyber Verification Program) evaluation, this time against Anthropic's Haiku 4.5 model, reusing the same 13 prompts from a previous run to enable direct comparison. The result was 13/13 clean: 11 allowed, 1 partial, and 1 blocked, with zero exploit content and zero information leaks. All test prompts were defensively framed with explicit constraints against providing exploits, which the evaluator describes as the verification gate operating as designed. Further evaluations of the larger Sonnet 4.6 and Opus 4.6 models are planned, with a complete family comparison scheduled for the following Saturday.

Detailed Analysis

An independent evaluator running a Cyber Verification Program (CVP) assessment found that Claude Haiku 4.5, Anthropic's smallest production model, achieved a perfect 13/13 match-versus-expected score across a standardized set of cybersecurity prompts. The evaluation, conducted by a self-described non-technical founder who only began coding in February 2026, used the identical 13-prompt battery from a prior run to enable direct cross-model comparison. Of the 13 prompts, 11 received allowed responses (a category that includes defensive analysis tasks and refusals of embedded malicious instructions), one received a partial response, and one was blocked outright. Critically, the evaluation recorded zero instances of exploit content generation and zero data leaks, and the full layer-1 classifier outputs and a cross-model comparison table were published alongside the results. The evaluator explicitly notes that all prompts were defensively framed with "do not provide exploit" constraints, a necessary caveat that limits what the clean score actually certifies.
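The tally described above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluator's actual tooling: the `Result` record, the `summarize` helper, and the prompt IDs `p01` to `p13` are hypothetical names, and the assignment of outcomes to individual prompts is arbitrary, since the article reports only the totals (11 allowed, 1 partial, 1 blocked, all matching expectations, with no exploit content or leaks).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One prompt's outcome (hypothetical record layout)."""
    prompt_id: str
    expected: str            # "allowed", "partial", or "blocked"
    observed: str            # same vocabulary as `expected`
    exploit_content: bool = False
    leak: bool = False


def summarize(results):
    """Count outcomes, match-versus-expected hits, and overall cleanliness."""
    outcomes = Counter(r.observed for r in results)
    matches = sum(r.expected == r.observed for r in results)
    clean = all(not r.exploit_content and not r.leak for r in results)
    return {
        "total": len(results),
        "matches": matches,
        "outcomes": dict(outcomes),
        "clean": clean,
    }


# Assumption: which prompt received which outcome is not stated in the
# article, so the ordering below is illustrative only. Only the totals
# (11 / 1 / 1) come from the reported result.
labels = ["allowed"] * 11 + ["partial", "blocked"]
results = [
    Result(prompt_id=f"p{i:02d}", expected=label, observed=label)
    for i, label in enumerate(labels, start=1)
]

summary = summarize(results)
print(f"{summary['matches']}/{summary['total']} match expected; "
      f"clean={summary['clean']}; outcomes={summary['outcomes']}")
```

Keeping "matches expected" separate from "clean" reflects the distinction the article draws: a blocked or partial response can still count as a match if that was the expected behavior, while cleanliness depends only on the absence of exploit content and leaks.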

The significance of this result is best understood within Anthropic's tiered Responsible Scaling Policy (RSP) framework. Anthropic uses AI Safety Level (ASL) designations to calibrate deployment safeguards to model capability, and its evaluations determined that Haiku 4.5 did not require the more stringent ASL-3 protections applied to Claude Opus 4, the first model released under that higher standard. ASL-3 safeguards are specifically engineered to address catastrophic-risk capabilities, including potential uplift related to chemical, biological, radiological, and nuclear technologies. The CVP evaluation's clean result on Haiku 4.5 is therefore broadly consistent with Anthropic's own internal tiering: a smaller model operating under lighter formal safety requirements nonetheless appears to maintain behavioral alignment on a structured, defensively framed cybersecurity battery. The question the evaluator is implicitly probing, whether the safety stack scales down, receives a preliminary affirmative answer, though the author is careful to flag that harder, unframed adversarial payload testing remains pending.

The methodology and its acknowledged limitations deserve close attention. The CVP publish gate, as described, functions as an intentional constraint: only defensively framed prompts with explicit refusal instructions are included in the published battery. This design choice reflects a responsible disclosure ethic, but it also means the 13/13 result cannot be read as evidence of robustness against sophisticated, unframed jailbreak attempts or prompt injection of the kind red teams typically use. The forthcoming "appendix probe set", designed to test unframed adversarial payloads, will be materially more informative about Haiku 4.5's actual attack surface. The evaluator's transparent acknowledgment of this boundary is itself noteworthy, and the request for community feedback on probe design before execution suggests an iterative, open methodology that is relatively uncommon in informal third-party AI evaluations.

The broader context of this evaluation sits within an accelerating trend of community-driven AI safety benchmarking, particularly in the cybersecurity domain. Anthropic's own Mythos cybersecurity risk preview, referenced in contemporaneous reporting from April 2026, signals the company's active engagement with questions of how its models interact with offensive security use cases. Independent CVP-style evaluations serve a complementary function to internal red-teaming: they introduce external vantage points, novel prompt designs, and public accountability that internal processes alone cannot replicate. The planned full family comparison — spanning Haiku 4.5, Sonnet 4.6, and Opus 4.6 — will offer a rare cross-model view of how safety behaviors shift as model scale increases within a single product generation, a dataset that would carry real analytical value for researchers studying capability-safety tradeoffs.

What makes this particular data point especially worth tracking is the identity of the evaluator. A non-technical founder with less than three months of coding experience publishing structured, classifier-documented safety evaluations with explicit scope disclosures represents a meaningful democratization of AI auditing practice. It suggests that Anthropic's model behaviors, and the artifacts needed to evaluate them, are legible enough that reasonably rigorous informal assessment is no longer exclusively the domain of credentialed security researchers. Whether that accessibility reflects good model documentation, good community tooling, or some combination will matter for how the field develops norms around the credibility of third-party evaluations.
