Does higher effort make Claude refuse more? CVP Run 5 with Opus 4.6 Medium and High

CVP Run 5 tested Opus 4.6 across medium and high effort tiers on a 13-prompt suite, achieving identical verdicts across both levels with all 26 prompts yielding clean results. Higher effort settings increased the depth of engaged answers by 29-47% while refusals grew only 11%, indicating that effort tier changes response depth rather than refusal behavior. This finding aligns with previous run 4 results on Sonnet 4.6, contradicting the community belief that higher effort causes more refusals.

Detailed Analysis

A self-described non-technical founder conducting structured evaluations under a Cyber Verification Program (CVP) has published results from Run 5, testing Claude Opus 4.6 across medium and high effort tiers using an identical 13-prompt suite previously applied in Runs 3 and 4. The headline finding is unambiguous: all 26 prompts across both effort tiers returned clean, consistent verdicts, with zero divergence in what the model decided to do. The meaningful difference between effort levels was strictly quantitative in output depth — engaged, substantive answers grew between 29% and 47% longer under high effort, while refusals expanded only 11%. The model's posture, its willingness or reluctance to engage with a given prompt, remained entirely stable regardless of effort setting.

The significance of the finding lies in its direct challenge to a persistent community assumption that raising effort levels causes Claude to become more cautious or restrictive. The researcher notes that Run 4, which tested Sonnet 4.6 across high and max effort tiers, exhibited the same pattern — making this now two within-run effort comparisons across two distinct model families pointing in the same direction. The data reframes what effort actually controls: it governs reasoning depth and response elaboration, not safety posture or alignment behavior. This aligns with Anthropic's own documentation, which positions effort tiers as a mechanism for balancing token usage and thinking depth rather than as a safety dial. Higher effort settings trigger more frequent and thorough internal reasoning steps, which on complex tasks improves performance, but this added cognitive overhead does not translate into heightened refusal sensitivity.

Claude Opus 4.6's behavior in this evaluation is consistent with its reported benchmark profile. The model has demonstrated notably low over-refusal rates on benign requests — documented as low as 0.04% for harmless prompts with sufficient contextual framing — suggesting that its calibration already sits closer to the permissive end of the legitimate-request spectrum compared to earlier Claude generations. The effort system, which replaced the deprecated `budget_tokens` parameter, was designed to give developers granular control over the compute-versus-thoroughness tradeoff, and the CVP data suggests that calibration holds even under stress-testing conditions like a structured adversarial prompt suite. The risk of higher effort settings, per Anthropic's documentation, is overthinking on simple tasks and increased latency or cost — not safety regression.

The broader implication of this work sits at the intersection of model evaluation methodology and community-level understanding of how large language model configuration parameters interact with safety behavior. The conflation of "more thinking" with "more restrictive" is a natural but apparently incorrect heuristic, and structured replication across model families is a meaningful way to surface that kind of misunderstanding. The researcher's completion of the four-model Anthropic family scoreboard — spanning Opus 4.7, Opus 4.6, Sonnet 4.6, and Haiku 4.5 — positions the forthcoming family-comparison synthesis report as potentially the most comprehensive publicly available effort-tier analysis across a single provider's model lineup, notable given it originates from outside formal research institutions.

This work also reflects a wider trend in AI evaluation: the emergence of independent, reproducible benchmarking frameworks built by practitioners outside academia or industry. As models grow more configurable — with effort tiers, system prompt framing, and context richness all affecting output behavior — community-driven methodologies like CVP fill a real gap in the public understanding of how these variables interact. The finding that effort equals depth rather than posture, if it holds across further replication and different prompt domains, would carry practical weight for developers designing systems around Claude's API, informing decisions about when to invest in higher effort settings and what behavioral guarantees can and cannot be expected in return.

Read original article →

Detailed Analysis

Don't Miss a Deploy