Blinded A/B to actually measure the 4.6 → 4.7 difference instead of going on vibes.

A blinded A/B test measured differences between Claude 4.6 and 4.7 across 30 trials, with independent judges finding that 4.7 won 19 of 30 comparisons (Sonnet) and 17 of 29 (Grok). Model 4.7 demonstrated measurable improvements in honesty, restraint, depth, and fit, particularly in refusing speculation, flagging predatory practices, and challenging problematic framing, while 4.6 maintained advantages in technical precision and edge case analysis. The test results were validated across different model families with all data made publicly available.

Detailed Analysis

A community researcher published a blinded A/B evaluation of Anthropic's Claude Opus 4.6 and 4.7 models, designed explicitly to replace subjective impressions with quantified, reproducible evidence. Using two independent AI judges from separate model families — Sonnet and Grok — the study ran 30 matched trial pairs and found that Claude 4.7 outperformed 4.6 in 19 of 30 Sonnet-judged comparisons and 17 of 29 Grok-judged comparisons. Critically, both judges reached independent agreement on the model's areas of strength and weakness, lending cross-family validation to the findings. All 30 response pairs, judge reasoning, and raw data were made publicly available at github.com/templetwo/opus-gauge, with a dedicated confounds section acknowledging methodological limitations.

The qualitative dimensions on which 4.7 outperformed its predecessor are revealing in their specificity. In scenarios involving interpersonal speculation — such as why a contact hasn't replied to an email — 4.7 declined to guess, achieving a clean 5-0 sweep, while 4.6 presumably offered more elaborated but epistemically unjustified responses. On a financially sensitive question about borrowing against personal property, 4.7 proactively flagged predatory loan terms rather than offering only generic caution, demonstrating a more nuanced grasp of real-world risk. Perhaps most tellingly, when prompted to validate a user's framing, 4.7 challenged the premise itself rather than constructing an elaborate but ultimately compliant non-answer — a behavior consistent with the study's four measured dimensions: honesty, restraint, depth, and fit. The researcher's framing is pointed: 4.7 is "genuinely better at saying 'I don't know' and genuinely worse at performing helpfulness."

The domain where 4.6 retains an edge is equally instructive. Both AI judges independently agreed that 4.6 produces superior results in technical and code-generation tasks, particularly in the comprehensive analysis of edge cases. This aligns with Anthropic's own published benchmarks, which show Claude Opus 4.7 achieving measurable gains in visual acuity (+81%), tool call accuracy (+10–15%), and multi-step workflow performance (+14%), but do not specifically position 4.7 as dominant in raw code precision. The community study thus provides a user-facing complement to Anthropic's internal benchmark results, suggesting that version improvements are not uniformly distributed across task types and that code-heavy use cases may warrant continued reliance on 4.6.

The broader significance of this study lies in its methodological rigor at the community level. A persistent challenge in evaluating large language model iterations is that users often form impressions — "more uncertain," "different energy," "more positive" — that resist easy quantification. The researchers explicitly note that these perceptual signals from discussion threads do appear as measurable variance across their four evaluation dimensions, validating informal community observations while grounding them in structured data. This points to a growing capacity within AI user communities to conduct credible independent evaluations, rather than relying solely on developer-released benchmarks or unstructured user consensus.

This effort also reflects a broader trend in the AI ecosystem: the push toward transparent, reproducible model evaluation as a counterweight to marketing-adjacent release narratives. As model families become increasingly complex — with releases like Opus, Sonnet, and Haiku serving distinct user populations at different capability and cost tiers — the demand for nuanced, task-specific comparison data grows accordingly. Community-driven blinded studies of this kind, when executed with honest methodology disclosures and public data, contribute meaningfully to collective model literacy and may increasingly inform how developers understand how their own improvements are received at the ground level.

Read original article →

Detailed Analysis

Don't Miss a Deploy