Claude Opus 4.7 won 69 of 100 blind evals against Opus 4.6, judged by GPT-5.4, Gemini 3.1 Pro, and DeepSeek V3.2

Claude Opus 4.7 won 69 of 100 blind evaluations against Opus 4.6 by majority vote of three independent judge models: GPT-5.4, Gemini 3.1 Pro, and DeepSeek V3.2. GPT-5.4 and Gemini 3.1 Pro favored Opus 4.7 in roughly 70% and 78% of their judgments, respectively, while DeepSeek V3.2 systematically preferred Opus 4.6, selecting it in 54 of 97 valid judgments. Because the judges saw identical questions and followed the same evaluation protocol, the divergence indicates that single-judge leaderboards can produce unreliable comparative results.

Detailed Analysis

A community-conducted blind evaluation pitting Claude Opus 4.7 against Opus 4.6 across 100 structured prompts found that Opus 4.7 won 69 of 100 comparisons by majority vote of three independent AI judges — GPT-5.4, Gemini 3.1 Pro, and DeepSeek V3.2. The evaluation, published on Reddit's r/ClaudeAI and backed by open-source tooling on GitHub, distributed questions evenly across five categories: code, reasoning, analysis, communication, and meta-alignment. Opus 4.7 demonstrated its strongest advantages in analysis (16–4) and communication (14–6), with more modest but still positive leads in code (13–6–1), reasoning (12–8), and meta-alignment (13–7). Both models were accessed via OpenRouter at identical inference configurations — temperature 0.7 and max_tokens 4096 for contestants, temperature 0.2 for judges — with response order randomized to prevent position bias. The broader benchmark record supports the directional finding: independent sources document Opus 4.7 improvements including an SWE-bench Verified score of 87.6% versus 80.8% for 4.6, a GPQA Diamond score of 94.2% versus 91.3%, and substantial gains in vision resolution and coding production tasks.
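The protocol described above (randomized response order, three judges, majority vote) can be sketched roughly as follows. This is a minimal illustration, not the author's published tooling: the function names, the slot labels "A" and "B", and the treatment of an even split as a tie are all assumptions made for the example. Only the temperature and max_tokens values come from the write-up.

```python
import random
from collections import Counter

# Settings reported in the write-up; the dict names are illustrative only.
CONTESTANT_CONFIG = {"temperature": 0.7, "max_tokens": 4096}
JUDGE_CONFIG = {"temperature": 0.2}

def blind_order(resp_47, resp_46, rng):
    """Randomly assign the two responses to anonymous slots 'A' and 'B'."""
    slots = [("opus-4.7", resp_47), ("opus-4.6", resp_46)]
    rng.shuffle(slots)
    key = {"A": slots[0][0], "B": slots[1][0]}    # hidden from the judges
    shown = {"A": slots[0][1], "B": slots[1][1]}  # what each judge sees
    return shown, key

def majority_winner(judge_picks, key):
    """Resolve judge picks ('A', 'B', or None for an invalid judgment).

    Assumption: an even split among valid judgments counts as a tie.
    """
    tally = Counter(key[p] for p in judge_picks if p in key)
    ranked = tally.most_common()
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return "tie"
    return ranked[0][0]

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
shown, key = blind_order("response from 4.7", "response from 4.6", rng)
winner = majority_winner(["A", "A", "B"], key)
```

Keeping the slot-to-model key outside what the judge sees is what makes the comparison blind; the vote is mapped back to a model name only after all judges have answered.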

The methodologically significant finding is not the headline win rate but the sharp disagreement between judges. GPT-5.4 gave Opus 4.7 a 69.7% win rate and Gemini 3.1 Pro gave it 77.6%. DeepSeek V3.2, presented with identical prompts and rubric, instead preferred Opus 4.6 in 54 of its 97 valid judgments (about 55.7%), reversing the direction of the result. Crucially, DeepSeek's preference for Opus 4.6 was not confined to a single category but appeared systematically across all five domains, suggesting a structural divergence in how DeepSeek V3.2 operationalizes quality rather than noise from a small sample. This is a concrete empirical demonstration of a known theoretical problem: LLM-as-judge evaluations embed the aesthetic and architectural biases of whichever model is selected to adjudicate, and those biases can be large enough to determine the direction of the conclusion.
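One way to quantify this kind of judge disagreement is to compute each judge's win rate over its valid judgments and the pairwise agreement between judges. The sketch below uses a small hypothetical verdict table (the judge names and votes are invented for illustration and are not the study's data); invalid judgments are represented as None and excluded, which mirrors the "54 of 97 valid judgments" framing.

```python
from itertools import combinations

# Hypothetical toy verdicts, one entry per question: "4.7", "4.6", or None (invalid).
verdicts = {
    "judge_x": ["4.7", "4.7", "4.6", "4.7", None],
    "judge_y": ["4.7", "4.7", "4.7", "4.6", "4.7"],
    "judge_z": ["4.6", "4.7", "4.6", "4.6", "4.6"],
}

def win_rate(votes, model):
    """Share of a judge's valid judgments that went to `model`."""
    valid = [v for v in votes if v is not None]
    return sum(v == model for v in valid) / len(valid)

def pairwise_agreement(table):
    """Fraction of questions on which each pair of judges picked the same model.

    Only questions where both judges returned a valid verdict are counted.
    """
    result = {}
    for a, b in combinations(sorted(table), 2):
        both = [(x, y) for x, y in zip(table[a], table[b])
                if x is not None and y is not None]
        result[(a, b)] = sum(x == y for x, y in both) / len(both)
    return result

rates = {judge: win_rate(votes, "4.7") for judge, votes in verdicts.items()}
agreement = pairwise_agreement(verdicts)
```

A low agreement figure between two judges with opposite overall preferences would point to consistent disagreement on the same questions, rather than two noisy judges that happen to land on different totals.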

This finding lands in a context where automated LLM evaluation has become the dominant methodology for rapid benchmarking. Human evaluation is slow and expensive; model-based judgment scales cheaply and produces structured outputs suitable for leaderboards. The tradeoff is that the judge itself becomes a confound. A model trained with different RLHF reward signals, different data mixtures, or different definitions of "helpful" will systematically prefer responses that match its own internal priors. The three-judge majority-vote protocol the author employed partially hedges against this by requiring cross-family consensus, but the extreme divergence of DeepSeek V3.2, a model developed in China under a different research culture and optimization target, illustrates how even majority voting may mask deep disagreements rather than resolve them. That the OpenAI and Google models agreed while the DeepSeek model dissented raises the question of whether judge agreement reflects genuine quality or shared training assumptions.

The broader implications connect to Anthropic's positioning of Opus 4.7 as a production upgrade over 4.6. Anthropic's own release documentation emphasizes coding and vision gains, and external commentary recommends it as the default for new coding and production workflows at unchanged pricing. The community evaluation aligns with that recommendation in aggregate, but the analysis result (16 wins for Opus 4.7 to 4 for Opus 4.6) is particularly notable given that analysis tasks tend to involve the subjective quality dimensions most sensitive to judge identity. One independent test also found Opus 4.7 failing a canvas animation task that 4.6 passed, a reminder that aggregate win rates coexist with localized regressions. Anthropic itself has flagged that Opus 4.7 interprets instructions more literally, which could explain both its advantage in structured analytical tasks and its occasional underperformance in creative or open-ended ones.

At 100 questions in total, 20 per category, the evaluation provides directional signal but not the statistical power needed to support narrow categorical claims. The reasoning split of 12–8 falls within the range that could plausibly reverse with a different question set, as the author acknowledges. What the study does establish with credibility is the methodological argument: single-judge leaderboards are fragile, and the selection of a judge model is not a neutral methodological choice. As frontier AI models from Anthropic, OpenAI, Google, and Chinese labs increasingly compete on overlapping capability axes, and as automated evaluation becomes the primary mechanism by which relative performance is established and communicated to developers, the identity and provenance of the judge may be as consequential as the quality of the model under evaluation. The open-source release of the evaluation engine allows the research community to replicate, extend, and stress-test these findings, a constructive contribution to an area where methodological transparency remains inconsistent.
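The point about statistical power can be made concrete with an exact sign test against a 50/50 null. The sketch below is not part of the original evaluation: the function name is hypothetical, and it assumes independent questions, excludes ties, and applies no correction for testing several categories at once.

```python
from math import comb

def two_sided_sign_test(wins, n):
    """Exact two-sided binomial p-value for `wins` out of `n` under p = 0.5."""
    k = max(wins, n - wins)  # measure from the more extreme side
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)  # the null is symmetric, so double one tail

reasoning = two_sided_sign_test(12, 20)   # about 0.50: compatible with a coin flip
analysis = two_sided_sign_test(16, 20)    # about 0.012
overall = two_sided_sign_test(69, 100)    # below 0.001
```

Under these assumptions, a 12–8 split over 20 questions is statistically indistinguishable from chance, while the 16–4 analysis split and the 69-of-100 aggregate are much harder to attribute to question sampling alone. This test says nothing about judge bias, which is a separate source of error.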
