this chart felt shady, so I fixed it (what I found will shock you!)

A data professional identified issues with a chart in Anthropic's Opus 4.8 system card: its logarithmic horizontal axis, its reporting of cost in output tokens rather than dollars, and the absence of Sonnet 4.6 as a comparison point. An independent analysis comparing model outputs across effort levels found that Opus 4.8 at low effort outperformed Sonnet 4.6 at medium, high, and maximum effort while costing less, which may explain why the original chart omitted Sonnet: it performs poorly for its cost on this benchmark.

Detailed Analysis

An independent researcher has reanalyzed a performance chart included in Anthropic's Claude Opus 4.8 system card, raising concerns about how transparently the original data was presented. The researcher identified three specific issues with the chart on page 195 of the system card: the use of a logarithmic horizontal axis, the expression of cost in output tokens rather than dollars, and the conspicuous absence of Claude Sonnet 4.6 as a comparison point despite its inclusion elsewhere throughout the document. To address these gaps, the researcher sampled 50 tasks at random from a publicly available 731-task benchmark set, ran evaluations across multiple effort levels, and graded the output patches inside Docker containers, using approximately 24 hours of compute time.
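The sampling step described above can be sketched as follows. This is a minimal illustration, not the researcher's actual script: the function name `sample_task_ids`, the use of integer task indices, and the fixed seed are all assumptions made here so that the subset can be reproduced by someone else.

```python
import random


def sample_task_ids(total_tasks: int = 731, sample_size: int = 50, seed: int = 0) -> list[int]:
    """Draw a reproducible random subset of task indices without replacement.

    The 731 and 50 defaults come from the article; the seed and the idea of
    addressing tasks by integer index are assumptions for illustration.
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(range(total_tasks), sample_size))


if __name__ == "__main__":
    task_ids = sample_task_ids()
    print(len(task_ids), "tasks selected; first five indices:", task_ids[:5])
```

Publishing the seed alongside the results is one way to let others rerun exactly the same 50 tasks, which matters when the sample is small enough that a different draw could shift the conclusions.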

The findings have practical implications for developers and organizations choosing between Claude model tiers and effort settings. The headline result is that Opus 4.8 on low effort outperforms Sonnet 4.6 on medium, high, or maximum effort, and does so at lower cost. That makes low-effort Opus the dominant choice for most tasks, the exception being those that Sonnet 4.6 can handle at its own lowest effort level. The reanalysis also shows that the log scale in the original chart obscured how expensive Opus 4.8 becomes at maximum effort: on a linear scale, the researcher describes the cost premium for max mode as "crazy," an effect the logarithmic presentation softened.
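The "dominant choice" claim above is a statement about cost/performance dominance: one configuration dominates another when it is at least as accurate and at least as cheap, and strictly better on one of the two. A minimal sketch of that check follows. The configuration names and every number in it are made-up placeholders, not figures from the reanalysis or the system card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    pass_rate: float  # fraction of tasks solved
    cost_usd: float   # dollar cost of the run (not output tokens)


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.pass_rate >= b.pass_rate and a.cost_usd <= b.cost_usd
    strictly_better = a.pass_rate > b.pass_rate or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical placeholder values, chosen only to exercise the logic.
configs = [
    Config("small-low", 0.40, 0.30),
    Config("small-high", 0.50, 1.50),
    Config("big-low", 0.60, 1.00),
    Config("big-max", 0.70, 9.00),
]

if __name__ == "__main__":
    for c in pareto_frontier(configs):
        print(c.name, c.pass_rate, c.cost_usd)
```

With these placeholder numbers, `small-high` drops off the frontier because `big-low` is both cheaper and more accurate, while `small-low` survives because nothing else is as cheap. That mirrors the shape of the article's claim without reproducing its data.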

These results offer a possible explanation for the absence of Sonnet 4.6 from the original system card chart: the model performs poorly relative to its cost on this benchmark, so including it would have produced an unflattering comparison. The researcher explicitly labels this explanation as speculative and notes that there are likely other task types where Sonnet 4.6 remains competitive. Even so, the data as presented suggests Anthropic may have made a selective editorial choice in constructing the chart. Selective presentation in official system cards is a notable concern, because these documents serve as the primary technical reference for enterprise customers and developers making deployment decisions.

The episode fits within a growing pattern of community-led scrutiny of how AI labs report benchmarks. As model releases become more frequent and more differentiated by version, tier, and configuration, communicating performance honestly becomes harder. Log scales, cherry-picked comparisons, and cost expressed in tokens rather than dollars are individually defensible choices, but together they make published evaluations harder to read. The researcher describes the work as statistically underpowered and notes that the method has not been validated against Anthropic's internal process. Even so, it demonstrates both the demand for and the value of independent replication at the community level. Open calls for follow-up, including cross-provider pricing comparisons and integration of GPT/Codex data, suggest that crowd-sourced benchmark auditing may become an increasingly standard check on official model documentation.
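To give a sense of what "statistically underpowered" means for a 50-task sample, here is a rough sketch that computes a 95% Wilson score interval for a pass rate. The pass count used in the example (25 of 50) is a hypothetical placeholder, not a result from the reanalysis, and the choice of the Wilson interval is an assumption made here rather than the researcher's stated method.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical example: 25 of 50 sampled tasks solved.
    low, high = wilson_interval(25, 50)
    print(f"95% interval: {low:.3f} to {high:.3f}")
```

For this placeholder case the interval is more than 25 percentage points wide, so two configurations whose measured pass rates differ by only a few points on 50 tasks cannot be reliably ranked. Larger gaps, or a larger sample from the 731-task set, would be needed to tighten the comparison.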
