Opus 4.7 makes major gains over previous Opus models in GDPval and hallucination reduction on Artificial Analysis; other gains are marginal

Opus 4.7 demonstrated significant improvements in generating PowerPoint presentations and spreadsheets while substantially reducing hallucination rates. Its vision capabilities proved particularly strong: it generated radar plots from images containing multiple bar charts with high accuracy, outperforming Gemini 3.1 Pro Preview, which produced hallucinated scores on the same task.

Detailed Analysis

Claude Opus 4.7, Anthropic's latest flagship model, delivers its most pronounced improvements in two areas: performance on the GDPval-AA benchmark for agentic knowledge work and lower hallucination rates. On GDPval-AA, Opus 4.7 reaches 1,753 Elo, 134 points ahead of its predecessor Opus 4.6 (1,619 Elo) and 79 points clear of both Sonnet 4.6 and GPT-5.4 (each at 1,674 Elo), establishing it as the leading model for professional knowledge work in domains such as finance and law. Its hallucination reduction is equally notable: users testing the model on chart and data interpretation tasks report significantly fewer fabricated or incorrect outputs than with prior Claude models and competing systems, and at least one direct comparison shows Gemini 3.1 Pro Preview generating materially incorrect scores from identical visual inputs.
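To give a rough sense of what the Elo gaps quoted above mean, the sketch below converts a rating difference into an expected head-to-head score. It assumes the conventional logistic Elo model with a 400-point scale; the article does not say how GDPval-AA ratings are fitted, so the resulting probabilities are illustrative only, and the function name is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: the 400-point scale is the classic Elo convention, not
    something the article confirms for GDPval-AA.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings as quoted in the article.
OPUS_4_7 = 1753
OPUS_4_6 = 1619
SONNET_4_6 = 1674  # GPT-5.4 is reported at the same rating

gap_vs_predecessor = OPUS_4_7 - OPUS_4_6  # 134 points
gap_vs_runner_up = OPUS_4_7 - SONNET_4_6  # 79 points

print(f"Gap vs Opus 4.6: {gap_vs_predecessor} Elo, "
      f"expected score {elo_expected_score(OPUS_4_7, OPUS_4_6):.2f}")
print(f"Gap vs Sonnet 4.6 / GPT-5.4: {gap_vs_runner_up} Elo, "
      f"expected score {elo_expected_score(OPUS_4_7, SONNET_4_6):.2f}")
```

Under that assumption, a 134-point lead corresponds to an expected score of roughly two thirds in a pairwise comparison, which is a clear edge but well short of winning every matchup.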

Beyond these headline improvements, Opus 4.7 shows substantial but more targeted gains across coding and multimodal benchmarks. SWE-bench Verified climbs from 80.8% to 87.6% and SWE-bench Pro from 53.4% to 64.3%, with the model resolving three times more production tasks on Rakuten-SWE-Bench than its predecessor. Multimodal capabilities see a particularly sharp jump: CharXiv visual reasoning improves from 69.1% to 82.1%, and ARC-AGI-2 reaches 75.83%. The model's ability to generate structured outputs such as spreadsheets, PowerPoint presentations, and radar plots from raw chart data reflects a deliberate engineering focus on practical, artifact-producing workflows rather than purely abstract reasoning. These gains are meaningful but uneven: Opus 4.7 registers a slight regression on τ²-Bench (-3.5 percentage points) and posts equivalent scores on LCR and CritPt, reinforcing the assessment that improvements are concentrated rather than universal.
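The benchmark movements above are easier to compare as percentage-point and relative gains. The following minimal sketch works only from the before/after scores quoted in the text; the helper names are illustrative, and benchmarks for which the article gives no prior score (such as ARC-AGI-2) are left out rather than guessed.

```python
# (benchmark, earlier score, Opus 4.7 score) in percent, as quoted in the article.
REPORTED_SCORES = [
    ("SWE-bench Verified", 80.8, 87.6),
    ("SWE-bench Pro", 53.4, 64.3),
    ("CharXiv", 69.1, 82.1),
]


def point_gain(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change relative to the earlier score, in percent, rounded to one decimal."""
    return round(100.0 * (after - before) / before, 1)


for name, before, after in REPORTED_SCORES:
    print(f"{name}: +{point_gain(before, after)} p.p. "
          f"({relative_gain(before, after)}% relative)")
```

Reporting both figures matters because the same point gain counts for more on a benchmark with a low starting score: SWE-bench Pro's 10.9-point rise is a far larger relative improvement than SWE-bench Verified's 6.8 points.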

The broader significance of Opus 4.7 lies in what it signals about Anthropic's near-term product strategy. By prioritizing hallucination reduction and agentic performance on knowledge work tasks, Anthropic is positioning Opus 4.7 as a tool for high-stakes professional deployments where factual reliability and multi-step task completion matter more than raw reasoning speed or cost efficiency. The model ties GPT-5.4 and Gemini 3.1 Pro on the Artificial Analysis Intelligence Index at 57 overall, suggesting competitive parity at the frontier tier rather than clear dominance. Its pricing of $5/$25 per million input/output tokens keeps it in the premium tier, which critics note makes it difficult to justify against substantially cheaper alternatives like Gemini 3 Flash for tasks where Opus 4.7's specific strengths are not required.
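The per-token pricing above translates into request cost with simple arithmetic. The sketch below uses the $5/$25 per million token rates quoted in the article; the token counts in the example are hypothetical and chosen only to illustrate the calculation, and the function is not part of any vendor SDK.

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_PRICE_PER_MTOK,
    output_price: float = OUTPUT_PRICE_PER_MTOK,
) -> float:
    """Estimate the cost of one request from its token counts."""
    return (input_tokens / 1_000_000) * input_price \
        + (output_tokens / 1_000_000) * output_price


# Hypothetical workload: a long document in, a report out.
example_cost = request_cost_usd(input_tokens=200_000, output_tokens=20_000)
print(f"Estimated cost: ${example_cost:.2f}")
```

Passing a cheaper model's rates through the `input_price` and `output_price` parameters gives a like-for-like comparison for the same workload, which is the trade-off the critics cited above are pointing at.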

This release fits a discernible pattern across the frontier AI landscape, in which top-tier labs increasingly differentiate their flagship models not by raw benchmark supremacy but by targeted capability profiles suited to enterprise verticals. Coding agents, financial modeling, legal document analysis, and structured data extraction are emerging as the proving grounds for model value, and Opus 4.7's benchmark leadership on GDPval-AA, Finance Agent v1.1, and Vals Index (71.4%) reflects a deliberate effort to win those verticals. The hallucination reduction story is particularly consequential in this context: for professional users staking business decisions on model outputs, the ability to trust that charts, tables, and data summaries are faithfully rendered, rather than plausibly confabulated, may matter more than marginal improvements on abstract reasoning tasks like GPQA Diamond, where Opus 4.7 still improved 2.9 percentage points to 94.2%.
