Detailed Analysis
A scholarly preprint posted to arXiv (arxiv.org/pdf/2604.24827) has attracted attention in AI circles by claiming to estimate parameter counts for major large language models, including Anthropic's Claude Opus 4.6 and 4.7. According to the paper's methodology, Opus 4.7 registers approximately 4 trillion parameters compared to 4.6 trillion for its predecessor, Opus 4.6 — a finding that has circulated on social media alongside a screenshot purportedly showing these figures. The claim has generated discussion because it would represent an unusual regression in model scale at a time when Anthropic is publicly positioning Opus 4.7 as a significant advancement over 4.6.
The available evidence complicates the paper's narrative. Anthropic has documented Opus 4.7 as delivering substantial benchmark improvements over 4.6, including a jump from 58% to 70% on CursorBench, a 6–8 point improvement on SWE-Bench multi-file tasks, 3x more production tasks resolved on Rakuten-SWE-Bench, and meaningfully enhanced vision capabilities, with image support rising from 1.15 to 3.75 megapixels. Neither Anthropic's official documentation nor credible third-party analyses cite parameter counts for either model, and no source corroborates the claimed reduction. The absence of any parameter disclosure is itself notable: Anthropic, like most frontier AI labs, treats model architecture details as proprietary, which leaves third-party estimation as the only available avenue for such figures, and a structurally uncertain one.
The methodology underlying the preprint deserves scrutiny. Estimating parameter counts for closed, proprietary models is a notoriously difficult problem. Such approaches typically rely on indirect signals (inference latency, compute costs, token throughput, or probing techniques), each of which introduces its own uncertainty, and those uncertainties compound. Opus 4.7's new tokenizer, which generates up to 35% more tokens for equivalent text, and the removal of sampling parameters such as temperature and top_p are significant changes that could confound estimation techniques calibrated on earlier model generations. In other words, the paper's methodology may be measuring proxies that are no longer stable across model versions.
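To make the tokenizer concern concrete, the following is a minimal sketch of how token inflation alone could skew a latency-based size proxy. It is not the preprint's method. It rests on a toy assumption, labeled in the code, that per-token generation cost is proportional to parameter count, and the function name and arguments are hypothetical.

```python
def apparent_size_ratio(true_ratio: float, token_inflation: float,
                        corrected: bool) -> float:
    """Apparent new/old parameter ratio inferred from timing the same text.

    true_ratio      -- actual new/old parameter ratio (unknown in practice)
    token_inflation -- extra tokens the new tokenizer emits, e.g. 0.35 for 35%
    corrected       -- whether the estimator divides by each model's real token count
    """
    old_tokens = 1000.0                        # arbitrary text length; cancels out
    new_tokens = old_tokens * (1.0 + token_inflation)

    # Toy assumption: time per token is proportional to parameter count,
    # with the old model's per-token cost normalised to 1.
    old_time = old_tokens * 1.0
    new_time = new_tokens * true_ratio

    if corrected:
        # Compare per-token cost using each model's own token count.
        return (new_time / new_tokens) / (old_time / old_tokens)
    # Naive: compare total time, implicitly assuming equal token counts.
    return new_time / old_time


if __name__ == "__main__":
    for corrected in (False, True):
        ratio = apparent_size_ratio(1.0, 0.35, corrected)
        print(f"corrected={corrected}: apparent ratio {ratio:.2f}")
```

Under this toy model, two equally sized models appear to differ by the full inflation factor unless the estimator accounts for the tokenizer change. The direction and size of the bias for a real estimator would depend on which proxy it uses and how it normalises, which the sketch does not attempt to model.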
The broader context touches on a genuine and underexplored tension in modern AI development, rooted in the assumption that model capability scales monotonically with parameter count. Techniques like mixture-of-experts architectures, improved training data curation, reinforcement learning from human feedback, and more efficient attention mechanisms have repeatedly demonstrated that smaller or equivalently sized models can outperform larger predecessors. Anthropic's own trajectory, from Claude 2 through the Sonnet and Opus families, reflects a consistent investment in training efficiency alongside scale. It is therefore not implausible in principle that a newer model could outperform its predecessor with fewer raw parameters. However, the specific claim here lacks corroboration, and the observed performance improvements in Opus 4.7 across coding, agentic workflows, and vision tasks are more consistent with architectural and training advances than with a straightforward parameter reduction.
The episode illustrates a recurring dynamic in the AI landscape: the hunger for transparency about closed models drives communities toward indirect inference methods whose reliability is difficult to assess. Anthropic's deliberate non-disclosure of architecture details, while commercially and competitively rational, creates an information vacuum that speculative research quickly fills. Whether or not the parameter estimates in the preprint prove accurate upon further scrutiny, the discussion they have generated underscores that benchmark performance — not raw scale — is increasingly the operative measure of model quality, and that the relationship between the two remains poorly understood outside of the laboratories producing these systems.