Opus 4.6 scores 78.3% on MRCR v2 at 1 million tokens, highest among frontier models

X · claudeai · March 13, 2026
Opus 4.6 achieves 78.3% accuracy on the MRCR v2 benchmark at 1 million tokens, the highest score among frontier models. The update also expands practical capabilities: developers can load entire codebases and large document sets in a single request, and each request can now include up to 600 images or PDF pages.

Detailed Analysis

Anthropic's Claude Opus 4.6 has scored 78.3% on the Multi-Round Coreference Resolution (MRCR) v2 benchmark at the 1 million token context length, the highest result among frontier models on this evaluation. The announcement also highlights an expansion of practical capabilities: users can now load entire codebases, large document sets, and long-running agent contexts within a single request. Media handling limits have also been raised, and the model now supports up to 600 images or PDF pages per request.

The MRCR benchmark is a demanding test of long-context coherence: it measures a model's ability to track and resolve references across extremely long documents or conversation histories. Scoring nearly 80% at the 1 million token mark is technically significant because performance on such benchmarks tends to degrade sharply as context length increases. The model must keep a consistent picture of entities, references, and relationships across what amounts to several full-length novels' worth of text at once. The announcement does not say what drives the result, but leading frontier models on this metric is consistent with architectural or training advances in long-range attention and retrieval fidelity.

The practical implications extend well beyond benchmark performance. Ingesting an entire codebase in a single context window is directly relevant to software engineering workflows: it enables holistic refactoring, cross-file dependency analysis, and large-scale debugging without fragmenting the work across multiple sessions. Similarly, the expansion to 600 images or PDF pages per request matters for document-intensive sectors such as legal, financial, and medical, where reviewing a large corpus in a single pass has historically been out of reach for AI tooling.
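As a rough illustration of what these two limits mean in practice, the sketch below estimates whether a source tree would fit in a 1 million token request and splits a long document into batches of at most 600 pages. It is not an official tool or API: the function names, the `reserve` headroom, and the four-characters-per-token heuristic are all assumptions made for this example, and a real tokenizer will give different counts.

```python
import math
from pathlib import Path

# Limits quoted in the announcement.
CONTEXT_BUDGET_TOKENS = 1_000_000
MAX_PAGES_PER_REQUEST = 600

# Assumption: a crude heuristic, not a real tokenizer. Actual token
# counts vary with language, code style, and the model's tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate derived from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def codebase_fits(root: str, suffixes=(".py", ".md"), reserve: int = 50_000):
    """Return (estimated_tokens, fits) for files under `root`.

    `reserve` (hypothetical default) leaves headroom for instructions
    and the model's reply.
    """
    total = 0
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total, total <= CONTEXT_BUDGET_TOKENS - reserve


def page_batches(total_pages: int, limit: int = MAX_PAGES_PER_REQUEST):
    """Split pages 1..total_pages into consecutive ranges of at most `limit`."""
    return [
        range(start, min(start + limit, total_pages + 1))
        for start in range(1, total_pages + 1, limit)
    ]
```

Under these assumptions, a 1,500-page filing would need three requests (600, 600, and 300 pages), while a codebase whose estimate lands under the budget could be sent whole rather than chunked.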

This announcement fits a broader competitive trend among frontier AI developers, including OpenAI, Google DeepMind, and Mistral, to extend context windows and improve long-context reliability as a primary axis of differentiation. Raw parameter counts and scores on short-context benchmarks have historically dominated model comparisons, but the industry has increasingly recognized that real-world enterprise use cases demand reliable, coherent reasoning across very large inputs. Anthropic's focus on long-context performance positions Claude as a strong candidate for agentic and retrieval-augmented generation (RAG) applications, where sustained context fidelity is critical.
