Detailed Analysis
Anthropic has developed BioMysteryBench, a bioinformatics benchmark comprising 99 questions built on real-world datasets and spanning whole genome sequencing, single-cell RNA-seq, ChIP-seq, metagenomics, proteomics, and metabolomics. The benchmark was designed to assess whether Claude models can function as genuine research collaborators on open-ended biological problems, rather than simply retrieving factual information. Its defining feature is method-agnostic evaluation: Claude is granted unrestricted tool access, including pip and conda package installation and queries to databases such as NCBI and Ensembl, and is graded solely on the accuracy of its final answers, not on the analytical path taken to reach them. This design mirrors real scientific workflows, where the route to a solution is often exploratory and nonlinear.
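To make answer-only grading concrete, here is a minimal Python sketch. It is an illustration, not Anthropic's grader: the `Submission` type, the `normalize` helper, and the choice of exact matching after normalization are all assumptions introduced here, since the benchmark's actual matching rules are not described in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical record of one model attempt (not a BioMysteryBench type)."""
    final_answer: str
    # Free-text notes on the route taken; recorded but never graded.
    trace: list[str] = field(default_factory=list)


def normalize(answer: str) -> str:
    """Assumed normalization: ignore case and extra whitespace."""
    return " ".join(answer.lower().split())


def grade(submission: Submission, reference_answer: str) -> bool:
    """Score only the final answer; the analytical path is deliberately ignored."""
    return normalize(submission.final_answer) == normalize(reference_answer)


# Two different analytical routes to the same answer receive the same score.
via_alignment = Submission("Escherichia coli", trace=["aligned reads to a reference"])
via_kmers = Submission("  escherichia  COLI ", trace=["compared k-mer profiles"])
same_score = grade(via_alignment, "Escherichia coli") == grade(via_kmers, "Escherichia coli")
```

The point of the design is visible in `grade`: it never reads `trace`, so any toolchain that reaches the right answer is credited equally.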
The results show Claude performing at or near expert level on most of the benchmark. Of the 99 questions, 76 were classified as solvable by human domain experts, and Claude Opus 4.6 achieved 77.4% accuracy on this subset. Claude Sonnet 4.6 and Mythos Preview improved on that baseline further. The most striking finding involves the remaining 23 "human-difficult" questions, which went unsolved even after up to five domain experts attempted them: Claude Mythos Preview resolved 30% of these problems, with Opus 4.6 and Sonnet 4.6 solving smaller but still noteworthy fractions. These results suggest that in certain constrained bioinformatics contexts, frontier Claude models do not merely match human experts but occasionally solve problems that the consulted experts could not.
A critical caveat in the results concerns reliability. On human-solvable questions, 86% of Claude's correct answers were stable across repeated runs, indicating robust reasoning on well-scoped problems. On human-difficult questions that figure dropped sharply to 44%, which means performance on the hardest tasks is more stochastic and less reproducible. Anthropic's analysis identifies specific strategies behind Claude's successes: integrating knowledge across papers, triangulating with multiple approaches when an initial method yields ambiguous results, and detecting patterns, such as direct sequence-level insights, that human researchers tend to overlook. External validation came from Genentech and Roche's CompBioBench, where Claude Opus 4.6 scored 81% overall accuracy and 69% on the hardest question tier, which suggests that the BioMysteryBench results are not artifacts of Anthropic's own evaluation design.
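The summary does not spell out how stability is computed. One plausible reading, sketched below in Python, is: among questions answered correctly in at least one run, what fraction were answered correctly in every run. That definition, the function name `stability_rate`, and the toy data are assumptions for illustration only; they are not Anthropic's published methodology or results.

```python
def stability_rate(runs_by_question: dict[str, list[bool]]) -> float:
    """Fraction of ever-solved questions that were solved in every repeated run.

    runs_by_question maps a question id to one correctness flag per run.
    This is an assumed definition, not the benchmark's documented metric.
    """
    solved_at_least_once = [r for r in runs_by_question.values() if any(r)]
    if not solved_at_least_once:
        return 0.0
    solved_every_time = [r for r in solved_at_least_once if all(r)]
    return len(solved_every_time) / len(solved_at_least_once)


# Made-up toy data: three questions, five runs each.
toy_runs = {
    "q1": [True, True, True, True, True],       # stable success
    "q2": [True, False, True, False, False],    # solved, but not reproducibly
    "q3": [False, False, False, False, False],  # never solved, so excluded
}
rate = stability_rate(toy_runs)  # 1 stable question out of 2 ever-solved
```

Under this reading, a lower rate on the human-difficult subset would mean that many of those successes depend on which run is sampled, which is the reproducibility concern the paragraph above raises.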
The broader significance of BioMysteryBench lies in what it represents for AI's role in scientific research. Traditional bioinformatics tools are purpose-built for specific pipeline stages, such as alignment, variant calling, and differential expression, and lack the capacity for open-ended reasoning across heterogeneous data types. By demonstrating that Claude can navigate messy, real-world biological datasets with expert-level or better accuracy, Anthropic positions its models as participants in the scientific process itself, not merely assistants to it. The distinction matters because a model that can independently hypothesize, test, and synthesize across biological domains could meaningfully compress timelines in areas such as genomics, drug discovery, and precision medicine.
This development fits into a broader pattern of frontier AI labs constructing domain-specific, expert-calibrated benchmarks to probe the outer limits of model capability. As general-purpose benchmarks like MMLU and HumanEval have become saturated by leading models, the field has shifted toward evaluations that are harder to overfit and more reflective of real-world professional complexity. BioMysteryBench is notable in that it incorporates human expert baselines directly, making performance comparisons interpretable and grounded. The reliability gap on human-difficult questions also serves as an honest signal about current limitations: even though Claude Mythos Preview solves some problems that human experts could not, the inconsistency of those results underscores that AI-driven biological research still requires human oversight, iterative validation, and the kind of scientific judgment that stabilizes inference under uncertainty.