Detailed Analysis
The METR Long Tasks benchmark graph, widely regarded as one of the most influential pieces of evidence cited in discussions about the pace of generative AI advancement, has come under serious scrutiny from multiple researchers who argue that its methodological failures render it fundamentally unreliable. Nathan Witkin, a research writer at NYU Stern's Tech and Society Lab, published a detailed critique in the Substack outlet Transformer, contending that the graph's errors are so numerous and so likely to compound one another in unpredictable ways that the appropriate response is not to attempt corrections but to abandon it entirely as an informational source. The critique has been amplified by cognitive scientist Gary Marcus and AAAI fellow Ernest Davis, who identified additional flaws beyond those Witkin catalogued.
The specific methodological errors Witkin identifies are varied in nature but collectively devastating. Among the most serious is that some human baseline data points were not empirically collected at all but were simply estimated by the authors, a foundational problem that undermines the benchmark's core comparative framework. Compounding this, when human performance was actually measured, participants were paid hourly, creating a direct financial incentive to work more slowly, systematically biasing human completion times upward and therefore making AI systems appear faster by comparison. The human benchmarker pool was drawn largely from METR employees' professional and social networks, raising serious concerns about representativeness and potential unconscious bias. Additionally, humans who were already familiar with the relevant codebases and tasks completed them 5 to 18 times faster than the unfamiliar workers METR used, meaning the human baseline almost certainly overstates how long a skilled professional who knows the codebase would take. The benchmark also suffered from train-test contamination, as some tasks had publicly available solutions that would likely have been absorbed into the training data of the very AI models being evaluated.
The significance of these errors extends well beyond a single flawed study. The METR graph has functioned as a cornerstone artifact in public and policy discourse about how rapidly AI capabilities are advancing, informing narratives about labor displacement, regulatory urgency, and the competitive trajectory of AI development. If the graph substantially overstates AI progress relative to human performance — which the documented flaws suggest is likely — then a meaningful portion of the prevailing discourse about AI's near-term impact has been built on a compromised empirical foundation. This matters especially because the graph's visual sophistication and apparent quantitative rigor lend it an air of scientific authority that casual readers are poorly positioned to challenge.
Witkin's critique also points to a broader systemic problem in AI research culture. He and the article's author both identify a pattern in which researchers aggressively generalize from small, unrepresentative samples, particularly from power users and technically sophisticated peers, while relying on benchmarks that have not been subjected to rigorous external scrutiny. The absence of formal peer review for many influential AI evaluation frameworks means that studies with significant design flaws can circulate widely, acquire citation authority, and shape both industry behavior and public policy before anyone with the expertise and incentive to challenge them does so. This dynamic is not unique to METR but reflects a structural vulnerability in a field that moves rapidly and places a higher premium on publication speed than on methodological rigor.
The episode serves as a pointed illustration of why scientific best practices — including transparent methodology, independent replication, representative sampling, and formal peer review — remain essential even in domains characterized by rapid technological change. The normalization of benchmark-driven AI evaluation without commensurate standards for how those benchmarks are constructed creates conditions in which superficially credible but fundamentally unreliable evidence can distort understanding at scale. As AI systems become increasingly consequential to economic and policy decisions, the quality of the evidence base underpinning those decisions becomes correspondingly more important to scrutinize.