AGI is Here. Anthropic Just Proved It.

Anthropic's report indicates that Claude now writes over 80% of the company's code and succeeds on 76% of open-ended problems (those with no clear specification), up from 26% six months earlier. The model can work autonomously for extended periods, with recent tasks lasting up to 12 hours, and it now picks the better next step on research projects more often than human researchers do. These capabilities suggest that practical artificial general intelligence, meaning systems that can solve novel problems independently when given directional guidance, has already arrived.

Detailed Analysis

Anthropic's report, titled "When AI builds itself," has drawn significant attention by revealing that more than 80% of the code the company ships is now written by Claude, its own AI system. The report presents internal performance metrics that track Claude's capabilities across a tiered taxonomy of coding tasks: trivial, routine, substantial, and open-ended. Open-ended tasks are those where no clear specification exists and even the engineers are uncertain what a successful outcome should look like. On these most difficult, ambiguous tasks, Claude's success rate climbed from 26% to 76% in just six months, a 50-percentage-point jump that represents a qualitative shift in what the system can accomplish independently. The report also documents that the maximum duration of autonomous, uninterrupted AI work has roughly doubled every four months: from four-minute tasks two years ago, to 90-minute tasks a year ago, to 12-hour tasks today, with one internal model reportedly working for 16 consecutive hours. Anthropic projects that by 2027, AI systems could handle tasks that would occupy a skilled human for weeks.
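As a rough check on the "doubling every four months" figure, the sketch below computes the doubling time implied by the three task durations quoted above. It assumes the data points are exactly 12 months apart, which the report summary only states loosely ("two years ago," "a year ago," "currently"); the function name is hypothetical and the calculation is illustrative, not Anthropic's own method.

```python
import math


def doubling_time_months(start_minutes, end_minutes, elapsed_months):
    """Months per doubling implied by growth from start to end over elapsed_months."""
    doublings = math.log2(end_minutes / start_minutes)
    return elapsed_months / doublings


# Task durations quoted in the article, in minutes.
# Assumption: the three points are spaced exactly 12 months apart.
two_years_ago = 4
one_year_ago = 90
today = 12 * 60

first_year = doubling_time_months(two_years_ago, one_year_ago, 12)
second_year = doubling_time_months(one_year_ago, today, 12)
overall = doubling_time_months(two_years_ago, today, 24)

print(f"first year:  {first_year:.1f} months per doubling")
print(f"second year: {second_year:.1f} months per doubling")
print(f"overall:     {overall:.1f} months per doubling")
```

Under that spacing assumption, the most recent year matches the four-month figure exactly (90 minutes to 12 hours is three doublings), while the earlier year and the two-year span imply somewhat faster doubling, so "roughly every four months" is best read as an approximation.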

A separate experiment described in the report involved freezing 129 real research projects at decision points and asking Claude which direction to pursue next, then comparing its choices against those made by human researchers. In November, Claude selected the better course of action 51% of the time; by April, that figure had risen to 64%. Anthropic also notes that its typical engineer is now shipping eight times as much code per day as in 2024, a productivity multiplier that — while not automatically indicative of quality — illustrates the pace at which AI-assisted output is scaling. These figures, drawn from Anthropic's own internal operations, are notable precisely because they are not benchmark scores on curated academic tests but empirical measurements from a production engineering environment.

The significance of these findings extends well beyond software development benchmarks. The shift from AI as a responsive tool (answering queries, summarizing text, generating drafts) to AI as an autonomous agent capable of extended, self-directed work on problems without predefined solutions marks a meaningful threshold in the technology's practical capabilities. The distinction between narrow AI, which excels at a fixed task, and something more general is often debated in abstract terms, but Anthropic's data grounds that debate in concrete operational outcomes. When a system can be handed an ambiguous mandate, design its own approach, execute for twelve hours, and succeed roughly three-quarters of the time, the traditional framing of AI as a narrow tool begins to strain.

Whether one applies the label "AGI" to this milestone depends heavily on definitional choices that remain genuinely contested in the research community. Anthropic itself does not use the term in the report. Consciousness, emotional experience, cross-domain transfer, and generalized reasoning are attributes that many researchers consider prerequisites for true general intelligence, and none of Anthropic's metrics speak directly to those questions. Nevertheless, the practical threshold being described, a system that can take an open-ended problem, direct its own research and experimentation, and return a working solution, closely matches the functional definition often used in engineering and product contexts as the bar for transformative AI capability.

These developments arrive amid a broader industry-wide acceleration in which frontier AI labs, including OpenAI, Google DeepMind, and Meta AI, are all racing to extend autonomous agent capabilities and context windows capable of supporting long-horizon tasks. Anthropic's willingness to publish detailed internal performance data distinguishes this report from the marketing-inflected benchmarks more commonly released publicly, and the specificity of the metrics — tied to real engineering workflows and decision-making experiments — lends them unusual evidential weight. The trajectory described, with capability doublings measured in months rather than years, suggests that the societal and economic implications of autonomous AI agents will arrive on a compressed timeline that may outpace the policy, regulatory, and organizational frameworks currently being developed to govern them.
