Detailed Analysis
ClaudePlaysPokemon, an ongoing public stream in which Anthropic's Claude AI plays the unmodified 1996 game Pokémon Red on a Game Boy emulator, has become one of the more revealing informal benchmarks for frontier model capability in agentic settings. The project originated as a personal learning exercise by David Hershey, an Anthropic Applied AI employee, who began it in June 2024 to develop hands-on familiarity with agent development. It went public in February 2025, coinciding with the release of Claude Sonnet 3.7, and while Anthropic does not own the project, the company subsidizes API costs and actively promotes the stream. The current run, operating on the newly released Opus 4.7, has reached five of eight badges at 15,779 steps. That is a meaningful acceleration compared to Opus 4.5, which stood at 48,000 steps with the same badge count before eventually clearing all eight badges and reaching Victory Road.
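To put the two step counts on a common scale, the comparison can be worked out directly. The snippet below is a back-of-the-envelope sketch, not an official metric: it assumes both figures are snapshots taken at the same five-badge milestone, and the 48,000 figure appears to be rounded, so the result is indicative only. The function and constant names are illustrative.

```python
# Step counts as quoted in the text; both are assumed to be
# snapshots taken at the five-badge mark.
OPUS_4_5_STEPS = 48_000  # appears to be a rounded figure
OPUS_4_7_STEPS = 15_779


def step_ratio(old_steps: int, new_steps: int) -> float:
    """How many times more steps the older run had used at the same milestone."""
    return old_steps / new_steps


def step_reduction(old_steps: int, new_steps: int) -> float:
    """Fraction of steps saved relative to the older run."""
    return 1 - new_steps / old_steps


ratio = step_ratio(OPUS_4_5_STEPS, OPUS_4_7_STEPS)
reduction = step_reduction(OPUS_4_5_STEPS, OPUS_4_7_STEPS)
print(f"ratio: {ratio:.2f}x, reduction: {reduction:.0%}")
# prints "ratio: 3.04x, reduction: 67%"
```

On these numbers, the newer run reached the same badge count in roughly a third of the steps.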
The deliberate minimalism of the harness is central to what makes the run analytically interesting. Claude receives only a screenshot, three tools (button presses, a pathfinding navigator, and a knowledge base), a walkability overlay derived from RAM reads, a secondary LLM that audits its notes file, and markdown notes it maintains itself. No walkthrough data is injected, and the system prompt explicitly instructs Claude to distrust its own Pokémon knowledge, since game details may diverge from training data. This lean scaffolding stands in contrast to competing streams: the Gemini Plays Pokémon harness is described as more elaborate, and the project's argument is that Claude's setup is a purer test of raw model cognition, not of scaffolding-assisted performance. The visible reasoning trace, which currently shows Claude verifying walls coordinate by coordinate to map maze geometry in real time, offers viewers an unusually transparent window into spatial reasoning under uncertainty.
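The shape of such a harness can be illustrated with a short sketch. This is not the project's actual code: the class name, method names, and the choice of breadth-first search for the navigator are all assumptions made for illustration, and the emulator, screenshot capture, and note-auditing LLM are left out entirely. It shows only how three tools might sit on top of a walkability grid.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]  # (row, col) on the walkability grid


@dataclass
class Harness:
    """Hypothetical stand-in for the three tools described in the text."""
    walkable: List[List[bool]]  # overlay grid: True = passable tile
    knowledge: Dict[str, str] = field(default_factory=dict)
    button_log: List[str] = field(default_factory=list)

    def press_buttons(self, buttons: List[str]) -> None:
        # Tool 1: a real harness would forward these to the emulator.
        self.button_log.extend(buttons)

    def navigate(self, start: Coord, goal: Coord) -> Optional[List[Coord]]:
        # Tool 2: breadth-first search over the overlay (assumed algorithm).
        rows, cols = len(self.walkable), len(self.walkable[0])
        parent: Dict[Coord, Optional[Coord]] = {start: None}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            if cur == goal:
                # Walk parent links back to the start, then reverse.
                path: List[Coord] = []
                node: Optional[Coord] = cur
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            r, c = cur
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                nr, nc = nxt
                in_bounds = 0 <= nr < rows and 0 <= nc < cols
                if in_bounds and self.walkable[nr][nc] and nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
        return None  # goal is unreachable on this overlay

    def update_knowledge(self, key: str, value: str) -> None:
        # Tool 3: persist a fact the model wants to remember.
        self.knowledge[key] = value


# A 3x3 map with a wall across most of the middle row.
grid = [
    [True, True, True],
    [False, False, True],
    [True, True, True],
]
harness = Harness(walkable=grid)
path = harness.navigate((0, 0), (2, 0))
```

Here the only route from the top-left to the bottom-left tile runs around the wall through the right-hand column, so `path` contains seven coordinates. Keeping the tools this thin is what leaves route selection and goal-setting to the model rather than to the scaffolding.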
The progression history across model generations serves as an informal capability ladder. Sonnet 3.5 could not exit the player's starting house. Sonnet 3.7 achieved three badges but famously spent over twelve hours navigating a single rock wall in Mt. Moon, an episode that went viral. Sonnet 4 through Sonnet 4.5 made no story progress whatsoever, stalling for months on the Team Rocket Hideout and Erika's Gym. Opus 4.5, released in November 2025, broke the logjam, cleared all eight badges, and reached Victory Road. Opus 4.7 is now pacing at a speed that suggests it may become the first Claude model to complete the game. The step-count compression from 4.5 to 4.7 at equivalent story milestones is being treated by observers as one of the cleanest capability-delta signals yet observed for the new flagship in a sustained agentic context.
The broader competitive landscape frames why this informal benchmark carries weight beyond entertainment. Google's Gemini 2.5 Pro completed Pokémon Blue in May 2025, and OpenAI's GPT-5 cleared the longer Pokémon Crystal in roughly 9,500 steps in August 2025. Claude has not yet beaten Pokémon Red, a deficit that is partly attributable to Hershey's deliberate restraint in harness design rather than solely to model capability. This creates a methodological tension common across AI evaluation: richer scaffolding produces more impressive results but obscures the contribution of the underlying model. The ClaudePlaysPokemon project's conscious choice to keep scaffolding minimal positions it as a counterpoint to more heavily engineered demonstrations, offering a longitudinal dataset, now spanning nearly two years of runs across multiple model generations, that more directly tracks the raw reasoning improvements Anthropic has shipped.
The project also illustrates how informal, community-facing experiments can accumulate genuine scientific signal over time. What began as one engineer's internal Slack updates has evolved into a multi-year series of controlled comparisons, with the harness design held largely constant across model generations so that the model is the main thing that changes. The combination of public streaming, visible reasoning traces, and a well-understood task domain gives the project an unusual ability to communicate capability improvements to both technical and non-technical audiences simultaneously, a function that formal benchmark leaderboards rarely achieve with comparable accessibility or narrative clarity.