Detailed Analysis
The "car wash test" has emerged as a deceptively simple benchmark for evaluating spatial and commonsense reasoning in large language models, and a Reddit post celebrating a Claude model's success with it has drawn renewed attention to the challenge. The test poses a scenario in which the user's car is at home and asks whether they should walk or drive the roughly 50 meters to a car wash. Answering correctly requires the model to recognize that a car must be physically present at a car wash to be washed, so it has to be driven there rather than left behind. The post's author declared the result evidence of AGI. The framing is tongue-in-cheek, but it underscores a broader point: passing what looks like a trivially simple logical puzzle still represents a meaningful milestone for AI systems.
Research into Claude's performance on this test reveals a striking inconsistency across model versions. Claude Opus 4.6 and Claude 3.7 Sonnet with extended thinking capabilities passed consistently, while Claude Sonnet 4.5, Claude Sonnet 4.6, Claude Opus 4, and Claude Opus 4.1 all failed, with some models erroneously advising users to walk to the car wash and then drive the car through — a physically incoherent sequence. Among 53 leading AI models evaluated across the broader landscape, only 11 passed on a single run, and a mere 5 maintained consistent correct answers across 10 repeated runs, indicating that even successful results often reflect stochastic flukes rather than robust reasoning.
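The gap between passing once and passing every time can be made concrete with a small tally. The following is a minimal sketch, not the methodology of the evaluation described above: the function name, the model labels, and the pass/fail values are all hypothetical and chosen only to illustrate the two counts.

```python
from typing import Dict, List


def summarize_runs(results: Dict[str, List[bool]]) -> Dict[str, int]:
    """Count models that pass their first run versus every recorded run.

    `results` maps a model label to its pass/fail outcomes, one per run.
    """
    passed_first = sum(1 for runs in results.values() if runs and runs[0])
    passed_every = sum(1 for runs in results.values() if runs and all(runs))
    return {
        "models": len(results),
        "passed_first_run": passed_first,
        "passed_every_run": passed_every,
    }


# Hypothetical, made-up outcomes for three placeholder models (10 runs each).
example = {
    "model_a": [True] * 10,           # passes every run
    "model_b": [True] + [False] * 9,  # passes once, then fails
    "model_c": [False] * 10,          # never passes
}

summary = summarize_runs(example)
print(summary)
```

In this toy data, a single-run evaluation would credit two models, while the repeated-run criterion credits only one, which is the kind of difference the 11-versus-5 figures point to.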
The car wash test's difficulty stems from a class of reasoning failure sometimes called "default pattern matching," in which models trained on vast corpora of language learn surface-level associations (walking short distances is efficient, the car wash is nearby, therefore walk) without grounding their responses in the physical constraints of the real world. The model must not only parse the question but also hold in mind an unstated spatial dependency: the object being acted upon (the car) must be co-located with the action environment (the car wash) before the action (washing) can occur. This type of implicit constraint tracking is precisely where current language models, even large frontier ones, frequently break down.
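The co-location dependency can be written out as an explicit precondition check. This is a minimal sketch under stated assumptions: the `State` class, the `travel` and `can_wash_car` helpers, and the location labels are hypothetical names introduced for illustration, and the code models the puzzle itself rather than how any language model reasons internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Where the person and the car currently are (hypothetical toy model)."""
    person_at: str
    car_at: str


def travel(state: State, mode: str, destination: str) -> State:
    """Return the new state after the person travels to `destination`."""
    if mode == "drive":
        # Driving moves the car along with the person.
        if state.person_at != state.car_at:
            raise ValueError("cannot drive a car that is somewhere else")
        return State(person_at=destination, car_at=destination)
    if mode == "walk":
        # Walking moves only the person; the car stays where it was.
        return State(person_at=destination, car_at=state.car_at)
    raise ValueError(f"unknown travel mode: {mode}")


def can_wash_car(state: State, car_wash: str = "car_wash") -> bool:
    """Precondition for washing: the car and the person are both at the car wash."""
    return state.car_at == car_wash and state.person_at == car_wash


start = State(person_at="home", car_at="home")
feasible = {
    mode: can_wash_car(travel(start, mode, "car_wash"))
    for mode in ("walk", "drive")
}
print(feasible)
```

Once the car's location is tracked as state, the walking option fails the precondition regardless of how short the distance is, which is exactly the step that a surface-level "short distance, so walk" association skips.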
The fact that reasoning-augmented models like Claude 3.7 Sonnet with thinking mode perform markedly better than their non-reasoning counterparts points to a meaningful distinction in how a model arrives at its answer. Extended chain-of-thought or "thinking" capabilities appear to give models the scaffolding needed to explicitly surface and check physical preconditions before committing to an answer, rather than pattern-matching directly to a response. This aligns with a broader trend in AI development in which inference-time compute, meaning more processing steps spent reasoning through a problem, is proving to be a powerful lever for improving performance on tasks that require multi-step logical coherence rather than factual recall.
For Anthropic specifically, the variable results across the Claude model family underscore that capability is not uniformly distributed even within a single organization's product line, and that model size alone does not predict commonsense spatial reasoning performance. The car wash test, though whimsical in framing, functions as a useful canary for a class of real-world agentic tasks (planning errands, coordinating logistics, operating robotic systems) where failing to track the physical state of objects could produce consequential errors. As Claude-based agents are increasingly deployed in tool-use contexts, the ability to reason reliably about the physical preconditions of actions becomes less a parlor-trick benchmark and more a practical safety and reliability requirement.