Opus has been handling my weekly grocery runs and was doing great. Then it bought me 40 heads of garlic

An individual entrusted an Opus-based AI agent with their credit card to manage weekly grocery shopping via MCP. For three months the orders proceeded normally, until the agent ordered 2 kilograms of garlic instead of 2 heads by accepting the default unit on the product page. The resulting 40 heads of garlic now fill 40 percent of the person's freezer, leading them to reconsider using a coding model for grocery shopping.

Detailed Analysis

A Reddit user running an autonomous shopping agent built on Anthropic's Claude Opus model encountered a real-world failure that illustrates the risk of long-running agentic AI systems operating without human oversight. The user had integrated Opus into a grocery-ordering workflow via the Model Context Protocol (MCP), granting the agent access to a payment card and tasking it with weekly grocery runs. For roughly three months, the system performed reliably, producing normal baskets at predictable costs. Then the agent ordered 2 kilograms of garlic instead of 2 heads, accepting the unit pre-selected on the product page rather than inferring the contextually appropriate quantity. The result was approximately 40 heads of garlic, enough to prompt the user to research garlic ice cream and garlic jam as desperate culinary interventions.

The specific failure mechanism is instructive. The agent did not hallucinate a product, fabricate a price, or misidentify an item. It defaulted to a pre-populated form field — a behavior the user astutely compares to Opus's well-documented tendency to reach for aesthetic defaults like purple gradients and glassmorphism when given open-ended design prompts. This suggests the error is not random but systematic: large language models trained on human-generated data absorb the statistical regularities of default selections, and when operating autonomously on transactional interfaces, they may reproduce those defaults without the friction a human user would naturally apply. The unit ambiguity between "2 kg" and "2 heads" is exactly the kind of low-salience detail that a human would catch through contextual common sense but that an agent optimizing for task completion may bypass.
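One way to reintroduce the friction a human shopper would apply is to compare what the agent was asked to buy against what actually landed in the cart before checkout. The following is a minimal Python sketch under stated assumptions, not the Reddit user's setup: `LineItem` and `cart_discrepancies` are hypothetical names, and it assumes the agent's tooling can read the cart back as a list of name, quantity, and unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    """One grocery line (hypothetical shape, not a real store API)."""
    name: str
    quantity: float
    unit: str  # e.g. "head", "kg", "each"


def cart_discrepancies(requested: list[LineItem], cart: list[LineItem]) -> list[str]:
    """Describe every requested item whose cart quantity or unit differs."""
    cart_by_name = {item.name: item for item in cart}
    problems = []
    for want in requested:
        got = cart_by_name.get(want.name)
        if got is None:
            problems.append(f"{want.name}: missing from cart")
        elif (got.quantity, got.unit) != (want.quantity, want.unit):
            # A silently accepted default unit ("kg" instead of "head") lands here.
            problems.append(
                f"{want.name}: requested {want.quantity:g} {want.unit}, "
                f"cart has {got.quantity:g} {got.unit}"
            )
    return problems


requested = [LineItem("garlic", 2, "head")]
cart = [LineItem("garlic", 2, "kg")]
print(cart_discrepancies(requested, cart))
```

A non-empty result would be a reasonable signal to pause the order rather than complete it. The check is deliberately strict: it does not try to convert between units, since "2 heads" and "2 kg" have no reliable conversion.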

The broader behavioral dynamic at play is what the user themselves identifies as the core problem: the erosion of human oversight through success. Three months of flawless execution created rational grounds for reducing review frequency, and that reduced review is precisely what allowed the error to pass undetected until delivery. This is a well-documented failure pattern in automated systems generally — the longer a system performs without incident, the more trust accumulates, and the larger the potential impact of any eventual failure. In agentic AI contexts, where the system is executing real-world transactions with financial consequences, this trust accumulation can be particularly consequential.

The user's observation that they were "using a coding model for grocery shopping" points to a meaningful design consideration in the deployment of frontier models. Claude Opus is positioned by Anthropic as its most capable model for complex reasoning and extended agentic tasks, and it has been widely adopted in developer communities for autonomous workflows. However, capability at sophisticated reasoning tasks does not automatically translate into reliable performance on mundane consumer interfaces with ambiguous unit fields — and in some respects, a highly capable model may be more prone to confident, decisive defaults that skip the hesitation a less assured system might exhibit. The post generated community engagement around the broader question of how many users are running unsupervised agentic shopping or purchasing workflows, suggesting the experience is not entirely isolated.

The incident reflects a wider tension in current AI deployment: agentic systems capable of autonomous real-world action are being stood up faster than the oversight practices and failure-recovery norms needed to govern them. MCP integrations and similar tool-use frameworks have dramatically lowered the barrier to granting AI models transactional access, while the discipline around monitoring, exception handling, and graceful degradation lags behind. Anthropic's own guidance on agentic use emphasizes human-in-the-loop checkpoints for consequential actions, a principle this case illustrates. The garlic surplus is trivial in cost; the same failure pattern applied to medication ordering, financial transfers, or supply-chain procurement would not be.
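A human-in-the-loop checkpoint for a purchasing agent can be as small as a gate between "cart is ready" and "order is placed". The sketch below is illustrative only: `checkout_with_checkpoint`, `approve`, and `place_order` are hypothetical names, and the two callables stand in for whatever notification channel and checkout tool a real deployment would use.

```python
from typing import Callable, Optional


def checkout_with_checkpoint(
    order_summary: str,
    approve: Callable[[str], bool],
    place_order: Callable[[], str],
) -> Optional[str]:
    """Place the order only if the human approver accepts the summary.

    Returns the order confirmation, or None if the order was held.
    """
    if not approve(order_summary):
        return None  # held for review; the checkout call is never made
    return place_order()


confirmation = checkout_with_checkpoint(
    "garlic: 2 kg (requested 2 head)",
    approve=lambda summary: False,     # stand-in for a real prompt or push notification
    place_order=lambda: "order-123",   # stand-in for the real checkout call
)
print(confirmation)
```

Because the payment step is only reachable through the approver, a run of uneventful weeks cannot quietly turn review off; loosening the gate would have to be an explicit decision.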
