Opus 4.7 is much better at running a vending machine business

Detailed Analysis

Anthropic's Claude Opus 4.7 has posted notably stronger results on Vending-Bench 2, a benchmark developed by Andon Labs that evaluates how well AI models manage a simulated vending machine operation across a full simulated year. The benchmark is deliberately demanding: models must maximize profit through supplier negotiation, dynamic pricing tied to weather, season, and day of week, precise inventory management, and sustained, consistent tool use without performance degradation over time. Opus 4.7's improvements over its predecessor, Opus 4.6, show up across several measurable dimensions: a 10–15% lift in task success rates on agentic benchmarks such as Factory Droids and Bolt's app-building suite, a 3x increase in resolved tasks on Rakuten-SWE-Bench for production workflows, and top-tier long-context performance with an overall score of 0.715. Its finance-task score rose from 0.767 to 0.813, a meaningful gain for profit-driven decisions such as evaluating supplier pricing structures.

The vending machine simulation is a particularly revealing stress test because it cannot be gamed by short-burst reasoning: success requires sustained operational coherence across hundreds of sequential decisions. Frontier models that fail tend to do so through specific behavioral patterns. GPT-5.1, for instance, has been observed over-trusting supplier claims, leading to costly errors such as prepaying unreliable vendors or accepting inflated per-unit costs of $2.40 per soda can. Opus 4.7 avoids these failure modes through enhanced long-horizon autonomy, improved memory for multi-day project continuity, and more reliable tool execution that does not degrade over extended sessions. The benchmark's linear trend data projects that top-performing models could net approximately $799 per month in additional simulated profit, with an R² of 0.96, meaning the fitted line accounts for about 96% of the variance in the underlying data.
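For readers unfamiliar with the statistic, the following is a minimal sketch of how a linear trend and its R² are computed with ordinary least squares. The `linear_fit` helper and the monthly profit figures are hypothetical and chosen only for illustration; they are not Vending-Bench 2 data, and nothing here reproduces how the benchmark's own trend line was fitted.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical simulated profit per month (made-up numbers, not benchmark results).
months = [1, 2, 3, 4, 5, 6]
profit = [100.0, 310.0, 480.0, 720.0, 890.0, 1110.0]

slope, intercept, r2 = linear_fit(months, profit)
print(f"slope per month: {slope:.1f}, intercept: {intercept:.1f}, R^2: {r2:.3f}")
```

An R² close to 1 says the points lie near a straight line. On its own it says nothing about why the trend exists or whether it will continue.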

The broader significance of this benchmark performance lies in what it reveals about the architectural and training priorities Anthropic has pursued with Opus 4.7. Rather than chasing narrow capability gains on static reasoning tasks, the model's improvements are concentrated in exactly the dimensions that agentic, real-world deployment demands: memory persistence, tool reliability, multimodal reasoning for analyzing charts and visual data, and robust performance at scale without long-context penalties up to 1 million tokens. Critically, Opus 4.7 keeps the same pricing as Opus 4.6 while delivering faster throughput, approximately 81 tokens per second versus 72, which makes the capability gains economically accessible for enterprise deployments involving document-heavy logistics workflows similar in structure to vending operations.
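To put the quoted figures in relative terms, here is a small arithmetic sketch. The `relative_gain` helper is a hypothetical name introduced for illustration, and the inputs are the throughput and finance-task numbers as reported in this article, not independently verified measurements.

```python
def relative_gain(old, new):
    """Fractional improvement of `new` over `old` (0.10 means 10% higher)."""
    return (new - old) / old


# Throughput as quoted in the article: about 72 vs 81 tokens per second.
throughput_gain = relative_gain(72, 81)

# Time to generate a fixed number of tokens scales with 1 / throughput,
# so the fractional time saved is 1 - old_rate / new_rate.
time_saved = 1 - 72 / 81

# Finance-task score as quoted in the article: 0.767 -> 0.813.
finance_gain = relative_gain(0.767, 0.813)

print(f"throughput gain: {throughput_gain:.1%}")
print(f"time saved per fixed-length output: {time_saved:.1%}")
print(f"finance-task score gain: {finance_gain:.1%}")
```

The throughput gain and the time saved differ because one is measured against the old rate and the other against the new one.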

This development connects to a broader competitive dynamic in AI, where the ability to sustain complex, multi-step autonomous workflows is rapidly emerging as a primary differentiator among frontier models. The shift from evaluating models on discrete question-answering tasks toward longitudinal simulations like Vending-Bench 2 reflects a maturation in how the industry understands practical AI capability. A model that can negotiate, adapt pricing dynamically, manage inventory across seasons, and avoid cascading errors over hundreds of operational steps is more useful in enterprise contexts than one that scores highly on static benchmarks but degrades under sustained agentic load. Anthropic's focus on these properties with Opus 4.7 signals a deliberate strategic alignment with the demands of production-grade autonomous systems rather than benchmark-optimized but operationally fragile models.
