Opus 4.7 is much better at running a vending machine business

Detailed Analysis

Claude Opus 4.7, Anthropic's latest flagship model, is reported to be a standout performer on Vending-Bench 2, a benchmark that tasks AI models with managing a simulated vending machine business for a year. The benchmark measures supplier negotiation, dynamic pricing in response to environmental factors such as weather and season, inventory management, and sustained profit maximization, all carried out through repeated, consistent tool use over an extended horizon. That positioning is consistent with leading scores in several adjacent capability areas: 77.3% on MCP-Atlas tool use (compared to 75.8% for Opus 4.6 and 68.1% for GPT-5.4), 87.6% on SWE-bench Verified for coding reliability, and a long-context consistency score of 0.715. Each of these bears on the sustained, multi-step orchestration that Vending-Bench 2 demands.

Vending-Bench 2 matters as an evaluation instrument because it is designed as a stress test for agentic AI systems operating in realistic environments where mistakes have consequences. Unlike static question-answer benchmarks, it requires a model to make sequential business decisions (negotiating with suppliers, adjusting prices, and managing stock) without its performance degrading over time. Competing models, particularly GPT-5.1 and GPT-5.4, have been specifically criticized within this benchmark for over-trusting suppliers and authorizing premature payments at inflated prices (e.g., $2.40 per can of soda or $6 per energy drink), behaviors that erode simulated profitability. Opus 4.7's documented improvements in resisting agentic loops and recovering from errors, roughly a 10–15% lift over Opus 4.6 in Factory Droids evaluations, suggest the model is meaningfully better at avoiding precisely these kinds of costly mistakes.
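To make the supplier failure mode concrete, here is a minimal Python sketch of the kind of check that would catch an inflated quote or a demand for payment before delivery. It is an illustration only, not part of Vending-Bench 2 or any model's actual behavior: the `Quote` type, the `review_quote` function, the 30% target margin, and the retail prices in the usage note are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    """A supplier offer. All fields are hypothetical, for illustration only."""
    item: str
    unit_cost: float           # price the supplier asks per unit
    pay_before_delivery: bool  # True if the supplier wants payment up front


def review_quote(quote: Quote, retail_price: float, min_margin: float = 0.30) -> str:
    """Return 'accept', 'negotiate', or 'reject' for a supplier quote.

    min_margin is an assumed target gross margin, not a benchmark parameter.
    """
    if retail_price <= 0:
        raise ValueError("retail_price must be positive")

    # Gross margin as a fraction of the retail price.
    margin = (retail_price - quote.unit_cost) / retail_price

    if margin < 0:
        # Selling this item would lose money on every unit.
        return "reject"
    if margin < min_margin or quote.pay_before_delivery:
        # Thin margin or up-front payment: push back before committing funds.
        return "negotiate"
    return "accept"
```

Under these assumptions, a $2.40 soda quote with up-front payment against an assumed $2.50 retail price comes back as "negotiate", and a $6 energy drink against an assumed $4.00 retail price comes back as "reject". A real agent would need far richer context, but the sketch shows why accepting such quotes at face value erodes profit.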

More broadly, Anthropic's progress here reflects an industry-wide shift toward evaluating AI on end-to-end business process execution rather than on isolated reasoning tasks. Benchmarks like Vending-Bench 2, SWE-bench Pro, and MCP-Atlas reflect a maturing understanding that real-world deployment requires models to maintain coherent, goal-directed behavior across long task sequences, use external tools reliably, and resist common failure modes such as instruction drift, premature action, and failure to recover from errors. Opus 4.7's lead across these metrics positions Anthropic competitively against both OpenAI's GPT line and Google's Gemini 3.1 Pro, which trails in tool-use benchmarks (73.9% on MCP-Atlas) despite solid coding scores.

The available sources do not report a direct, head-to-head Vending-Bench 2 score for Opus 4.7, so the model's superiority on this specific task is inferred from correlated benchmark performance rather than from explicitly reported results. This distinction matters when interpreting the original Reddit post's claim, which appears to rest on community testing, a screenshot of comparative results, or extrapolation from published benchmark data. Nevertheless, the convergence of evidence across tool-use, long-context, and agentic evaluation categories makes a strong circumstantial case that Opus 4.7 would meaningfully outperform prior-generation models and key competitors in the kind of sustained, operationally complex scenario that Vending-Bench 2 is designed to test.
