GPT 5.4 mini medium > Opus 4.7 high — Claude Learning Daily

A user attempted to create a CSS layout with fixed top and bottom bars that would maintain position regardless of content height or scrolling. Opus 4.7 failed to accomplish this task and produced a flawed result consuming 50% of the user's session limit. GPT 5.4 mini successfully completed the layout perfectly, demonstrating superior performance compared to Opus 4.7 despite being categorized similarly to Haiku.

Detailed Analysis

A Reddit user's post on r/Anthropic recounts a firsthand experience in which Claude Opus 4.7, Anthropic's flagship model, failed to correctly implement a CSS layout task that GPT-5.4 mini medium subsequently resolved without difficulty. The task involved restructuring a website to use a sticky top-bar/content/bottom-bar layout — one where header and footer remain anchored to the viewport's extreme top and bottom regardless of content height or scrollbar presence. Opus 4.7, described by the user as producing "a mess," consumed approximately 50% of the session's token limit before being abandoned. GPT-5.4 mini medium, a significantly cheaper and lower-tier model in OpenAI's lineup, then solved the same problem cleanly on the first attempt, prompting the user to conclude that Opus 4.7 is performing at the level of a budget model like Haiku rather than a premium flagship.

The anecdote sits in deliberate tension with aggregate benchmark data. According to research context, Claude Opus 4.7 substantially outperforms GPT-5.4 mini medium across formal coding and agentic evaluations — posting a Coding Index of 52.5 versus 37.5, and an Intelligence Index of 57.3 versus 37.7. On SWE-Bench Pro, a real-world software engineering benchmark involving GitHub issue resolution, Opus 4.7 scores 64.3% compared to 52.4% for GPT-5.4 nano, and Anthropic has specifically cited agentic coding dominance as a headline capability of the Opus 4.7 release. This divergence between benchmarks and lived user experience is a recurring tension in the AI model landscape: standardized evaluations measure performance across large, curated datasets, while individual users encounter edge cases, session-limit constraints, and task framings that can expose inconsistencies those benchmarks may not capture. A CSS sticky-layout problem, while relatively simple in concept, involves precise interaction between `position: sticky`, `height: 100vh`, `overflow`, and `flex` or `grid` properties — a domain where even highly capable models can produce confidently wrong output.

The operational and economic dimensions of the comparison are also worth noting. GPT-5.4 mini medium is priced at approximately $0.75 per million input tokens and $4.50 per million output tokens, while Claude Opus 4.7 costs $5.00 and $25.00 respectively — a roughly 5–6x price differential. GPT-5.4 mini medium also runs at nearly three times the output speed (173.1 tokens/second versus 62.2) and has a time-to-first-token roughly four times faster. For users working within session or token limits, these differences are practically significant: a model that arrives at a correct solution quickly and cheaply is functionally superior for many real-world tasks, regardless of where it places on benchmark leaderboards. The user's experience effectively illustrates a cost-efficiency argument that OpenAI has leaned into with its mini-tier offerings.

Broader industry context reinforces why this post resonates with the Anthropic community. The "flagship model underperforms a cheaper competitor on a real task" narrative has become a recurring flashpoint in AI model discourse, and Anthropic faces particular scrutiny because of Opus 4.7's premium pricing and its positioning as a best-in-class coding and agentic model. Anthropic's session-based usage limits for Opus on consumer-facing Claude.ai further exacerbate user frustration: when a difficult task consumes a disproportionate share of a fixed session budget without resolution, the perceived value proposition collapses sharply. The post also implicitly highlights the gap between agentic benchmark conditions — where models typically operate with full context, tool access, and no token budget pressure — and the constrained, conversational coding sessions that many users actually experience day to day.

The episode underscores a persistent challenge for frontier model developers: translating benchmark superiority into consistent, perceptible user-experience advantages across the full distribution of real-world tasks. Anthropic's Opus 4.7 may genuinely lead in agentic coding at scale, in long-context reasoning, and in enterprise workflows with structured tool use — but for the individual developer asking for CSS help, none of that infrastructure matters if the model fails on the first attempt and burns through available context. As competition between OpenAI, Anthropic, and Google intensifies, user trust is increasingly built not on aggregate benchmark rankings but on the reliability and efficiency of individual interactions, making anecdotes like this one unusually influential in shaping community perception.

Read original article →

Detailed Analysis

Don't Miss a Deploy