Detailed Analysis
A developer's side-by-side double pendulum benchmark exposed a telling behavioral difference between Claude and GPT-4o: the two models independently chose opposite conventions for measuring the angle theta in their simulations, and the resulting outputs diverged visibly within seconds of runtime. Claude measured theta from the upward vertical, so theta=0 corresponds to the pendulum arm pointing straight up. GPT-4o measured from the downward vertical, the more common textbook convention, in which theta=0 is the stable hanging equilibrium. Because both simulations were rendered through a single shared host drawing module that reads raw theta values without transformation, the difference in convention translated directly into a difference in visible pendulum behavior. The divergence was not an artifact of styling or rendering logic; it was a genuine difference in the underlying physics being computed.
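To make the two conventions concrete, here is a minimal sketch for a single arm. It is not code from either model or from the benchmark. The function names are hypothetical, and it assumes a y axis that points up and angles measured counterclockwise.

```python
import math


def bob_position_down(theta, length=1.0):
    """Bob (x, y) when theta is measured counterclockwise from the downward vertical."""
    return (length * math.sin(theta), -length * math.cos(theta))


def bob_position_up(theta, length=1.0):
    """Bob (x, y) when theta is measured counterclockwise from the upward vertical."""
    return (-length * math.sin(theta), length * math.cos(theta))


def up_to_down(theta_up):
    """Convert an up-vertical angle to the equivalent down-vertical angle."""
    return theta_up + math.pi


raw_theta = 0.0
# The same raw number describes two different physical states:
print(bob_position_down(raw_theta))  # bob hangs below the pivot
print(bob_position_up(raw_theta))    # bob is balanced above the pivot
```

A host that interprets every raw theta under one convention draws the second pendulum in the wrong place unless something like up_to_down is applied first.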
The benchmark in question, Physics Bench, is an open-source project designed specifically to eliminate rendering as a confounding variable in model comparisons. Each model implements only three functions — step, getInfo, and reset — while the host owns all drawing responsibilities. This architecture provides a strong guarantee: any visual difference between panels is attributable solely to differences in simulation logic. The discovery therefore carries methodological weight beyond a casual observation. It demonstrates that ambiguity in a prompt's specification — in this case, an unspecified angle convention — can propagate silently through an entire codebase, affecting equations of motion, initial conditions, and integration logic, while remaining internally consistent at every step. A unit test of the mathematics would pass for either model independently; the inconsistency only surfaces through comparative rendering.
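The three-function split can be sketched as follows. This is an illustration only: the article names step, getInfo, and reset but not their signatures, so the signatures, the SinglePendulum class, and host_read_theta are assumptions. A single pendulum stands in for the double pendulum to keep the example short.

```python
import math
from typing import Protocol


class Simulation(Protocol):
    """Hypothetical shape of the three functions each model implements."""
    def step(self, dt: float) -> None: ...
    def getInfo(self) -> dict: ...
    def reset(self) -> None: ...


class SinglePendulum:
    def __init__(self, theta0: float, from_up: bool, g: float = 9.81, length: float = 1.0):
        self.theta0 = theta0
        self.from_up = from_up
        self.g = g
        self.length = length
        self.reset()

    def reset(self) -> None:
        self.theta = self.theta0
        self.omega = 0.0

    def step(self, dt: float) -> None:
        # theta'' = -(g/L) sin(theta) when measured from the downward vertical,
        # theta'' = +(g/L) sin(theta) when measured from the upward vertical.
        sign = 1.0 if self.from_up else -1.0
        self.omega += sign * (self.g / self.length) * math.sin(self.theta) * dt
        self.theta += self.omega * dt  # semi-implicit Euler update

    def getInfo(self) -> dict:
        return {"theta": self.theta, "omega": self.omega}


def host_read_theta(sim: Simulation) -> float:
    """Stand-in for the host drawing module: uses raw theta, no transformation."""
    return sim.getInfo()["theta"]


down_sim = SinglePendulum(theta0=0.5, from_up=False)
up_sim = SinglePendulum(theta0=0.5, from_up=True)
for _ in range(1000):  # one simulated second at dt = 1 ms
    down_sim.step(0.001)
    up_sim.step(0.001)

print(host_read_theta(down_sim), host_read_theta(up_sim))
```

Both objects satisfy the same interface and start from the same raw theta, yet the down-vertical one swings gently about theta=0 while the up-vertical one falls away from it. The host has no way to tell which convention produced the number it is drawing.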
What distinguishes Claude's behavior in this instance is its transparency about the interpretive choice it made. Claude's generated code included explicit comments documenting its convention selection, making the decision traceable in the conversation inspector. GPT-4o, by contrast, silently adopted the down-vertical convention without annotation. Both choices are defensible — the up-vertical convention is standard in some control theory and robotics literature — and neither model produced incorrect physics for its chosen frame. The difference lies not in correctness but in explicitness: Claude surfaced an ambiguity that the prompt left unresolved, while GPT-4o absorbed it into the implementation without acknowledgment.
This episode illustrates a broader challenge in evaluating large language model outputs in technical domains: internal consistency is a necessary but insufficient criterion for correctness relative to a user's intent. A model can produce code that is mathematically rigorous, well-structured, and passes all reasonable unit tests, and yet diverge significantly from another model's equally valid interpretation of the same underspecified problem. In physics simulations, where coordinate conventions, sign conventions, and reference frames are frequently left implicit, this class of silent interpretive divergence has practical consequences. The gap between "correct given an assumed convention" and "correct relative to what was intended" is exactly the kind of error that rigorous benchmarking infrastructure, like the one this developer built, is designed to expose.
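One way to separate "correct given an assumed convention" from "correct relative to intent" is a comparative check that aligns conventions before comparing. The sketch below is a hypothetical test, not part of Physics Bench. It again uses a single pendulum, counterclockwise angles, and a semi-implicit Euler integrator as simplifying assumptions.

```python
import math


def simulate(theta0, convention, steps=1000, dt=0.001, g=9.81, length=1.0):
    """Integrate a single pendulum and return the final raw theta."""
    if convention not in ("down", "up"):
        raise ValueError("convention must be 'down' or 'up'")
    sign = -1.0 if convention == "down" else 1.0
    theta, omega = theta0, 0.0
    for _ in range(steps):
        omega += sign * (g / length) * math.sin(theta) * dt
        theta += omega * dt
    return theta


def to_down(theta, convention):
    """Express an angle in the down-vertical frame."""
    return theta if convention == "down" else theta + math.pi


theta_up = simulate(0.5, "up")
theta_down_raw = simulate(0.5, "down")                # same raw number, different physical state
theta_down_aligned = simulate(0.5 + math.pi, "down")  # same physical state as the up run

raw_gap = abs(theta_up - theta_down_raw)
aligned_gap = abs(to_down(theta_up, "up") - theta_down_aligned)
print(raw_gap, aligned_gap)
```

Under these assumptions the aligned gap stays near floating-point noise while the raw gap is large, which is the signature of a convention mismatch and not a physics bug.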
The broader implication for AI-assisted scientific and engineering work is that prompt specificity about conventions and reference frames matters considerably more than it might in prose generation tasks. Models operating on ambiguous physical specifications will make choices — and those choices, if undocumented, can be difficult to audit after the fact. Claude's habit of commenting on its convention selection represents a form of epistemic transparency that aids downstream debugging and review. As AI models are increasingly embedded in technical workflows where interpretive divergences carry real consequences, the ability to surface and annotate implicit assumptions may prove as valuable as the correctness of the computation itself.