Detailed Analysis
A Reddit user on r/Anthropic raises an experiential observation that Claude Haiku 4.5 exhibits greater behavioral consistency than its larger siblings, Sonnet 4.6 and Opus 4.7, particularly when extended thinking is enabled. The user's primary use case involves LaTeX markup tasks — requesting specific visual changes through code — and reports that Haiku 4.5 handles these more reliably than Sonnet or Opus, which the user describes as sometimes producing degraded or nearly identical output almost immediately. The post is self-deprecating in tone, but the underlying technical observation is coherent: the user is noticing a pattern where a smaller, faster model outperforms larger ones on a specific, structured task type when extended thinking is activated.
This observation aligns partially with Anthropic's own documented design philosophy for Haiku 4.5. Anthropic has positioned the model around "consistent tool reliability" and low-latency performance in agentic workflows, and the model includes explicit context-window awareness training — meaning it has more precise knowledge of its own memory state during a session. For structured, iterative tasks like LaTeX editing, where precise instruction-following matters more than broad reasoning, these architectural emphases may translate into the kind of behavioral repeatability the user is experiencing. Haiku 4.5 also achieved faster time-to-success in 85% of evaluations and completed tasks 34% faster on average in internal testing, suggesting its consistency advantage may be most pronounced in focused, lower-complexity task loops rather than sprawling multi-step reasoning chains.
The user's frustration with Sonnet and Opus's "adaptive thinking" behavior during extended thinking sessions points to a known trade-off in frontier model design. Larger models with more sophisticated reasoning modes can exhibit what practitioners sometimes call over-deliberation — where the model restructures its approach mid-task in ways that disrupt predictable output formatting or instruction fidelity. This is not a bug per se, but a side effect of models trained to handle ambiguity and complexity dynamically. For constrained, syntax-sensitive tasks like LaTeX, this dynamism can be counterproductive, making a leaner, faster model like Haiku 4.5 functionally superior for that specific workflow even if it ranks lower in aggregate quality benchmarks.
Broader industry trends reinforce this pattern. As AI labs release model families with tiered capability levels, users increasingly discover that the largest model is not always the most appropriate tool for a given task. Internal Anthropic testing showed a 51.4% preference for Sonnet 4.5 over Haiku 4.5 in general quality evaluations, and third-party assessments note that complex multi-file work still favors Sonnet and GPT-5. However, these aggregate preference scores can mask meaningful task-specific reversals. The Reddit user's experience is a practical example of a growing user sophistication: understanding which model tier fits a specific task's demands, and recognizing that consistency and correctness on a constrained problem are not always correlated with raw model scale or capability ranking.
Read original article →