Sonnet 4.6 is now completely unusable

A Pro plan subscriber reported that Sonnet 4.6 failed to generate a basic swimlane diagram: it hit an answer length limit after 10 minutes of processing and produced incomplete output. Further attempts consumed session tokens while still delivering unusable results. The post characterized the model as a significant regression in capability and threatened to cancel the subscription if the performance issues are not addressed.

Detailed Analysis

A frustrated Claude Pro plan subscriber posted to the r/Anthropic subreddit expressing significant dissatisfaction with Claude Sonnet 4.6, alleging that the model failed to complete what the user described as a straightforward task: generating a business swimlane diagram. The post, which carries an emotionally charged tone, describes an experience in which the model spent approximately ten minutes in a "thinking" phase before terminating with an "answer length limit" error, producing only partial and unusable output. After three subsequent iteration attempts, the user's session token allocation was exhausted without a satisfactory result being delivered.

The complaint touches on two distinct but related technical issues that have frustrated users of large language models more broadly: output length constraints and per-session token usage caps. Swimlane diagrams rendered in text-based markup languages such as Mermaid or similar diagramming syntaxes can be moderately verbose, but they are not inherently complex by the standards of modern frontier models. That the model reportedly engaged in extended reasoning before hitting a length ceiling suggests a possible mismatch between the model's internal processing behavior and its configured output limits, which would point to a configuration or tuning issue rather than a fundamental capability failure. The user's characterization of the problem as a "regression" implies a perceived decline in capability relative to earlier Claude versions. That framing is notable, because model updates do not always produce uniform improvements across all task types.
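The suspected mismatch can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes that reasoning tokens and the visible answer draw on a single shared output cap, which the post does not confirm, and every number and function name in it is hypothetical rather than taken from Anthropic's actual configuration.

```python
def answer_budget(output_cap: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer, ASSUMING reasoning shares the cap."""
    return max(0, output_cap - reasoning_tokens)


def is_truncated(output_cap: int, reasoning_tokens: int, answer_tokens: int) -> bool:
    """True if the answer would not fit in what remains after reasoning."""
    return answer_tokens > answer_budget(output_cap, reasoning_tokens)


# Hypothetical figures, chosen only to illustrate the effect.
OUTPUT_CAP = 8_000      # assumed per-response output limit
DIAGRAM_TOKENS = 1_500  # assumed size of a moderately verbose diagram

# Short reasoning leaves plenty of room for the diagram.
print(is_truncated(OUTPUT_CAP, 2_000, DIAGRAM_TOKENS))  # False

# Long reasoning consumes most of the cap, so the same diagram is cut off.
print(is_truncated(OUTPUT_CAP, 7_500, DIAGRAM_TOKENS))  # True
```

Under that assumption, the diagram itself never needs to be large for the length limit to trigger; the time spent reasoning is what uses up the budget.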

The broader context here matters considerably. Anthropic has positioned Claude's Pro tier as a premium offering for power users, and expectations around reliability and throughput are correspondingly high. When paying subscribers encounter hard limits on basic generative tasks, it erodes trust in the product's value proposition. Token and output constraints are engineering decisions that balance compute cost against user experience, but when those constraints surface visibly and disruptively mid-task, they create friction that feels punitive rather than protective — particularly to users who are not informed in advance of what those limits mean in practice.

This type of user complaint also reflects a recurring tension in the commercial AI space between rapid model iteration and quality consistency. As companies like Anthropic ship new model versions (in this case, a release labeled "4.6"), the risk grows that a regression or behavioral shift will degrade performance on specific task categories. Users who have built workflows around a given model's behavior can find those workflows broken by updates that were intended to improve aggregate performance. The absence of granular patch notes or changelogs explaining what changed between model versions compounds the frustration, leaving users with neither an explanation nor a workaround.

The post ultimately illustrates a fundamental challenge Anthropic and peer organizations face as they scale their consumer products: technical capability at the frontier does not automatically translate to reliable, consistent user experience across the full range of subscriber use cases. Managing output constraints, communicating limits transparently, and ensuring that incremental model updates do not silently degrade performance on common professional tasks are product and engineering concerns as much as research ones. User retention at the Pro tier depends heavily on the perception that the product is dependable — a perception this kind of experience directly undermines.
