Does anyone have a List Of Questions AI Confidently gets incorrect?

A discussion post requests examples of questions that AI models answer with confident but incorrect reasoning, citing cases like counting letters in words and comparing the density of gold and uranium. The poster notes that frontier AI models frequently provide well-structured explanations that lead to wrong conclusions, despite seemingly logical approaches. The post seeks to compile a comprehensive list of such questions where modern AI systems consistently fail.

Detailed Analysis

A Reddit thread in the r/ClaudeAI community has surfaced a recurring concern among AI users: frontier language models often answer certain questions incorrectly while presenting confident, well-structured reasoning. The original poster highlights several canonical examples of this failure mode, including the now-famous "how many R's in strawberry" prompt, a deceptively simple question about whether to walk or drive 10 meters to a car wash, a query about how many days of the week contain the letter "d" (the answer being all seven), and a newer entry involving a false-premise question about gold and uranium density. In the gold-uranium example, the question asserts that gold is less dense than uranium. That premise is false, since gold (~19.3 g/cm³) is marginally denser than uranium (~19.1 g/cm³), yet many AI models reportedly produce elaborate, confident explanations validating the incorrect framing rather than rejecting the premise outright.
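The ground truth for the letter-counting and density examples above is easy to check mechanically. The following is a minimal Python sketch, assuming the approximate density figures quoted above; the function and variable names are illustrative and do not come from the thread.

```python
# Ground-truth checks for the examples cited in the thread.
# Densities are the approximate values quoted in the text (g/cm^3).

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

DENSITY_G_PER_CM3 = {"gold": 19.3, "uranium": 19.1}


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def days_containing(letter: str) -> list:
    """Return the days of the week whose names contain the given letter."""
    return [day for day in DAYS if letter.lower() in day.lower()]


def denser(a: str, b: str) -> str:
    """Return whichever of two materials has the higher quoted density."""
    return a if DENSITY_G_PER_CM3[a] > DENSITY_G_PER_CM3[b] else b


r_count = count_letter("strawberry", "r")
d_days = days_containing("d")
denser_metal = denser("gold", "uranium")

print(r_count)        # 3
print(len(d_days))    # 7 (every name ends in "day")
print(denser_metal)   # gold
```

Each of these is a one-line computation for a program, which is part of why the same questions make useful probes for a language model.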

These failure cases illuminate two distinct but related weaknesses in large language models. The first is presupposition acceptance, in which models fail to challenge the false assumptions embedded in a question and instead construct plausible-sounding explanations that rationalize the incorrect framing. The gold-uranium example is a textbook instance: rather than identifying that the question's premise is factually inverted, models generate authoritative-sounding chemistry explanations for a situation that does not exist. The second failure mode involves character-level and spatial reasoning tasks, such as counting letters in words or evaluating obvious physical proximity. A commonly cited reason for the character-level failures is that transformer-based models process text as subword tokens rather than individual letters, so spelling is not directly visible to them. These are not edge cases; they are systematic blind spots.

The broader significance of this community-driven inquiry lies in its implicit critique of how AI confidence is communicated to users. When a model provides a well-formatted, articulate, and logically coherent response to a flawed question, users have few surface-level signals to distinguish a correct answer from a sophisticated fabrication. This is the core danger of what is often described as confident hallucination: the model's fluency actively undermines the user's ability to apply skepticism. The community thread represents a form of informal adversarial testing, crowdsourcing the discovery of failure modes that formal benchmarks often miss because they tend to reward performance on well-formed questions rather than probing responses to misleading or false-premise prompts.
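One way to make this kind of informal adversarial testing repeatable is a small probe harness. The sketch below is an illustration under stated assumptions, not the thread's method: `ask` stands in for whatever function sends a prompt to a model and returns its text reply, `stub_model` is a hypothetical placeholder for a real model call, and the keyword-based checks are deliberately crude substitutes for human review.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Probe:
    """One adversarial prompt plus a predicate that accepts a correct reply."""
    name: str
    prompt: str
    is_correct: Callable[[str], bool]


# Hypothetical probes modeled on the thread's examples. The keyword checks
# are crude stand-ins; real grading would need a human or a stricter parser.
PROBES: List[Probe] = [
    Probe(
        name="strawberry_r_count",
        prompt="How many times does the letter r appear in 'strawberry'?",
        is_correct=lambda reply: "3" in reply or "three" in reply.lower(),
    ),
    Probe(
        name="gold_uranium_false_premise",
        prompt="Why is gold less dense than uranium?",
        # A correct reply must push back on the premise.
        is_correct=lambda reply: "gold is denser" in reply.lower()
        or "premise" in reply.lower(),
    ),
]


def run_probes(ask: Callable[[str], str], probes: List[Probe]) -> dict:
    """Send each probe to the model and record whether the reply passed."""
    return {p.name: p.is_correct(ask(p.prompt)) for p in probes}


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; it accepts every premise it is given."""
    if "strawberry" in prompt:
        return "The letter r appears 2 times."
    return "Gold is less dense because its atoms pack less tightly."


results = run_probes(stub_model, PROBES)
print(results)
```

With the stub above, both probes fail, which is the intended demonstration: a fluent reply that accepts the framing is flagged only because the check encodes the correct answer in advance.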

This phenomenon connects to a well-documented tension in the development of frontier AI systems: the optimization for fluency and coherence can come at the cost of epistemic humility. Models trained on vast corpora learn that authoritative, complete-sounding answers are rewarded, which can discourage the kind of hedging or premise-challenging behavior that would actually be more accurate in ambiguous or adversarial scenarios. Anthropic and other AI developers have made explicit efforts to address this through constitutional AI approaches and reinforcement learning from human feedback that rewards calibrated uncertainty, but the examples catalogued in this thread suggest the problem remains meaningfully unsolved across frontier models. The persistence of failures on questions like the strawberry "R" count — a problem widely documented for years — indicates that scaling alone does not reliably eliminate these systematic reasoning gaps.

The crowdsourced nature of this thread also reflects a growing genre of community-driven AI evaluation that sits alongside, and sometimes ahead of, formal academic benchmarking. As AI systems are deployed across higher-stakes domains, the failure modes identified through informal adversarial prompting by engaged user communities carry real practical weight. A model that confidently explains why gold is less dense than uranium using plausible-sounding chemistry represents not just a trivia error, but a demonstration of the conditions under which AI outputs can mislead users who lack the domain knowledge to independently verify claims — which is precisely the situation in which users are most likely to rely on AI in the first place.
