Show HN: Another experiment with an Erdos problem and LLMs

A coder tested multiple large language models on an unsolved Erdős problem about the density of sets of multiples. DeepSeek generated a proof after extended thinking time, and the proof was then iteratively refined through reviews by Opus and Gemini. The resulting exposition appeared to be a clean statement of a known Davenport–Erdős corollary rather than a novel mathematical discovery. The experiment suggests that LLMs used collaboratively can produce plausible-looking mathematical exposition, though the correctness of the output remains unverified.

Detailed Analysis

A self-described coder with no mathematics background published a Hacker News experiment in which several large language models were orchestrated in a review loop to attempt a proof related to Erdős problem #691, a number-theory question concerning the density of sets of multiples. The poster used DeepSeek in Expert mode to generate an initial proof after roughly 46 minutes of computation, routed that output through Claude's Opus model for rapid peer review, fed Opus's critique back into DeepSeek for refinement, and finally submitted the result to Google's Gemini 1.5 Pro Preview for scrutiny. The resulting output, described by Opus as a "clean exposition of a Davenport–Erdős corollary," does not claim to establish a novel result. It asserts that the upper and lower densities of a set of multiples coincide and that density-1 conditions follow automatically without additional constraints. The author openly acknowledges that this finding may already be known or may simply be wrong.
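For readers unfamiliar with the objects involved, the following is a minimal Python sketch of what "density of a set of multiples" means in the easy case of a finite generating set A. It is an illustration only and is not taken from the poster's proof. For finite A the density always exists and can be computed exactly by inclusion-exclusion over least common multiples; the hard questions, including the one in problem #691, concern infinite A, which this sketch does not address. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import lcm  # multi-argument lcm requires Python 3.9+


def density_of_multiples(generators):
    """Exact natural density of M(A) = {n : some a in A divides n}, for finite A.

    Inclusion-exclusion: sum over non-empty subsets S of A of
    (-1)^(|S|+1) / lcm(S).
    """
    gens = sorted(set(generators))
    total = Fraction(0)
    for size in range(1, len(gens) + 1):
        sign = 1 if size % 2 == 1 else -1
        for subset in combinations(gens, size):
            total += sign * Fraction(1, lcm(*subset))
    return total


def empirical_density(generators, limit):
    """Fraction of the integers 1..limit divisible by at least one generator."""
    count = sum(
        1 for n in range(1, limit + 1) if any(n % a == 0 for a in generators)
    )
    return Fraction(count, limit)


A = [2, 3, 5]
exact = density_of_multiples(A)      # Fraction(11, 15)
approx = empirical_density(A, 30)    # also 11/15, since 30 = lcm(2, 3, 5)
```

Because the pattern of multiples of a finite set repeats with period lcm(A), counting over one full period reproduces the exact density. No such periodicity is available for infinite A, which is why results of the Davenport–Erdős type are needed there.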

The experiment illustrates an emerging pattern of human-orchestrated, multi-model collaboration on technically demanding problems. No single AI system serves as sole reasoner; instead, models are assigned distinct epistemic roles: generation, critique, and adversarial review. Claude Opus functioned specifically as a fast-turnaround validator, reportedly completing its reviews in seconds, while DeepSeek handled the computationally intensive proof-drafting and Gemini surfaced gaps that Opus had missed. This division of labor mirrors a broader trend documented in more systematic AI-math efforts of 2025–2026, including a session in which Claude Opus 4.6, operating with 131 autonomous subagents, generated thousands of tests and Lean proof sketches against the Erdős problems database. That effort likewise required human-verified adversarial testing to catch substantive errors such as off-by-factor bugs in random sampling.
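The generate, critique, and review roles described above can be sketched as a small loop. This is a hypothetical outline, not the poster's actual tooling: the post describes a manual workflow, and the names `review_loop`, `stub_generate`, and `stub_reviewer` are invented here. The model calls are represented by plain Python callables so the sketch runs without any vendor API.

```python
from typing import Callable, List, Tuple

# Hypothetical role signatures: a generator drafts a proof given the problem
# and the outstanding critiques; a reviewer returns a list of objections.
Generator = Callable[[str, List[str]], str]
Reviewer = Callable[[str], List[str]]


def review_loop(
    problem: str,
    generate: Generator,
    reviewers: List[Reviewer],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Draft, collect objections from every reviewer, redraft, repeat.

    Stops when no reviewer objects or after max_rounds review rounds.
    Returns the final draft and the history of all drafts.
    """
    draft = generate(problem, [])
    history = [draft]
    for _ in range(max_rounds):
        critiques = [c for review in reviewers for c in review(draft)]
        if not critiques:
            break
        draft = generate(problem, critiques)
        history.append(draft)
    return draft, history


# Stand-ins for real models, used only to show the control flow.
def stub_generate(problem: str, critiques: List[str]) -> str:
    fixes = "".join(f" [fixed: {c}]" for c in critiques)
    return f"Proof sketch for {problem}.{fixes}"


def stub_reviewer(draft: str) -> List[str]:
    return [] if "[fixed: gap in step 2]" in draft else ["gap in step 2"]


final, history = review_loop("toy problem", stub_generate, [stub_reviewer])
```

Note that the loop terminates when the reviewers stop objecting, which is a statement about reviewer agreement and not about mathematical correctness.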

The broader context of AI engagement with Erdős problems is one of genuine but contested progress. AlphaEvolve offered a partial solution to problem 507 in late 2025, representing what was then characterized as the first LLM-adjacent resolution of an Erdős conjecture, and problem 728 later received the first credited novel AI solution documented on erdosproblems.com. However, this trajectory has also attracted significant skepticism: claims that GPT-5.2 solved problems including #729 were subsequently debunked as instances of models regurgitating already-solved cases absorbed during training rather than generating original mathematical reasoning, a failure mode that Gary Marcus and others have compared to OpenAI's 2019 Rubik's cube episode. The difficulty of distinguishing genuine inference from sophisticated retrieval remains a central and unresolved challenge in evaluating LLM mathematical performance.

The poster's own takeaways — notably that DeepSeek impressed sufficiently to consider switching from Anthropic — reflect a competitive dynamic in the frontier model landscape that is itself newsworthy. Claude's role in this experiment was narrowly that of a reviewer rather than a primary reasoner, and the author's commentary positions Opus as competent and fast but not uniquely capable relative to DeepSeek for extended mathematical generation tasks. This kind of informal, capability-benchmarking experimentation by technically sophisticated but domain-non-expert users is becoming an increasingly common form of stress-testing for frontier models, and the results — even when mathematically inconclusive — generate real signal about relative model strengths across reasoning, speed, and cross-model feedback integration.

What the experiment ultimately demonstrates most clearly is that the tooling and workflows for human-AI mathematical collaboration are maturing faster than the ability to verify their outputs. The coder's candid admission of having no idea whether the generated proof is correct or useful is not merely a disclaimer but a precise description of the epistemic state of this entire class of experiment. Until robust, autonomous proof-verification infrastructure — of the kind that Lean formalization partially addresses — becomes standard practice in these workflows, the gap between impressive-looking mathematical output and validated mathematical knowledge will remain wide, and the field will continue to generate results that are simultaneously technically interesting and epistemically uncertain.
