Hot take: LLMs can’t do statistics. Not really.

The post argues that large language models cannot perform statistical inference or symbolic computation at scale because of fundamental architectural limitations. Permutation testing is the demonstration: asked for a permutation-test result, a model approximates or hallucinates the output rather than computing it. The author has built a framework for permutation testing on indicators that performs the actual computation and returns real p-values, and is now integrating it into an MCP layer so that a language model can call it as a tool instead of attempting the calculation itself. The author reports that designing the MCP interface for quantitative workloads is the hard part of the implementation.

Detailed Analysis

A software developer and quantitative researcher has published an argument that large language models are fundamentally incapable of performing rigorous statistical inference, framing this not as a temporary shortcoming but as an architectural constraint inherent to how LLMs process and generate information. The author uses permutation testing as a concrete illustration of where LLMs fail. Permutation testing is a resampling-based method: the test statistic for an indicator is recomputed over many rearrangements of the data, and the p-value is the fraction of rearrangements that produce a statistic at least as extreme as the observed one. That p-value is valid only if every recomputation is actually carried out, deterministically and with exact arithmetic, over a sample space that can be enormous. An LLM instead pattern-matches toward a plausible-sounding answer, which is why the author considers it structurally unable to provide such results reliably.
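To make the method concrete, here is a minimal sketch of an exact two-sample permutation test on the difference in means, using only the Python standard library. It is not the author's framework, which the post does not show; the function name and the choice of test statistic are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for the difference in means.

    Illustrative only: enumerates every relabeling, so it is practical
    just for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    count = 0
    # Enumerate every way of choosing which pooled values are labeled "A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so float rounding does not drop exact ties.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


if __name__ == "__main__":
    # 20 possible relabelings; only 2 are as extreme as the observed split.
    print(exact_permutation_pvalue([1, 2, 3], [4, 5, 6]))  # 0.1
```

The number of relabelings grows combinatorially with sample size, so larger problems are usually handled by sampling random permutations instead of enumerating all of them. Either way the p-value comes from counting, not from estimating what the count would probably be.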

The author's proposed solution is not to improve the LLM itself but to decouple the computation entirely, building a dedicated framework for permutation testing that performs real symbolic and numerical computation, then exposing that framework to an LLM through an MCP (Model Context Protocol) layer. MCP, a protocol developed to allow AI models to invoke external tools and data sources in a standardized way, serves here as the interface between the LLM's natural language reasoning capabilities and a rigorous computational backend. This architectural separation acknowledges a growing consensus in applied AI development: LLMs are best understood as orchestration and reasoning layers, not as compute engines for precise mathematics.
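The separation described above can be sketched without any MCP library. The example below handles a single JSON-RPC message shaped like an MCP `tools/call` request and routes it to a small Monte Carlo permutation routine. The tool name `permutation_test`, the argument names, and the payload fields are hypothetical, and the message shape is simplified from the MCP specification. A real server would use an MCP SDK and handle initialization, tool listing, and transport.

```python
import json
import random
from statistics import mean


def permutation_pvalue(a, b, n_resamples=2000, seed=0):
    """Monte Carlo permutation p-value for the absolute difference in means."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical registry: tool name -> the function that does the real work.
TOOLS = {"permutation_test": permutation_pvalue}


def _reply(req_id, key, body):
    return json.dumps({"jsonrpc": "2.0", "id": req_id, key: body})


def handle_request(raw):
    """Handle one JSON-RPC message shaped like an MCP `tools/call` request."""
    req = json.loads(raw)
    req_id, params = req.get("id"), req.get("params", {})
    if req.get("method") != "tools/call":
        return _reply(req_id, "error", {"code": -32601, "message": "method not found"})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        return _reply(req_id, "error", {"code": -32602, "message": "unknown tool"})
    try:
        payload, is_error = {"p_value": tool(**params.get("arguments", {}))}, False
    except (TypeError, ValueError) as exc:
        # Bad arguments come back as a tool-level error the model can read.
        payload, is_error = {"error": str(exc)}, True
    content = [{"type": "text", "text": json.dumps(payload)}]
    return _reply(req_id, "result", {"content": content, "isError": is_error})
```

In this arrangement the model only chooses the tool and supplies arguments. The p-value in the response is produced by the backend, so the model has nothing to estimate.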

The broader point aligns with an active debate in the AI research and engineering community about what LLMs are actually good for and where they produce confident but unreliable outputs. Statistical inference, including hypothesis testing, exact p-value computation, and combinatorial resampling, sits firmly in the latter category. Practitioner reports describe LLMs fabricating statistical results, misapplying test assumptions, or producing outputs that superficially resemble correct answers while being numerically wrong. On the author's account this is not a failure of scale or training data volume. It follows from the token-prediction mechanism underlying current transformer-based models, which predicts the next token of an answer rather than executing the arithmetic behind it and has no native representation of numerical precision or exact symbolic state.

The tool-use paradigm the author is pursuing represents a maturing design pattern in production AI systems. Rather than expecting frontier models to internalize domains requiring exact computation — statistics, formal logic, symbolic algebra — engineers are increasingly building hybrid architectures where LLMs handle language, intent parsing, and workflow coordination while dedicated engines handle computation. Frameworks like LangChain, tool-calling APIs from OpenAI and Anthropic, and protocols like MCP are all infrastructure responses to this same recognition. The specific challenge the author identifies in the MCP interface design — how to structure the tool contract so the LLM calls it correctly and interprets results reliably — is a genuine open problem, particularly for quantitative workloads where input schemas and output semantics are complex and domain-specific.
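One way to picture the tool-contract problem is a tool definition with an explicit input schema and a check that runs before any computation. The sketch below is an assumption-laden illustration, not the author's design. The tool name, field names, and limits are hypothetical, and the validator covers only the small subset of JSON Schema used here. The `name`, `description`, and `inputSchema` keys follow the shape MCP uses for tool definitions.

```python
# Hypothetical tool definition; every name and limit below is an assumption.
TOOL_SPEC = {
    "name": "permutation_test",
    "description": (
        "Two-sample permutation test on the difference in means. "
        "Returns a p-value computed by resampling, never an estimate by the model."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "a": {"type": "array", "items": {"type": "number"}, "minItems": 2},
            "b": {"type": "array", "items": {"type": "number"}, "minItems": 2},
            "n_resamples": {"type": "integer", "minimum": 100},
        },
        "required": ["a", "b"],
        "additionalProperties": False,
    },
}


def _is_number(x):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def validate_arguments(args):
    """Return a list of problems; an empty list means the call is acceptable."""
    schema = TOOL_SPEC["inputSchema"]
    props = schema["properties"]
    problems = [f"missing required field: {k}" for k in schema["required"] if k not in args]
    problems += [f"unexpected field: {k}" for k in args if k not in props]
    for key, rule in props.items():
        if key not in args:
            continue
        value = args[key]
        if rule["type"] == "array":
            if not isinstance(value, list) or not all(_is_number(v) for v in value):
                problems.append(f"{key}: expected a list of numbers")
            elif len(value) < rule["minItems"]:
                problems.append(f"{key}: need at least {rule['minItems']} values")
        elif rule["type"] == "integer":
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"{key}: expected an integer")
            elif value < rule["minimum"]:
                problems.append(f"{key}: must be >= {rule['minimum']}")
    return problems
```

Returning specific, readable problems gives the model a chance to repair a malformed call instead of guessing. The description field is where result semantics can be spelled out so that the model does not reinterpret the numbers it receives.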

The post closes with a call for collaboration among practitioners working on MCP tooling for quantitative applications, signaling that while the conceptual case for hybrid LLM-plus-computation architectures is well established, the engineering details remain unsolved and fragmented across individual efforts. The author also asks readers for relevant information sources, which further suggests that the ecosystem around rigorous quantitative tooling for LLM agents is still nascent. As AI systems are increasingly deployed in scientific, financial, and analytical contexts where statistical validity is non-negotiable, the design of reliable MCP-style interfaces for computational backends is likely to become a significant subfield of applied AI engineering.
