[Show & Tell] One domain expert + Claude Code, 18 days, +243,569 lines: shipped an agent-native causal inference framework for Python

A Stanford econometrician and Claude Code collaboratively developed StatsPAI v1.0, a causal inference framework for Python, in 18 days. The release comprises 243,569 lines of code, 836 public functions, and 2,834 tests. The expert directed API design and validated correctness, catching critical errors in sign conventions, inference calculations, and edge cases, while Claude Code handled scaffolding and initial implementations. Correctness is checked by reference-parity test suites benchmarked against Stata and R.

Detailed Analysis

A Stanford-affiliated econometrician posting under the handle brycewang-stanford published a detailed account of how he built StatsPAI v1.0, a full-featured causal inference library for Python, in 18 days with Claude Code as the primary development tool. The project produced 243,569 lines of code across 234 commits, including 836 publicly registered functions, 2,834 tests, and reference-parity validation suites benchmarked against Stata and R. The library includes a Rust backend for high-dimensional fixed effects, exposed to Python via PyO3, which was drafted by Claude Code and reviewed by the human maintainer. The author is explicit that the division of labor was asymmetric and intentional: Claude Code handled scaffolding, test generation, docstrings, and first-draft estimator implementations, while the domain expert retained full authority over API design, numerical tolerances, paper selection, and correctness adjudication.

The author's candid documentation of failure modes is among the most instructive aspects of the post. Three recurring error categories emerged during development. The first was sign convention ambiguity: Claude Code would silently adopt one of two valid notational choices from the literature and produce plausible but incorrect output. The second was inference miscalculation: point estimates were often acceptable on the first pass, but standard errors (particularly cluster-robust sandwich forms, degrees-of-freedom adjustments, and wild-bootstrap weights) were frequently wrong. The third was unspecified edge cases, including singleton clusters, collinear covariates within partitions, and zero-mass bins in regression discontinuity designs, which papers routinely assume away but real data trigger immediately. Taken together, these categories suggest that Claude Code functions as a highly literate but empirically untested collaborator: one that has processed the relevant literature but has never been required to defend a result under scrutiny.
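To make the inference category concrete, here is a minimal sketch of a cluster-robust sandwich standard error for a one-regressor model without an intercept, written in plain Python. It is not StatsPAI code: the function name `cluster_robust_se`, the no-intercept model, and the choice of the Stata-style CR1 small-sample factor are all assumptions made for illustration. It shows two of the places where a first draft can go wrong: the degrees-of-freedom factor and the handling of singleton clusters.

```python
from collections import defaultdict
from math import sqrt


def cluster_robust_se(x, y, clusters):
    """Slope and CR1 cluster-robust SE for y = beta * x + e (no intercept).

    Hypothetical helper for illustration only; not part of StatsPAI.
    """
    n, k = len(x), 1  # k = number of estimated coefficients
    groups = defaultdict(list)
    for i, g in enumerate(clusters):
        groups[g].append(i)
    n_clusters = len(groups)
    if n_clusters < 2:
        raise ValueError("need at least two clusters for a cluster-robust variance")

    sxx = sum(xi * xi for xi in x)  # "bread": X'X for a single regressor
    if sxx == 0:
        raise ValueError("regressor is identically zero")
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]

    # "Meat": sum over clusters of the squared within-cluster score sums.
    meat = sum(sum(x[i] * resid[i] for i in idx) ** 2 for idx in groups.values())

    # Stata-style CR1 small-sample factor: G/(G-1) * (N-1)/(N-K).
    dof = n_clusters / (n_clusters - 1) * (n - 1) / (n - k)
    variance = dof * meat / sxx**2

    # Report singleton clusters so the caller can decide how to treat them.
    singletons = sorted(g for g, idx in groups.items() if len(idx) == 1)
    return beta, sqrt(variance), singletons


beta, se, singletons = cluster_robust_se(
    [1, 2, 3, 4], [1, 2, 3, 5], ["a", "a", "b", "c"]
)
```

Returning the singleton clusters, rather than dropping or ignoring them silently, leaves that decision to the caller, which is the kind of choice the post describes the domain expert having to make explicitly.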

The workflow patterns the author describes show how a domain expert can structure work with a large language model agent when building a technical library. The test-first loop, in which reference-parity test targets are established before the estimator is implemented so that each iteration must converge to a numerical tolerance, proved particularly effective at catching inference errors before they propagated. Feeding entire academic papers and canonical reference implementations as context for each estimator substantially improved first-draft quality compared with generic prompting. The registry pattern, which required every new function to be explicitly registered with a JSON schema, served as an architectural forcing function that exposed hallucinated APIs immediately rather than allowing them to persist undetected in the codebase. These patterns suggest that productive human-agent collaboration in scientific computing depends heavily on the human establishing formal verification checkpoints rather than treating the agent's output as a finished artifact.
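The test-first loop can be sketched as follows. This is an illustrative sketch, not the project's test suite: the names `REFERENCE`, `REL_TOL`, `ols_fit`, and `parity_failures` are hypothetical, the tolerance is an assumed value, and the reference numbers are hand-computed for a toy dataset where the real workflow would use values exported from Stata or R.

```python
import math

# Placeholder reference values. In the workflow described above these would be
# exported from Stata or R; here they are hand-computed for the toy data below.
REFERENCE = {"slope": 2.0, "intercept": 1.0}
REL_TOL = 1e-8  # assumed tolerance; the source does not state the real one

X = [0.0, 1.0, 2.0, 3.0]
Y = [1.0, 3.0, 5.0, 7.0]


def ols_fit(x, y):
    """Simple OLS with an intercept, standing in for a drafted estimator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def parity_failures(estimate, reference, rel_tol=REL_TOL):
    """Return the names of quantities that miss the reference tolerance."""
    return [
        name
        for name, target in reference.items()
        if not math.isclose(estimate[name], target, rel_tol=rel_tol, abs_tol=1e-12)
    ]


# The target is written first; the estimator is revised until this list is empty.
failures = parity_failures(ols_fit(X, Y), REFERENCE)

# A draft that picked the opposite sign convention fails on exactly that quantity.
flipped = parity_failures({"slope": -2.0, "intercept": 1.0}, REFERENCE)
```

Because the check names the failing quantity, a sign-convention error or a wrong standard error surfaces as a specific test failure instead of a plausible-looking number.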

The project sits at the intersection of two trends in AI-assisted software development: the emergence of agentic coding tools capable of sustaining complex, multi-session development workflows, and the growing interest in "agent-native" software architectures designed from the ground up to be discoverable and callable by LLM agents rather than by human developers alone. StatsPAI's registry design, in which every function exposes a JSON schema, reflects an architectural philosophy aligned with how tools like Claude Code interact with codebases: the library itself becomes a tool surface for downstream AI agents. Anthropic's continued investment in Claude Code and the broader Claude Agent SDK suggests that this kind of human-AI co-development is a central use case, and the StatsPAI project offers an unusually granular public post-mortem of what that workflow looks like on a large codebase in a rigorous scientific domain.
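A registry of this kind might look like the sketch below. It does not reproduce StatsPAI's actual registry: `REGISTRY`, `register`, `call_tool`, `tool_manifest`, and the example function `difference_in_means` are hypothetical names, and the schema layout is an assumed JSON-schema-style convention. The point is that a name missing from the registry, or a schema that disagrees with the function signature, fails immediately.

```python
import inspect
import json

REGISTRY = {}  # hypothetical global registry: tool name -> (function, schema)


def register(schema):
    """Register a function together with a JSON-schema-style argument spec."""
    def decorator(func):
        params = set(inspect.signature(func).parameters)
        declared = set(schema["parameters"]["properties"])
        if params != declared:
            # A schema that drifts from the real signature is rejected at import.
            raise TypeError(
                f"{func.__name__}: schema {sorted(declared)} "
                f"does not match signature {sorted(params)}"
            )
        REGISTRY[schema["name"]] = (func, schema)
        return func
    return decorator


def call_tool(name, arguments):
    """Dispatch a call by name, as an agent would; unknown names fail loudly."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool {name!r}; registered: {sorted(REGISTRY)}")
    func, schema = REGISTRY[name]
    missing = [p for p in schema["parameters"].get("required", []) if p not in arguments]
    if missing:
        raise ValueError(f"{name}: missing required arguments {missing}")
    return func(**arguments)


def tool_manifest():
    """Serialize every registered schema so an agent can discover the API."""
    return json.dumps([schema for _, schema in REGISTRY.values()], indent=2)


@register({
    "name": "difference_in_means",
    "description": "Difference between treated and control sample means.",
    "parameters": {
        "type": "object",
        "properties": {
            "treated": {"type": "array", "items": {"type": "number"}},
            "control": {"type": "array", "items": {"type": "number"}},
        },
        "required": ["treated", "control"],
    },
})
def difference_in_means(treated, control):
    return sum(treated) / len(treated) - sum(control) / len(control)


effect = call_tool("difference_in_means", {"treated": [3, 5], "control": [1, 3]})
```

An agent that invents a function name gets a `KeyError` listing the real tools, and the manifest gives it a machine-readable description of what is actually available.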

The limitations the author acknowledges are as significant as the accomplishments. Several frontier modules, including Sequential Synthetic Difference-in-Differences, Bayesian Causal Forests for longitudinal data, and proximal surrogate index methods, are validated only by simulation rather than against external reference implementations, because canonical author code does not yet exist for those methods. Docstring quality is uneven, dispatcher signatures show cross-family inconsistencies, and the CHANGELOG already contains correctness-fix tags that signal ongoing numerical revision. These rough edges are predictable consequences of an 18-day development cadence and reflect the current state of the art in agent-assisted library construction: dramatic speed gains accompanied by uneven quality that only sustained expert review can even out. The project is openly recruiting collaborators from econometrics, epidemiology, and causal ML, suggesting the maintainer views the v1.0 release as a foundation to be hardened rather than a finished product.
