Built an agentic RAG over my Obsidian vault so Claude could read engineering books I never have time for. Then I built the eval harness to check Claude wasn't lying to me.

A developer built a RAG system connecting Claude to engineering books in an Obsidian vault, using a cheaper model for retrieval to cut token usage from roughly 50k to 5k per question. The system's agent sometimes generated plausible but incorrect answers, prompting the developer to create an evaluation harness with Claude Sonnet 4.6 as the judge. Four iterations of rubric refinement, in particular removing the ambiguous middle score bucket and adding a bucket for correct answers drawn from equivalent non-canonical passages, improved judge-human agreement from 39% to 94%.

Detailed Analysis

A developer working on personal knowledge management built a two-stage agentic pipeline to extract value from engineering books they lacked time to read manually, combining a low-cost retrieval agent with Claude as the reasoning layer. The architecture routes PDF-converted markdown files through an Obsidian vault, where a Kimi K2.5 agent performs BM25 retrieval to surface relevant chunks before Claude ever sees any text. This design reduced token consumption per query by roughly 90 percent — from approximately 50,000 tokens to 5,000 — by ensuring Claude operates only on retrieved context rather than full documents. The technical choice of BM25 over vector embeddings was deliberate: in technical and literary corpora, the developer argues, vocabulary overlap between queries and documents is high enough that semantic embedding layers add marginal retrieval benefit while increasing complexity.
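
To make the retrieval step concrete, here is a minimal sketch of Okapi BM25 ranking over markdown chunks. It is not the developer's implementation: the whitespace tokenizer, the `k1` and `b` defaults, and the function names `bm25_scores` and `top_k` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption; a real setup would likely strip punctuation)."""
    return text.lower().split()


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n or 1.0  # guard against all-empty chunks
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency: count each term once per chunk

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Long chunks are penalized relative to the average chunk length.
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def top_k(query, chunks, k=3):
    """Indices of the k highest-scoring chunks, best first."""
    scores = bm25_scores(query, chunks)
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Only the chunks returned by something like `top_k` would be passed to the reasoning model, which is where the roughly tenfold token reduction described above comes from. A chunk that shares no terms with the query scores zero, which is the lexical-overlap assumption the developer is betting on.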

The more substantive contribution of the project is the evaluation harness built to detect confident hallucinations — cases where the system produced plausible but factually incorrect citations, such as misattributing a Marcus Aurelius passage to the wrong book and section. The developer used Claude Sonnet 4.6 as the judge model, explicitly choosing it to avoid the circular problem of a model grading its own outputs, since the retrieval agent runs on Kimi. What followed was a four-iteration rubric refinement process that surfaced a non-obvious problem in evaluation design: both the LLM judge and the human grader independently collapsed ambiguous cases into the same middle-score bucket (0.7), producing agreement numbers that looked strong but were actually measuring shared bias rather than genuine calibration. The fix required eliminating the ambiguous middle bucket and introducing a new 0.9-score category for a specific edge case — a correct answer sourced from a non-canonical but equivalent passage — which the prior rubric had forced into either a false positive or a false negative. This single structural change moved judge-human agreement from 7 out of 18 rows (39%) to 17 out of 18 (94%).
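
The agreement figures can be reproduced with a few lines of arithmetic. The sketch below computes exact-match agreement between judge and human scores and, as one possible diagnostic for the shared-bias problem, reports how the agreeing rows are distributed across score buckets. The function names and the diagnostic itself are assumptions for illustration, not the author's harness.

```python
from collections import Counter


def agreement(judge_scores, human_scores):
    """Return (matching_rows, total_rows, rate) for exact score matches."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human must grade the same rows")
    total = len(judge_scores)
    matches = sum(1 for j, h in zip(judge_scores, human_scores) if j == h)
    rate = matches / total if total else 0.0
    return matches, total, rate


def agreed_bucket_share(judge_scores, human_scores):
    """Fraction of agreeing rows that fall in each score bucket.

    A large share in one ambiguous bucket hints that agreement may reflect
    shared bias rather than calibration (hypothetical diagnostic).
    """
    agreed = Counter(j for j, h in zip(judge_scores, human_scores) if j == h)
    n = sum(agreed.values())
    return {bucket: count / n for bucket, count in agreed.items()} if n else {}


# The reported figures check out as rounded percentages.
print(round(100 * 7 / 18), round(100 * 17 / 18))  # 39 94
```

With 18 rows, each row is worth about 5.6 percentage points, so the headline numbers are sensitive to single-row changes.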

The negative result embedded in the write-up is arguably as instructive as the successes. The same chunking technique that improved retrieval quality by 33 percentage points on one corpus degraded performance by 17 percentage points on a second corpus evaluated under the same harness. This asymmetry illustrates a broader problem in RAG system development: optimization decisions that appear generalizable often encode assumptions about document structure, vocabulary density, or query type that don't transfer across corpora. The fact that the eval harness caught the regression on the first run is the developer's implicit argument for why evaluation infrastructure should be built before, or at minimum alongside, system tuning — not as an afterthought.
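
A per-corpus regression check of the kind implied here can be very small. The following is a sketch under stated assumptions: the function name `find_regressions`, the corpus labels, and the absolute scores in the example are placeholders; only the +33 and -17 point deltas come from the write-up.

```python
def find_regressions(results, tolerance=0):
    """Return corpora whose score dropped by more than `tolerance` points.

    `results` maps corpus name -> (baseline_pct, candidate_pct), in percentage points.
    """
    regressions = {}
    for corpus, (baseline, candidate) in results.items():
        delta = candidate - baseline
        if delta < -tolerance:
            regressions[corpus] = delta
    return regressions


# Placeholder absolute scores; only the deltas (+33, -17) mirror the write-up.
results = {"corpus_a": (50, 83), "corpus_b": (60, 43)}
print(find_regressions(results))  # {'corpus_b': -17}
```

Running a check like this on every corpus after each tuning change is what turns an average improvement into a per-corpus claim.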

The project sits at the intersection of several active areas in applied AI development. LLM-as-judge evaluation patterns have become common in production RAG systems, but rubric design remains largely artisanal, and the shared-bias failure mode the developer identified — where judge and human converge on the same miscalibrated scoring behavior — is underexplored in published literature relative to raw agreement metrics. The developer's framing of this as a calibration worksheet problem, with per-row shift tracking across rubric versions, reflects an emerging methodology in evaluation engineering that treats judge behavior as a design artifact to be iterated on rather than a ground-truth proxy to be accepted. The single-grader limitation and small sample size (18 rows) are acknowledged honestly, and the next planned work on adversarial slices suggests awareness that agreement on clean cases is a necessary but insufficient condition for a robust eval harness.
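
Per-row shift tracking across rubric versions might look like the sketch below. The data layout (row id mapped to a judge score and a human score), the category names, and the example scores are assumptions for illustration; the write-up does not specify the worksheet format.

```python
def row_shifts(old, new):
    """Classify each row by how judge-human agreement changed between rubric versions.

    `old` and `new` map row id -> (judge_score, human_score).
    Only rows present in both versions are compared.
    """
    shifts = {"gained": [], "lost": [], "still_agree": [], "still_disagree": []}
    for row_id in sorted(old.keys() & new.keys()):
        was = old[row_id][0] == old[row_id][1]
        now = new[row_id][0] == new[row_id][1]
        if was and now:
            key = "still_agree"
        elif now:
            key = "gained"
        elif was:
            key = "lost"
        else:
            key = "still_disagree"
        shifts[key].append(row_id)
    return shifts


# Synthetic rows, not the author's data.
old = {"q1": (0.7, 0.7), "q2": (0.7, 1.0), "q3": (0.0, 0.0)}
new = {"q1": (0.9, 1.0), "q2": (1.0, 1.0), "q3": (0.0, 0.0)}
print(row_shifts(old, new))
```

Looking at which rows moved, rather than only at the aggregate rate, makes it visible when a rubric change fixes some rows while breaking others.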

More broadly, the workflow described reflects a practical response to a structural tension in using frontier models like Claude for intensive knowledge work: the context window is large enough to be powerful but expensive enough to require intelligent gating. The pattern of using a cheaper, faster model for retrieval and routing while reserving a more capable model for synthesis and judgment — and then using that same capable model in a separate judge role with a different prompt — is becoming a standard architectural template in agent pipelines. The developer's experience reinforces that the hardest problems in this architecture are not retrieval mechanics or model selection but evaluation methodology: determining, at scale and without manual verification of every output, whether the system's confident answers are actually correct.
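
The three-role template described above can be sketched with the model calls injected as plain functions. Everything here is hypothetical: `answer_with_verdict` and the stub stages stand in for real model clients, and no actual API is being called.

```python
def answer_with_verdict(question, chunks, retrieve, synthesize, judge, k=3):
    """Run retrieval, synthesis, and judging as three separate, swappable stages."""
    context = retrieve(question, chunks)[:k]    # cheap model or lexical search
    answer = synthesize(question, context)      # capable model, sees only retrieved text
    verdict = judge(question, context, answer)  # separate judge role with its own prompt
    return {"context": context, "answer": answer, "verdict": verdict}


# Stub stages so the sketch runs offline; real stages would call model APIs.
def stub_retrieve(question, chunks):
    words = set(question.lower().split())
    return [c for c in chunks if words & set(c.lower().split())]


def stub_synthesize(question, context):
    return context[0] if context else "no supporting passage found"


def stub_judge(question, context, answer):
    return 1.0 if answer in context else 0.0
```

Keeping the stages as injected callables means the judge can be pointed at a different model than the one producing answers, which is the separation the developer relied on.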
