Reducing LLM context from ~80K tokens to ~2K without embeddings or vector DBs

A developer reduced LLM context from approximately 80,000 tokens to 2,000 by extracting structural signals like functions, classes, and routes, then ranking files using token overlap and heuristics rather than embeddings or vector databases. The approach achieved a 97% reduction in context size while keeping relevant files in the top five results 70-80% of the time, with noticeably fewer task retries. The solution runs entirely locally using simple parsing and ranking without external dependencies.

Detailed Analysis

A developer working with large codebases and LLMs has published an open-source tool called **sigmap** that addresses one of the more persistent practical limitations in applied AI development: the inability to fit real-world repositories into a model's context window without degrading response quality. Rather than relying on embeddings or vector databases — the conventional solutions in the retrieval-augmented generation (RAG) ecosystem — sigmap extracts lightweight structural signals from code, specifically functions, classes, and routes, and uses them to build a local index. A simple ranking algorithm combining token overlap, structural signals, and basic heuristics such as recency and dependency graphs is then applied per query, producing a distilled "context layer" of roughly 2,000 tokens from what might otherwise require 80,000 or more.
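The extract-then-rank idea described above can be illustrated with a minimal sketch. This is not sigmap's actual code: the function names (`extract_signatures`, `tokenize`, `rank_files`), the sample files, and the scoring rule (plain token overlap, with no recency or dependency signals) are all assumptions made for illustration, and the parser handles only Python source.

```python
import ast
import re


def extract_signatures(source: str) -> list[str]:
    """Collect one-line signatures for the functions and classes in Python source."""
    signatures = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            signatures.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            signatures.append(f"class {node.name}")
    return signatures


def tokenize(text: str) -> set[str]:
    """Lowercase word tokens; snake_case and camelCase identifiers are split apart."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)  # camelCase boundary
    return {t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t}


def rank_files(query: str, index: dict[str, list[str]], top_k: int = 5) -> list[str]:
    """Rank files by how many query tokens appear in their path and signatures."""
    query_tokens = tokenize(query)
    scores = {
        path: len(query_tokens & tokenize(path + " " + " ".join(sigs)))
        for path, sigs in index.items()
    }
    # Highest overlap first; ties are broken alphabetically so output is deterministic.
    return sorted(index, key=lambda p: (-scores[p], p))[:top_k]


# Hypothetical two-file repository used only for this example.
FILES = {
    "auth.py": "class SessionStore:\n    pass\n\ndef login_user(username, password):\n    return True\n",
    "billing.py": "def charge_card(amount):\n    return amount\n",
}
index = {path: extract_signatures(src) for path, src in FILES.items()}
print(rank_files("fix the login user flow", index))  # ['auth.py', 'billing.py']
```

Only the signatures of the top-ranked files would then be sent to the model, which is where the token savings come from. A real tool would need parsers for other languages and additional signals for routes, recency, and dependencies.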

The reported results are large enough to matter in practice. Across multiple repositories, the author observed a ~97% reduction in context size, with relevant files appearing in the top-5 results approximately 70–80% of the time and noticeably fewer retries per task. The core insight, that *structured context* outperformed raw model scale in many scenarios, challenges a common assumption that larger context windows or more powerful models are the primary levers for improving LLM performance on complex codebases. This is consistent with broader observations in prompt engineering, where concise, hierarchically organized inputs tend to outperform verbose, unfiltered ones, partly because of the "lost in the middle" effect: models tend to use information buried in the middle of a long context less reliably than information near its start or end.
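For readers who want to reproduce this kind of measurement on their own repositories, the two headline metrics can be computed as sketched below. The function names and the evaluation setup (one known relevant file per query) are assumptions; the source does not describe how the author measured them.

```python
def context_reduction(before_tokens: int, after_tokens: int) -> float:
    """Fraction of the original context removed."""
    return 1 - after_tokens / before_tokens


def hit_rate_at_k(ranked_results: list[list[str]], relevant: list[str], k: int = 5) -> float:
    """Share of queries whose known relevant file appears in the top-k ranking."""
    if not relevant:
        raise ValueError("need at least one labelled query")
    hits = sum(1 for ranked, target in zip(ranked_results, relevant) if target in ranked[:k])
    return hits / len(relevant)


# 80,000 -> 2,000 tokens removes 97.5% of the context, in line with the reported ~97%.
print(f"{context_reduction(80_000, 2_000):.1%}")
```

A hit rate of 0.7 to 0.8 from `hit_rate_at_k` would correspond to the 70–80% top-5 figure quoted above.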

The deliberate avoidance of embeddings and vector databases is both a technical and philosophical design choice. Traditional RAG pipelines introduce significant infrastructure dependencies (embedding model serving, vector store maintenance, similarity search latency), which create friction in local or resource-constrained environments. Sigmap's heuristic-only approach sacrifices some retrieval precision in exchange for zero external dependencies and deterministic, inspectable behavior. This positions the tool in a growing category of "lightweight context management" solutions that prioritize portability and simplicity, echoing techniques like prompt compression, rolling memory buffers, and map-reduce summarization, which have been reported to reduce token usage in production by as much as 70% without substantial quality loss.

The open questions the author raises — particularly around where heuristic ranking breaks down relative to embedding-based retrieval, and how to verify grounding in provided context — sit at the frontier of applied LLM engineering. Hybrid approaches combining structural signals with sparse or dense retrieval represent a logical next step, potentially capturing the low-latency benefits of heuristics for common cases while falling back to semantic search for queries that require deeper conceptual matching. The grounding verification question is especially significant: as context compression becomes more aggressive, the risk of models confabulating details not present in the truncated window increases, making output attribution and factual anchoring a critical unsolved problem in production deployments.
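Both open questions can be made concrete with a small sketch. Everything here is hypothetical: `hybrid_rank` falls back to a caller-supplied semantic search only when the best heuristic score is below an arbitrary threshold, and `ungrounded_identifiers` is a deliberately crude grounding check that flags backtick-quoted names in a model's answer that never appear in the supplied context. Neither is taken from sigmap.

```python
import re
from typing import Callable


def hybrid_rank(
    query: str,
    heuristic_scores: dict[str, float],
    semantic_search: Callable[[str], list[str]],
    min_score: float = 2.0,  # assumed threshold; would need tuning per repository
    top_k: int = 5,
) -> list[str]:
    """Use the heuristic ranking when it is confident, else defer to semantic search."""
    ranked = sorted(heuristic_scores, key=lambda p: (-heuristic_scores[p], p))
    best = heuristic_scores[ranked[0]] if ranked else 0.0
    if best >= min_score:
        return ranked[:top_k]
    return semantic_search(query)[:top_k]


def ungrounded_identifiers(answer: str, context: str) -> list[str]:
    """Return backtick-quoted names from the answer that do not occur in the context."""
    quoted = re.findall(r"`([^`]+)`", answer)
    return [name for name in quoted if name not in context]


context = "def login_user(username, password)"
answer = "Call `login_user` and then `refresh_token`."
print(ungrounded_identifiers(answer, context))  # ['refresh_token']
```

A substring check like this catches only invented identifiers, not incorrect claims about real ones, so it is a starting point rather than a solution to the grounding problem.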

Sigmap's contribution reflects a maturing phase in LLM tooling, where practitioners are moving beyond benchmark-optimized architectures and toward pragmatic, domain-specific solutions that work within real infrastructure constraints. The project illustrates that meaningful gains in LLM usability for software engineering tasks can be achieved through careful information architecture rather than scaling alone, a finding with broad implications for cost, latency, and accessibility across the industry. As model providers including Anthropic continue to expand context windows (Claude supports up to 200K tokens at the time of writing), the counterintuitive lesson from work like sigmap is that larger windows do not eliminate the need for intelligent context curation; they simply raise the ceiling on how much noise a poorly structured prompt can introduce.
