I built "Semvec": A Constant-Cost Semantic Memory for LLMs (Looking for testers!)

Semvec is a semantic memory system that replaces unbounded conversation histories with fixed-size compressed context and tiered memory to maintain constant token costs and latency across LLM conversation turns. According to its developer, the tool achieves approximately 76% token reduction in 48-turn benchmarks while providing drop-in compatibility with OpenAI-compatible LLMs, MCP server integration for coding agents, and multi-agent coordination capabilities.

Detailed Analysis

Semvec is a newly released open-source Python library designed to address one of the most persistent structural limitations in large language model application development: the unbounded growth of conversation history and its compounding effects on token cost, latency, and contextual coherence. Built by an independent developer and shared in the Claude AI subreddit community, the tool replaces conventional rolling conversation logs with a fixed-size semantic state paired with a tiered, content-aware memory architecture that categorizes context into short-, medium-, and long-term layers. The developer reports that in 48-turn benchmark tests, Semvec achieves approximately a 76% reduction in token usage while preserving structured access to decisions, error patterns, and prior context — with the key architectural promise being that the input footprint for turn 10 and turn 10,000 remains identical.
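The "turn 10 equals turn 10,000" claim can be made concrete with a little arithmetic. The sketch below compares per-turn input size for a resend-everything history against a fixed-size state. The constants `TOKENS_PER_TURN` and `STATE_TOKENS` are made-up illustrative values, not Semvec measurements, and the function names are hypothetical rather than part of the library.

```python
# Illustrative arithmetic only: all numbers are assumptions, not benchmark data.

TOKENS_PER_TURN = 200   # hypothetical average size of one message
STATE_TOKENS = 1000     # hypothetical size of the fixed compressed state


def rolling_history_tokens(turn: int, tokens_per_turn: int) -> int:
    """Input tokens at `turn` when every earlier message is resent."""
    return turn * tokens_per_turn


def fixed_context_tokens(turn: int, state_tokens: int, tokens_per_turn: int) -> int:
    """Input tokens at `turn` when only a fixed state plus the new message is sent.

    `turn` is accepted but unused, which is the whole point: the cost
    does not depend on how long the conversation has been running.
    """
    return state_tokens + tokens_per_turn


for turn in (10, 10_000):
    rolling = rolling_history_tokens(turn, TOKENS_PER_TURN)
    fixed = fixed_context_tokens(turn, STATE_TOKENS, TOKENS_PER_TURN)
    print(f"turn {turn}: rolling={rolling} tokens, fixed={fixed} tokens")
```

Under these assumptions the rolling history grows linearly per turn (and so quadratically in total), while the fixed context stays flat. Real savings depend on message sizes and on how much information the compressed state preserves.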

The library's design centers on several practical components aimed at real-world deployment. Its drop-in chat proxy lets developers wrap any OpenAI-compatible LLM endpoint, including local inference servers such as vLLM and Ollama and cloud routers such as OpenRouter, and receive compressed context without modifying application logic. Of particular interest to Claude users is a native Model Context Protocol (MCP) server, which enables persistent memory across coding sessions in Claude Code and Cursor IDE. The tiered memory system uses a selective forgetting mechanism in which access frequency, not recency alone, determines what is retained: older but frequently referenced memories can outlive newer but untouched ones, a meaningful departure from simple sliding-window or recency-based truncation.

The release arrives at a moment of significant industry-wide tension around context window economics. Frontier model providers, Anthropic among them, have progressively expanded context windows, and Claude models now support hundreds of thousands of tokens. Yet the cost and latency penalties of filling those windows remain substantial for production applications, particularly in agentic workflows involving many sequential turns. Semvec positions itself not as a replacement for long-context models but as a cost-control and efficiency layer that keeps per-turn input cost constant at scale, which is a meaningfully different value proposition from simply having a larger window available.

The multi-agent coordination feature, exposed through the `semvec.cortex` module, reflects a broader architectural trend in AI systems toward shared state and inter-agent communication. As autonomous agent frameworks grow more prevalent — particularly in coding and research workflows where multiple specialized agents operate in parallel — the need for lightweight, structured mechanisms to synchronize context without redundant token expenditure becomes increasingly acute. Semvec's state vector exchange model represents one developer's attempt to solve this coordination problem at the memory layer rather than at the orchestration layer, which could prove complementary to existing frameworks like LangGraph or AutoGen rather than competitive with them.
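The post does not document how `semvec.cortex` exchanges state, so the following is only a conceptual sketch of the general idea: agents share a fixed-length vector instead of a transcript, so the payload size does not depend on how many turns each agent has processed. The `Agent` class, the `blend` function, and the weighted-average merge rule are all hypothetical and are not taken from the library.

```python
def blend(own: list[float], peer: list[float], weight: float = 0.5) -> list[float]:
    """Merge a peer's state vector into our own by weighted average (assumed rule)."""
    if len(own) != len(peer):
        raise ValueError("state vectors must have the same length")
    return [(1 - weight) * a + weight * b for a, b in zip(own, peer)]


class Agent:
    """Hypothetical agent holding a fixed-size semantic state."""

    def __init__(self, name: str, state: list[float]) -> None:
        self.name = name
        self.state = state

    def share(self) -> list[float]:
        # The shared payload is always len(self.state) floats, however
        # long this agent's own conversation has been.
        return list(self.state)

    def absorb(self, peer_state: list[float], weight: float = 0.5) -> None:
        self.state = blend(self.state, peer_state, weight)


coder = Agent("coder", [1.0, 0.0])
reviewer = Agent("reviewer", [0.0, 1.0])
coder.absorb(reviewer.share())
print(coder.state)
```

Because the exchange happens on the agents' memory state, a scheme like this could in principle sit underneath an orchestration framework rather than replace it, which matches the complementary positioning described above.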

The project is currently in an active testing phase, with the developer explicitly soliciting feedback from developers working on RAG pipelines, chatbots, and IDE-integrated coding agents. The library is available via PyPI for Python 3.10 through 3.14, and its out-of-the-box support for Claude Code and Cursor places it squarely within the growing ecosystem of developer tooling that extends existing AI coding tools rather than replacing them. Whether Semvec's reported compression results hold across diverse domain-specific conversations and adversarial edge cases is the central question its testers will need to answer. Even so, its framing of memory as a tiered, frequency-weighted semantic structure rather than a temporal log is a technically serious contribution to the applied LLM tooling space.
