Glia – Local-first shared memory layer (SQLite-vec + FTS5 + Offline Knowledge Graph)

Glia is an offline, local-first RAG and memory layer that connects AI web chats such as Claude and ChatGPT with local developer tools through a shared SQLite database. It uses hybrid search that combines sqlite-vec embeddings with FTS5 keyword matching, sentence-level trimming that the author reports cuts prompt size by 90-95%, and local knowledge graph extraction. The open-source project runs entirely on local Ollama instances, needs no Docker containers or third-party APIs, and can be set up with a single command.

Detailed Analysis

Glia is an open-source, local-first memory and retrieval-augmented generation (RAG) layer built by developer Eshaan Nair that aims to unify AI conversation history and context across both web-based AI chat interfaces and local developer tooling. Released under the MIT license and deployable via a single `npx glia-ai-setup` command, the project uses a Node.js and SQLite architecture centered on two complementary retrieval mechanisms: sqlite-vec for 768-dimensional float32 vector embeddings generated by a locally running Ollama instance (using the nomic-embed-text model), and SQLite's native FTS5 extension for keyword-based prefix matching with porter stemmer normalization. The system operates entirely offline, explicitly avoiding Docker container dependencies and third-party memory APIs that have become common in competing solutions. Browser extension support spans Claude.ai, ChatGPT, DeepSeek, Gemini, Grok, and Mistral, while an MCP (Model Context Protocol) server bridges the same SQLite backend to terminal agents and IDE-integrated coding tools like Cursor and Windsurf.
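The summary does not say how Glia merges its vector and keyword result lists. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to combine rankings from two retrievers without having to normalize their scores. The function name, the document ids, and the constant `k=60` are assumptions made for this example and are not taken from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from Glia.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, each ordered best-first by its retriever.
vector_hits = ["c3", "c1", "c7"]   # e.g. nearest neighbours from a vector index
keyword_hits = ["c1", "c9", "c3"]  # e.g. full-text matches ordered by relevance

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists (here `c1` and `c3`) rise above documents found by only one retriever, which is the behaviour hybrid search is usually after.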

Several engineering decisions distinguish Glia from standard RAG implementations. The project applies sentence-level trimming: retrieved chunks are split into individual sentences, and only the highest-relevance sentences are forwarded to the language model rather than entire paragraphs. The author's benchmarks put the resulting reduction in LLM prompt token consumption at roughly 90–95%. Glia also incorporates HyDE (Hypothetical Document Embeddings), a technique in which the system generates a hypothetical answer to a query and embeds that answer, narrowing the semantic gap between the literal query text and the stored documents. An offline task queue uses a locally running llama3.1:8b model to extract entity-relation-object triples from stored content. The triples are persisted in a SQLite facts table or, optionally, in Neo4j, and knowledge graph scores are fused with vector retrieval scores at query time. SQLite's WAL (Write-Ahead Logging) mode lets readers proceed while a write is in progress, so the browser extension dashboard and active MCP sessions can share the database with little contention. Pre-storage PII redaction, which scrubs JWTs, API keys, email addresses, and IP addresses, is handled at the extension layer before any data reaches the database.
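The sentence-level trimming idea can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the splitter is a naive regular expression, and relevance is approximated by query-term overlap, whereas the summary implies Glia ranks sentences with its own retrieval scores. All function names and the sample text are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, sentence):
    # Stand-in relevance score: fraction of query terms found in the sentence.
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(sentence)) / len(query_terms)


def trim_chunks(query, chunks, keep=2):
    """Return the `keep` most relevant sentences, in their original order."""
    sentences = [s for chunk in chunks for s in split_sentences(chunk)]
    ranked = sorted(
        enumerate(sentences),
        key=lambda pair: -overlap_score(query, pair[1]),
    )
    kept_indices = sorted(index for index, _ in ranked[:keep])
    return [sentences[index] for index in kept_indices]


# Hypothetical retrieved chunks.
chunks = [
    "The deploy script lives in the tools folder. "
    "WAL mode is enabled at startup. Lunch was at noon.",
    "Embeddings use 768 dimensions. The cat sat on the mat.",
]
trimmed = trim_chunks("embedding dimensions and WAL mode", chunks, keep=2)
print(trimmed)
```

Only two of the five sentences reach the prompt here. Restoring the original order after ranking keeps the surviving sentences readable as a short excerpt rather than a shuffled list.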

The significance of Glia lies in its direct response to a structural gap in how developers currently interact with AI systems: the absence of persistent, portable memory that follows a user across different AI interfaces and local development environments without requiring cloud infrastructure or vendor lock-in. As AI assistants have proliferated across web platforms and integrated development environments, developers increasingly work across multiple systems — using Claude.ai for general reasoning, Cursor for code generation, and command-line agents for automation — without any shared context layer connecting them. Existing solutions to this problem tend to rely on proprietary memory APIs, heavyweight orchestration platforms, or cloud-hosted vector databases, all of which introduce privacy concerns, latency, and cost. Glia's SQLite-centric approach represents a deliberate architectural bet that the performance and feature capabilities of embedded databases — particularly following the sqlite-vec extension's emergence as a viable vector search backend — have crossed a threshold where they can serve production-grade RAG workloads on consumer hardware.

The project also connects to a broader developer movement toward local-first AI infrastructure that has accelerated in parallel with the mainstreaming of consumer-grade LLMs. The adoption of Ollama as a local model runtime has lowered the barrier to running embedding models and generative LLMs without GPU cloud access, enabling architectures like Glia's that treat the local machine as the primary compute and storage substrate. MCP, which Anthropic introduced as a standardized interface between language model agents and external tools, plays a structural role here: by exposing the Glia backend as an MCP server, the project integrates natively with the growing ecosystem of MCP-compatible coding agents, positioning shared memory as a first-class concern in agentic workflows rather than an afterthought bolted onto individual tools. The knowledge graph component, which fuses offline entity extraction with vector retrieval, reflects growing recognition in the research community that purely embedding-based RAG has systematic weaknesses in relational reasoning, and that hybrid retrieval combining dense vector search with structured symbolic representations can meaningfully improve answer quality on context-dependent queries.
