Architecture: Memory Vault AI
Audience: AI coding agents, senior contributors, and anyone building integrations. For a quick overview, see the documentation home. For algorithm-level detail, see Memory Logic.
System Overview
Memory Vault AI is a middleware system that sits between a user-facing application and any LLM. Its sole job is to manage what the model knows about the user across sessions.
┌───────────────────────────────────────────────────────────────────┐
│                        Client Application                         │
│              (chatbot, IDE plugin, voice assistant)               │
└──────────────────────────┬────────────────────────────────────────┘
                           │ HTTP or Python SDK
┌──────────────────────────▼────────────────────────────────────────┐
│                          Memory Vault AI                          │
│  ┌──────────────┐  ┌────────────┐  ┌──────────────────────────┐   │
│  │  API Layer   │  │    SDK     │  │        MCP Server        │   │
│  │  (FastAPI)   │  │  (Python)  │  │   (for AI agent tools)   │   │
│  └──────┬───────┘  └─────┬──────┘  └────────────┬─────────────┘   │
│         └────────────────┴──────────────────────┘                 │
│                          │                                        │
│  ┌───────────────────────▼───────────────────────────────────┐    │
│  │                        Core Engine                        │    │
│  │  Ingestion → Storage → Retrieval → Budget → Prompt Build  │    │
│  └───────────────────────────────────────────────────────────┘    │
│                          │                                        │
│  ┌───────────────────────▼───────────────────────────────────┐    │
│  │                       Storage Layer                       │    │
│  │          ChromaDB (vectors) + SQLite (metadata)           │    │
│  └───────────────────────────────────────────────────────────┘    │
└───────────────────────────────────────────────────────────────────┘
                           │
┌──────────────────────────▼────────────────────────────────────────┐
│                           LLM Provider                            │
│          Claude / GPT-4 / Ollama / any OpenAI-compatible          │
└───────────────────────────────────────────────────────────────────┘
Component Reference
1. API Layer (memory_vault/api/)
FastAPI application exposing REST endpoints. It handles authentication and request validation, then routes each request to the matching core engine method.
Key files:
- main.py — app factory and lifespan hooks
- routes/memory.py — memory CRUD endpoints
- routes/session.py — session management
- middleware/auth.py — API key validation
- middleware/rate_limit.py — per-user rate limiting
Contracts: See docs/api/API_SPEC.md. Do not change endpoint signatures without updating the spec.
2. Ingestion Engine (memory_vault/ingestion/)
Processes raw text into structured, embeddable memory chunks.
Pipeline:
raw_text
  → TextCleaner (normalize whitespace, strip PII markers)
  → Chunker (semantic chunking, not fixed-size)
  → EmbeddingModel (sentence-transformers, async batched)
  → ImportanceScorer (novelty + salience → float 0.0–1.0)
  → MemoryRouter (decide: episodic / semantic / working / skip)
Key classes:
- IngestionEngine — orchestrator (public interface)
- SemanticChunker — splits text at natural boundaries
- ImportanceScorer — scores chunks using cosine similarity to existing memory
- MemoryRouter — classifies memory type based on content + score
Decision: why semantic chunking?
See docs/adr/ADR-002-chunking-strategy.md.
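The scoring step can be illustrated with a short sketch. It assumes that novelty is one minus the highest cosine similarity to any stored embedding and that importance is a weighted blend of novelty and a caller-supplied salience value. The function names, the `salience` input, and the `novelty_weight` default are hypothetical and are not taken from the codebase.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_importance(
    embedding: list[float],
    existing: list[list[float]],
    salience: float,
    novelty_weight: float = 0.5,  # hypothetical weighting, not from the source
) -> float:
    """Blend novelty and salience into a score clamped to [0.0, 1.0]."""
    # Highest similarity to anything already stored; 0.0 when memory is empty.
    max_sim = max((cosine_similarity(embedding, e) for e in existing), default=0.0)
    # A chunk that closely matches existing memory has low novelty.
    novelty = 1.0 - max(0.0, max_sim)
    score = novelty_weight * novelty + (1.0 - novelty_weight) * salience
    return min(1.0, max(0.0, score))
```

Under these assumptions, a chunk identical to something already stored gets zero novelty, so its score depends on salience alone.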
3. Storage Layer (memory_vault/storage/)
Abstraction over all persistence backends. Feature code must never call ChromaDB or
SQLite directly — always use StorageLayer.
StorageLayer (abstract base)
├── ChromaAdapter — vector storage, similarity search
├── SQLiteAdapter — metadata, session tracking, procedural memory
└── CompositeStorage — coordinates both, ensures consistency
Memory type → backend mapping:
| Memory Type | Vectors (ChromaDB) | Metadata (SQLite) |
|---|---|---|
| Episodic | ✓ full content | session_id, timestamp, importance |
| Semantic | ✓ full content | entity type, confidence, source session |
| Working | in-memory only | session_id, ttl |
| Procedural | ✗ | key-value store in SQLite |
Schema: See docs/specs/DATABASE_SCHEMA.md.
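The adapter hierarchy above might look roughly like the following sketch. The class names `StorageLayer` and `CompositeStorage` come from this document, but the `put` and `delete` methods, the dict-backed `InMemoryAdapter`, and the rollback behavior are assumptions made for illustration. The sketch is synchronous for brevity, whereas the project is described as async-first.

```python
from abc import ABC, abstractmethod


class StorageLayer(ABC):
    """Minimal stand-in for the abstract base; the real interface may differ."""

    @abstractmethod
    def put(self, memory_id: str, record: dict) -> None: ...

    @abstractmethod
    def delete(self, memory_id: str) -> None: ...


class InMemoryAdapter(StorageLayer):
    """Dict-backed adapter standing in for ChromaAdapter or SQLiteAdapter."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def put(self, memory_id: str, record: dict) -> None:
        self.rows[memory_id] = record

    def delete(self, memory_id: str) -> None:
        self.rows.pop(memory_id, None)


class CompositeStorage(StorageLayer):
    """Writes content to one backend and metadata to the other."""

    def __init__(self, vectors: StorageLayer, metadata: StorageLayer) -> None:
        self.vectors = vectors
        self.metadata = metadata

    def put(self, memory_id: str, record: dict) -> None:
        self.vectors.put(memory_id, {"content": record["content"]})
        try:
            meta = {k: v for k, v in record.items() if k != "content"}
            self.metadata.put(memory_id, meta)
        except Exception:
            # Undo the vector write so the two stores do not drift apart.
            self.vectors.delete(memory_id)
            raise

    def delete(self, memory_id: str) -> None:
        self.vectors.delete(memory_id)
        self.metadata.delete(memory_id)
```

Feature code would hold only a `StorageLayer` reference, so swapping a backend would not touch callers.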
4. Retrieval Engine (memory_vault/retrieval/)
Finds the most relevant memories for a given user query.
Pipeline:
query_text
  → QueryEmbedder (same model as ingestion)
  → ANNSearch (ChromaDB approximate nearest neighbor)
  → CandidateFilter (remove stale, low-importance, or irrelevant)
  → CrossEncoderReranker (optional: more accurate relevance scoring)
  → MemoryCompressor (summarize long chunks to save tokens)
  → RecallResult (list of MemoryChunk, total_tokens)
Key tunable parameters (set via config or env):
- top_k_candidates — how many ANN results to fetch (default: 20)
- top_k_return — how many to return after re-ranking (default: 5)
- reranker_enabled — enable cross-encoder re-ranking (default: false, adds latency)
- staleness_days — deprioritize memories older than N days
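How these parameters could interact is sketched below. The `Candidate` type, the `select_memories` function, the `min_importance` cutoff, the `stale_penalty` factor, and the 90-day value for `staleness_days` are all hypothetical. The sketch also assumes the search hits already carry a similarity score, so it leaves out embedding, ANN search, and re-ranking.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    content: str
    similarity: float   # score from the vector search, higher is better
    importance: float   # 0.0 to 1.0
    created_at: datetime


def select_memories(
    candidates: list[Candidate],
    now: datetime,
    top_k_candidates: int = 20,
    top_k_return: int = 5,
    staleness_days: int = 90,     # hypothetical; the source gives no default
    min_importance: float = 0.1,  # hypothetical cutoff
    stale_penalty: float = 0.5,   # hypothetical down-weighting factor
) -> list[Candidate]:
    """Filter and rank search hits, returning at most top_k_return of them."""
    # Stage 1: keep the best raw hits, as the vector search would.
    pool = sorted(candidates, key=lambda c: c.similarity, reverse=True)
    pool = pool[:top_k_candidates]
    # Stage 2: drop low-importance hits.
    pool = [c for c in pool if c.importance >= min_importance]
    # Stage 3: down-weight stale hits instead of removing them.
    cutoff = now - timedelta(days=staleness_days)

    def adjusted(c: Candidate) -> float:
        return c.similarity * (stale_penalty if c.created_at < cutoff else 1.0)

    return sorted(pool, key=adjusted, reverse=True)[:top_k_return]
```

Fetching more candidates than are returned gives the filter and the optional re-ranker room to discard hits without starving the final result.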
5. Context Budget Manager (memory_vault/budget/)
Enforces token limits so retrieved memories never overflow the LLM's context window.
Algorithm:
1. Count tokens in all retrieved memories using tiktoken (cl100k_base by default)
2. Sort memories by relevance score (descending)
3. Greedily include memories until token_budget is exhausted
4. Return included memories + token usage stats
The budget is set per-call, not globally, so callers can tune per-model.
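The four steps above can be sketched as follows. To stay self-contained, the example counts whitespace-separated words instead of calling tiktoken, and it assumes that selection stops at the first memory that does not fit. The source does not say whether smaller, lower-ranked memories may still be added after that point. The names `apply_budget` and `whitespace_token_count` are hypothetical.

```python
from typing import Callable


def whitespace_token_count(text: str) -> int:
    """Stand-in counter; the source uses tiktoken with cl100k_base."""
    return len(text.split())


def apply_budget(
    memories: list[tuple[str, float]],  # (content, relevance score)
    token_budget: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> tuple[list[str], int]:
    """Return the included contents and the number of tokens they use."""
    included: list[str] = []
    used = 0
    # Most relevant memories are considered first.
    for content, _score in sorted(memories, key=lambda m: m[1], reverse=True):
        cost = count_tokens(content)
        if used + cost > token_budget:
            break  # assumption: stop at the first memory that does not fit
        included.append(content)
        used += cost
    return included, used
```

To count real tokens, a caller could pass `count_tokens=lambda t: len(tiktoken.get_encoding("cl100k_base").encode(t))`. Because the budget is an argument rather than a global, the same function serves models with different context windows.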
6. Memory Compression Engine (memory_vault/compression/)
Background job that runs when episodic memory for a user exceeds compression_threshold
sessions. Summarizes old episodes to free storage and keep retrieval quality high.
Strategy:
- Group episodic memories by session
- Sessions older than threshold: summarize with LLM into a single semantic memory
- Original episodic memories are archived (not deleted) and marked compressed=True
- Compression runs as a background asyncio task, never blocking request handling
LLM used for compression: Configured via ML_COMPRESSION_MODEL (default: cheapest available).
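A minimal sketch of the strategy follows. It reads "older than threshold" as every session beyond the newest `compression_threshold` sessions, which is an interpretation and not something the source states. The `compress_old_sessions` function, the dict-based records, and the `summarize` callback that stands in for the LLM call are hypothetical.

```python
from collections import defaultdict
from typing import Callable


def compress_old_sessions(
    episodes: list[dict],                   # each has session_id, content, compressed
    session_order: list[str],               # session ids, oldest first
    compression_threshold: int,
    summarize: Callable[[list[str]], str],  # stand-in for the LLM call
) -> list[dict]:
    """Summarize sessions beyond the newest `compression_threshold` ones.

    Returns the new semantic memories and marks the source episodes
    as compressed in place; nothing is deleted.
    """
    by_session: dict[str, list[dict]] = defaultdict(list)
    for ep in episodes:
        if not ep["compressed"]:
            by_session[ep["session_id"]].append(ep)

    excess = len(session_order) - compression_threshold
    old_sessions = session_order[:excess] if excess > 0 else []

    summaries: list[dict] = []
    for session_id in old_sessions:
        group = by_session.get(session_id, [])
        if not group:
            continue  # already compressed or empty
        text = summarize([ep["content"] for ep in group])
        summaries.append(
            {"memory_type": "semantic", "content": text, "source_session": session_id}
        )
        for ep in group:
            ep["compressed"] = True  # archived, not deleted
    return summaries
```

In the engine described above, a function like this would run inside a background asyncio task so that it does not block request handling.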
7. Prompt Builder (memory_vault/prompt/)
Assembles the final context block to inject into the LLM prompt.
Output format:
<memory>
[Semantic] Alice is a backend engineer at a fintech startup.
[Semantic] Alice prefers concise answers with code examples.
[Episodic] 2024-01-15: Discussed PostgreSQL migration strategy.
[Procedural] Communication style: direct, technical, no preamble.
</memory>
Format is configurable. The default XML-like wrapper is readable by all major LLMs.
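As an illustration, the default block could be rendered by a function like the one below. The `build_memory_block` name and the dict keys (`memory_type`, `content`, and an optional `date`) are assumptions; the real builder presumably works on `MemoryChunk` objects.

```python
def build_memory_block(memories: list[dict]) -> str:
    """Render memories as the default XML-like block, one line per memory."""
    lines = ["<memory>"]
    for m in memories:
        label = m["memory_type"].capitalize()  # "semantic" becomes "Semantic"
        # Only entries that carry a date get the "YYYY-MM-DD: " prefix.
        prefix = f"{m['date']}: " if m.get("date") else ""
        lines.append(f"[{label}] {prefix}{m['content']}")
    lines.append("</memory>")
    return "\n".join(lines)
```

Tagging every line with its memory type lets the model tell stable facts from dated events without any further structure.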
8. Python SDK (memory_vault/sdk/)
High-level public interface. This is what end users import.
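from memory_vault import MemoryLayer
from memory_vault.models import MemoryConfig  # core models live in memory_vault/models.py

async def demo(text: str, query: str) -> None:
    # The SDK methods are awaited, so they must be called from an async function.
    ml = MemoryLayer(user_id="...", config=MemoryConfig(...))
    await ml.save(text, session_id="...")
    context = await ml.recall(query, token_budget=1500)
    await ml.forget(memory_id="...")
    memories = await ml.list(memory_type="semantic")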
The SDK is the contract. Anything in memory_vault.sdk is public API.
Breaking changes require a major version bump and docs/api/API_SPEC.md update.
9. CLI (memory_vault/cli/)
Debug and admin tooling built with Typer + Rich.
memory-vault memory list --user alice --type semantic
memory-vault memory search --user alice "PostgreSQL"
memory-vault memory delete --id <memory_id>
memory-vault session stats --user alice
memory-vault compress --user alice --dry-run
memory-vault server start --port 8000
10. MCP Server (memory_vault/mcp/)
Exposes memory operations as MCP tools, enabling direct integration with Claude Code, Cursor, Windsurf, and any MCP-compatible AI tool.
Exposed tools:
- memory_save — save a memory chunk
- memory_recall — retrieve relevant memories
- memory_list — list all memories for a user
- memory_forget — delete a memory
Guide: See docs/guides/MCP_INTEGRATION.md.
Data Models
Core Pydantic models live in memory_vault/models.py:
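from datetime import datetime
from enum import Enum
from typing import Literal
from pydantic import BaseModel, Field

class MemoryType(str, Enum):
    # Assumed definition, inferred from the memory_type comment below.
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    WORKING = "working"
    PROCEDURAL = "procedural"

class MemoryChunk(BaseModel):
    id: str
    user_id: str
    session_id: str
    content: str
    memory_type: MemoryType  # episodic | semantic | working | procedural
    importance: float  # 0.0 to 1.0
    embedding: list[float] | None
    created_at: datetime
    compressed: bool = False
    metadata: dict = Field(default_factory=dict)  # fresh dict per instance

class RecallResult(BaseModel):
    memories: list[MemoryChunk]
    total_tokens: int
    budget_used: float  # 0.0 to 1.0

class MemoryConfig(BaseModel):
    token_budget: int = 2000
    top_k: int = 5
    compression_threshold: int = 10
    embedding_model: str = "all-MiniLM-L6-v2"
    storage_backend: Literal["chroma", "qdrant"] = "chroma"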
Configuration
All configuration flows through memory_vault/config.py using Pydantic Settings.
Environment variables override defaults. See .env.example for all options.
Dependency Graph (no circular imports allowed)
cli, api, mcp, sdk
│
core engine (ingestion, retrieval, budget, prompt, compression)
│
storage (adapters)
│
models, config, exceptions, utils
models, config, exceptions, and utils must never import from higher layers.
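A layering rule like this can be checked mechanically. The sketch below parses a module's source with the standard `ast` module and reports absolute imports that point to a higher layer. The `LAYER_RANK` table is read off the graph above, while the `layer_violations` function and the decision to apply the rule to every layer (not only the bottom one) are assumptions. Relative imports and imports of the bare package are ignored.

```python
import ast

# Layer rank per top-level subpackage: a module may import equal or lower ranks.
LAYER_RANK = {
    "cli": 3, "api": 3, "mcp": 3, "sdk": 3,
    "ingestion": 2, "retrieval": 2, "budget": 2, "prompt": 2, "compression": 2,
    "storage": 1,
    "models": 0, "config": 0, "exceptions": 0, "utils": 0,
}


def layer_violations(module: str, source: str, package: str = "memory_vault") -> list[str]:
    """Return imported modules that sit in a higher layer than `module`."""
    own_rank = LAYER_RANK[module.split(".")[1]]
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if parts[0] != package or len(parts) < 2:
                continue  # third-party, stdlib, or bare package import
            if LAYER_RANK.get(parts[1], -1) > own_rank:
                violations.append(name)
    return violations
```

Run over every file in the package, a check of this kind could fail CI as soon as a low-level module reaches upward.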
Architecture Decision Records
Key decisions are documented in docs/adr/:
| ADR | Decision |
|---|---|
| ADR-001 | Use ChromaDB as default vector store (not Qdrant) for embedded mode |
| ADR-002 | Use semantic chunking instead of fixed-size chunking |
| ADR-003 | Four memory types modeled after human memory research |
| ADR-004 | Async-first design using asyncio + anyio |
| ADR-005 | Token counting with tiktoken, not character proxies |