
ADR-002: Semantic Chunking over Fixed-Size Chunking

Date: 2024-01
Status: Accepted
Deciders: Core team


Context

When ingesting text into memory, we must split it into storable units ("chunks"). Two primary strategies exist: fixed-size chunking (split every N tokens) and semantic chunking (split at natural content boundaries).

Decision

Use semantic chunking: split at double newlines first, then split long paragraphs at sentence boundaries, then merge short adjacent chunks.

Rationale:

Fixed-size chunking is simpler to implement but produces semantically incoherent chunks. Text such as "My preferred database is PostgreSQL. I use it for all projects." might be split as "My preferred database is Postgre" + "SQL. I use it for all projects.", which breaks the entity name in two and destroys the semantic unit.
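The fragmentation described above is easy to reproduce. The sketch below is illustrative only: it splits on characters as a stand-in for tokens, and the helper name `fixed_size_chunks` and the window size of 32 are assumptions chosen to reproduce the example, not part of the actual implementation.

```python
def fixed_size_chunks(text: str, n: int) -> list[str]:
    """Split text every n characters, ignoring content boundaries.

    Characters stand in for tokens here (an assumption for illustration).
    """
    return [text[i:i + n] for i in range(0, len(text), n)]


text = "My preferred database is PostgreSQL. I use it for all projects."
chunks = fixed_size_chunks(text, 32)
print(chunks)
# ['My preferred database is Postgre', 'SQL. I use it for all projects.']
```

Neither chunk contains the full entity name, so a query about "PostgreSQL" has no chunk that states the fact intact.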

Semantic chunks produce self-contained, meaningful units that embed better (the embedding model captures a complete thought, not a fragment), retrieve more accurately (the retrieved chunk actually answers the query), and are human-readable in debug output.

Bounds: Min 50 tokens, max 300 tokens. Chunks below the minimum are merged with neighbors. Chunks above the maximum are split at sentence boundaries.
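The three-step strategy and its bounds might be sketched as follows. This is a minimal illustration, not the actual SemanticChunker: the function names are hypothetical, tokens are approximated by whitespace-separated words, sentence detection is a naive regex, and the whitespace fallback for oversized single sentences is left out.

```python
import re

MIN_TOKENS = 50   # bounds taken from this ADR
MAX_TOKENS = 300


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def split_sentences(paragraph: str) -> list[str]:
    # Naive boundary: '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def pack_sentences(sentences: list[str], max_tokens: int) -> list[str]:
    """Greedily group whole sentences into pieces of at most max_tokens."""
    pieces, current = [], []
    for sentence in sentences:
        if current and count_tokens(" ".join(current + [sentence])) > max_tokens:
            pieces.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        pieces.append(" ".join(current))
    return pieces


def semantic_chunks(text: str, min_tokens: int = MIN_TOKENS,
                    max_tokens: int = MAX_TOKENS) -> list[str]:
    # Step 1: split at double newlines (paragraph boundaries).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Step 2: split paragraphs above the maximum at sentence boundaries.
    pieces = []
    for paragraph in paragraphs:
        if count_tokens(paragraph) > max_tokens:
            pieces.extend(pack_sentences(split_sentences(paragraph), max_tokens))
        else:
            pieces.append(paragraph)

    # Step 3: merge a piece into its predecessor when either one is below
    # the minimum and the merged result still fits under the maximum.
    chunks = []
    for piece in pieces:
        if chunks:
            prev_size = count_tokens(chunks[-1])
            size = count_tokens(piece)
            too_short = prev_size < min_tokens or size < min_tokens
            if too_short and prev_size + size <= max_tokens:
                chunks[-1] = chunks[-1] + "\n\n" + piece
                continue
        chunks.append(piece)
    return chunks
```

In this sketch a short chunk is only merged when the result stays within the maximum, so an isolated short chunk next to a near-full neighbor is kept as-is rather than violating the upper bound. That tie-breaking rule is a design assumption; the ADR does not specify it.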

Consequences

  • SemanticChunker is more complex to implement than a simple text[:n] split
  • Chunk count per document is variable, so context budget enforcement becomes more important
  • Edge case: very long single sentences (> 300 tokens, e.g. code) are split at whitespace as fallback
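The whitespace fallback for oversized single sentences could look roughly like this. It is a hedged sketch: `split_at_whitespace` is a hypothetical helper, and tokens are again approximated by whitespace-separated words rather than a real tokenizer.

```python
def split_at_whitespace(sentence: str, max_tokens: int = 300) -> list[str]:
    """Fallback for a single sentence that exceeds max_tokens.

    Splits on whitespace and regroups words into pieces of at most
    max_tokens words each. Words approximate tokens (an assumption).
    """
    words = sentence.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


# Example: a 650-"token" line with no sentence boundary, such as minified code.
long_line = " ".join(f"tok{i}" for i in range(650))
parts = split_at_whitespace(long_line)
print([len(p.split()) for p in parts])  # [300, 300, 50]
```

Note that this version normalizes runs of whitespace to single spaces, which may matter for whitespace-sensitive content such as code.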

Rejected Alternatives

  • Fixed-size (N tokens): Simple but semantically incoherent. Fragments entities and facts.
  • Sentence-only: Produces too many small chunks for long documents, inflating vector DB size.
  • Paragraph-only: Paragraphs vary too widely in size (1 sentence to 50+). Unreliable bounds.