ADR-005: Token Counting with tiktoken, Not Character Proxies

Date: 2024-01
Status: Accepted
Deciders: Core team


Context

The Context Budget Manager must enforce token limits so injected memory never overflows an LLM's context window. Token counting can be approximated (characters / 4) or done precisely with a tokenizer library.
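
To make the trade-off concrete, here is a minimal comparison (assuming the tiktoken package is installed; the sample string is illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# A short code snippet, where character-based proxies are least reliable.
text = "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"

approx = len(text) // 4          # chars / 4 proxy
exact = len(enc.encode(text))    # precise cl100k_base count

print(f"chars/4 estimate: {approx}, tiktoken count: {exact}")
```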

Decision

Use tiktoken with the cl100k_base encoding (the tokenizer used by GPT-4, and a close approximation for most other modern models) for all token counting.

Rationale:

Character-based approximations (chars / 4) can be off by up to 30% on code, multilingual text, and special characters. A 30% undercount against a 2000-token budget means we might actually inject 2600 tokens, enough to overflow smaller context windows or degrade output quality.

tiktoken is fast (Rust-backed), reliable, and supports all encodings used by OpenAI models. The cl100k_base encoding is a safe default: exact for GPT-4, and a close approximation of token counts for Claude, Mistral, and Llama 3.

For models using different tokenizers (e.g. Gemini), the count will be approximate but within 5–10% — acceptable for budget enforcement purposes.
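
As a sketch of how the Context Budget Manager can enforce a budget with this encoding (count_tokens and truncate_to_budget are illustrative names, not part of the public API):

```python
import tiktoken

_ENC = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Exact token count under the cl100k_base encoding."""
    return len(_ENC.encode(text))

def truncate_to_budget(text: str, token_budget: int) -> str:
    """Trim text so it decodes to at most token_budget tokens."""
    tokens = _ENC.encode(text)
    if len(tokens) <= token_budget:
        return text
    return _ENC.decode(tokens[:token_budget])
```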

Consequences

  • tiktoken is a required dependency (adds ~2MB to install)
  • Token counting is synchronous but fast (<1ms for typical chunks)
  • The token_budget parameter in all public APIs refers to tiktoken cl100k_base tokens
  • Users targeting models with very different tokenizers (unusual) may need to set a conservative budget, e.g. 80% of the actual limit, as sketched below
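
For illustration, a caller targeting such a model might derive token_budget with a safety margin like this (the context-window and reservation figures are hypothetical):

```python
MODEL_CONTEXT_WINDOW = 8192   # hypothetical limit of the target model
RESERVED_FOR_PROMPT = 4096    # hypothetical space kept for prompt and output

# cl100k_base counts are only approximate for this model's tokenizer,
# so keep 20% headroom when setting the budget.
token_budget = int((MODEL_CONTEXT_WINDOW - RESERVED_FOR_PROMPT) * 0.8)
```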

Rejected Alternatives

  • chars / 4 approximation: Fast and dependency-free, but errors of up to 30% make it unreliable for budget enforcement.
  • Model-specific tokenizers: More accurate, but this requires knowing the target model upfront and managing multiple tokenizer dependencies. Not worth the complexity for v0.x.