# Benchmarking Guide — Memory Vault AI
This guide explains how to run the built-in performance benchmark suite and record reproducible baseline numbers.
## What It Measures
The benchmark suite currently measures two high-signal operations:
- save latency and throughput
- recall latency and throughput
It runs against an in-memory benchmark harness so results are reproducible and not dominated by external storage/network variability.
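To make "latency and throughput" concrete, here is a minimal sketch of how a timed benchmark loop typically derives both numbers from the same run. This is illustrative only; `run_timed` and the no-op workload are hypothetical stand-ins, not the suite's actual code.

```python
import time
import statistics

def run_timed(op, count):
    """Time `count` calls of `op`; return per-run summary fields.

    `op` stands in for a single save or recall against the in-memory harness.
    """
    latencies_ms = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        op()
        # Per-operation latency, converted to milliseconds.
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    duration = time.perf_counter() - start
    return {
        "count": count,
        "duration_seconds": duration,
        # Throughput is simply operations completed per wall-clock second.
        "throughput_ops_per_second": count / duration,
        "mean_ms": statistics.fmean(latencies_ms),
    }

# Example with a trivial no-op standing in for a save:
report = run_timed(lambda: None, 100)
```

Latency and throughput are related but answer different questions: latency describes the cost of one operation, throughput describes how many operations complete per second over the whole run.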
## Run The Suite
From the repository root:
python scripts/benchmark/run_benchmark_suite.py
On Windows with the project virtual environment:
d:/github/Memory-Vault-AI/.venv/Scripts/python.exe scripts/benchmark/run_benchmark_suite.py
## Common Options
python scripts/benchmark/run_benchmark_suite.py \
--save-count 1000 \
--recall-count 500 \
--warmup-saves 100 \
--warmup-recalls 50 \
--top-k 5 \
--token-budget 512 \
--seed 42 \
--format json \
--output-file ./benchmark-results/local-baseline.json
Key flags:
- `--save-count`: Number of measured save operations.
- `--recall-count`: Number of measured recall operations.
- `--warmup-saves`: Save operations executed before timing starts.
- `--warmup-recalls`: Recall operations executed before timing starts.
- `--top-k`: Recall `top_k` used in measured runs.
- `--token-budget`: Recall token budget used in measured runs.
- `--seed`: Deterministic random seed for synthetic workload generation.
- `--format`: `text` or `json` output.
- `--output-file`: Optional path to write the report.
## Report Fields
For each operation (`save`, `recall`), the report includes:

- `count`
- `duration_seconds`
- `throughput_ops_per_second`
- `latency_ms`: `min_ms`, `mean_ms`, `p50_ms`, `p95_ms`, `p99_ms`, `max_ms`
Use `p95_ms` as your primary latency KPI and `throughput_ops_per_second` as a secondary scaling signal.
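The percentile fields above summarize the raw per-operation latencies. The sketch below shows one common way to compute them (nearest-rank percentiles); the actual suite may interpolate differently, and `latency_summary` is a hypothetical helper, not part of the project.

```python
import math

def latency_summary(samples_ms):
    """Summarize raw per-operation latencies (ms) into report-style fields."""
    ordered = sorted(samples_ms)

    def pct(p):
        # Nearest-rank percentile: the smallest sample such that at least
        # p% of all samples are <= it.
        idx = math.ceil(p / 100 * len(ordered)) - 1
        return ordered[idx]

    return {
        "min_ms": ordered[0],
        "mean_ms": sum(ordered) / len(ordered),
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "max_ms": ordered[-1],
    }

summary = latency_summary([1.0, 2.0, 3.0, 4.0, 100.0])
# A single slow outlier dominates p95/p99 while the median stays small,
# which is why tail percentiles are a better KPI than the mean.
```

This also illustrates why `p95_ms` beats `mean_ms` as a KPI: the mean hides occasional slow operations that users actually feel.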
## Baseline Workflow
- Run the benchmark 3-5 times on the same machine and environment.
- Store JSON outputs under `benchmark-results/` (gitignored or attached as CI artifacts).
- Compare the current `p95_ms` and throughput against your baseline.
- Investigate any regression greater than 10% before merging.
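The comparison step can be automated with a small script. The sketch below assumes two reports loaded from the JSON output (e.g. via `json.load`) and checks the 10% threshold for one operation; `check_regression` and the exact field paths are assumptions based on the report fields listed above, not a script shipped with the project.

```python
REGRESSION_THRESHOLD = 0.10  # flag changes worse than 10%

def check_regression(baseline, current, threshold=REGRESSION_THRESHOLD):
    """Compare a current report against a baseline for one operation.

    Returns a list of human-readable regression messages (empty = pass).
    Higher p95 latency and lower throughput both count as regressions.
    """
    problems = []
    base_p95 = baseline["latency_ms"]["p95_ms"]
    cur_p95 = current["latency_ms"]["p95_ms"]
    if cur_p95 > base_p95 * (1 + threshold):
        problems.append(
            f"p95 latency regressed: {base_p95:.2f} ms -> {cur_p95:.2f} ms"
        )
    base_tp = baseline["throughput_ops_per_second"]
    cur_tp = current["throughput_ops_per_second"]
    if cur_tp < base_tp * (1 - threshold):
        problems.append(
            f"throughput regressed: {base_tp:.1f} -> {cur_tp:.1f} ops/s"
        )
    return problems

# Hypothetical numbers: p95 worsened by 20%, throughput dipped only 4%.
baseline = {"latency_ms": {"p95_ms": 10.0}, "throughput_ops_per_second": 500.0}
current = {"latency_ms": {"p95_ms": 12.0}, "throughput_ops_per_second": 480.0}
issues = check_regression(baseline, current)
```

Running such a check in CI against the stored baseline turns the "investigate any regression greater than 10%" step into a hard gate rather than a manual review habit.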
## Notes
- This suite is designed for regression tracking and relative comparisons.
- Absolute production performance depends on storage backend, embedding model, dataset shape, and hardware.