Add hybrid prefix cache restore coverage by xxxkkw · Pull Request #1314 · ml-explore/mlx-lm

xxxkkw · 2026-05-27T12:39:34Z

Summary

Add conservative cache restore capability primitives for prefix-cache reuse, including retained range/result metadata and recursive CacheList restore support.
Preserve safe boundaries for hybrid caches: ArraysCache remains checkpoint-only by default, saturated rotating caches only restore at safe retained boundaries, and LRUPromptCache falls back to shorter safe checkpoints when longer candidates cannot be restored.
Fix related cache/server gaps: BatchRotatingKVCache.rotated bool metadata round-trip, ChunkedKVCache offset metadata after front trimming, and server --prompt-cache-bytes wiring.
Add a reusable prefix-cache benchmark that reports cold, exact-prefix hot, nearest-prefix hot reuse, and prompt-cache fetch overhead split into lookup/deepcopy/restore timings.
Expose a small library-level PrefixCacheSession wrapper around LRUPromptCache for non-server callers.
Document the single-request prompt checkpoint boundary without fabricating segment checkpoints before prefill boundary callbacks exist.
Log cache types that disable server batching when their prompt-cache implementation does not implement merge().
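The fallback behavior described in the summary (prefer the longest stored prefix, but fall back to a shorter safe checkpoint when a longer candidate cannot be restored) can be sketched with a toy store. This is a minimal illustration, not the mlx-lm implementation: `ToyPrefixStore`, `Checkpoint`, and the `restorable` flag are hypothetical names, and the real `LRUPromptCache` uses a trie with eviction rather than a flat dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical stand-in for a stored prompt-cache snapshot."""
    tokens: tuple            # token prefix this snapshot covers
    restorable: bool = True  # whether the snapshot can be restored as-is


@dataclass
class ToyPrefixStore:
    """Toy checkpoint store keyed by token prefix (not the mlx-lm trie)."""
    checkpoints: dict = field(default_factory=dict)

    def add(self, tokens, restorable=True):
        key = tuple(tokens)
        self.checkpoints[key] = Checkpoint(key, restorable)

    def fetch(self, prompt):
        """Return (checkpoint, rest_tokens) for the longest safe prefix."""
        prompt = tuple(prompt)
        # Try candidates from longest to shortest so that an unrestorable
        # candidate falls through to the next shorter checkpoint.
        for n in range(len(prompt), 0, -1):
            cand = self.checkpoints.get(prompt[:n])
            if cand is not None and cand.restorable:
                return cand, list(prompt[n:])
        return None, list(prompt)  # nothing reusable: full cold prefill


store = ToyPrefixStore()
store.add([1, 2, 3])                           # safe checkpoint
store.add([1, 2, 3, 4, 5], restorable=False)   # longer, but cannot be restored
hit, rest = store.fetch([1, 2, 3, 4, 5, 6])
print(len(hit.tokens), rest)  # 3 [4, 5, 6]
```

The tokens returned in `rest` are the suffix that still has to be replayed through the model after the checkpoint is restored.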

Environment

OS: macOS 26.4.1 arm64
Hardware: MacBook Pro (MacBookPro18,2), Apple M1 Max, 32 GB memory
Python: local project virtual environment
Model used for real-weight benchmarks: mlx-community/Qwen3-0.6B-4bit
Model used for recurrent validation: mlx-community/Qwen3.5-9B-MLX-4bit
Latest PR head tested: 813917e
Long benchmark table collected on f4d3e41; the later 813917e commit only adds prompt-cache fetch timing fields and targeted tests.

Validation

python -m pytest tests/test_prompt_cache.py tests/test_prefix_cache_correctness.py tests/test_server.py -q
- 70 passed
python -m pytest tests/test_generate.py -q
- 24 passed
python -m unittest tests/test_prefix_cache_correctness.py tests/test_prompt_cache.py
- 50 passed after adding prompt-cache fetch timing fields
python -m py_compile benchmarks/prefix_cache_benchmark.py
- passed

Real-weight prefix-cache benchmarks

All benchmark rows use mlx-community/Qwen3-0.6B-4bit, prefill-only snapshots, and public model IDs. Local paths are omitted from this PR description.

The latest rerun disabled EOS stopping so that generation actually reaches the output cap. The all-in-one 100K cold+hot matrix completed cold prefill but then exhausted Metal memory while retaining multiple very large snapshots, so the 100K hot cases were rerun one per process. Each isolated 100K hot run completed the full 10K output cap.

Target prefix	Output cap	Case	Prompt tokens	Cached	Replayed	Generated	TTFT	Total	TTFT speedup	Total speedup
2K	1024	cold full prefill	2094	0	2094	1024	0.7284s	5.3377s	1.00x	1.00x
2K	1024	exact-prefix hot reuse	2094	2093	1	1024	0.1180s	4.6918s	6.17x	1.14x
2K	1024	nearest-prefix hot reuse	2094	2071	23	1024	0.1510s	4.7546s	4.83x	1.12x
20K	2048	cold full prefill	20076	0	20076	2048	14.4672s	42.1227s	1.00x	1.00x
20K	2048	exact-prefix hot reuse	20076	20075	1	2048	0.3528s	28.2190s	41.01x	1.49x
20K	2048	nearest-prefix hot reuse	20076	20053	23	2048	1.1719s	27.7468s	12.35x	1.52x
100K	10000	cold full prefill	100057	0	100057	10000	269.1506s	1033.9337s	1.00x	1.00x
100K	10000	exact-prefix hot reuse	100057	100056	1	10000	13.3140s	604.1526s	20.21x	1.71x
100K	10000	nearest-prefix hot reuse	100057	100034	23	10000	17.0624s	587.7549s	15.77x	1.76x
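The table columns relate to each other in a simple way, which the following sketch recomputes for the 20K exact-prefix row. It assumes that "Replayed" is prompt tokens minus cached tokens and that each speedup is the cold time divided by the hot time; the helper names are illustrative only. Recomputing from the rounded timings shown in the table may differ in the last digit for some rows, presumably because the reported speedups were derived from unrounded measurements.

```python
def replayed(prompt_tokens: int, cached: int) -> int:
    """Tokens that still need prefill after restoring the cached prefix."""
    return prompt_tokens - cached


def speedup(cold_s: float, hot_s: float) -> float:
    """Ratio of cold time to hot time (assumed definition of the speedup columns)."""
    return cold_s / hot_s


# 20K rows from the table above: cold full prefill vs exact-prefix hot reuse.
cold_ttft, hot_ttft = 14.4672, 0.3528
cold_total, hot_total = 42.1227, 28.2190

print(replayed(20076, 20075))                     # 1
print(f"{speedup(cold_ttft, hot_ttft):.2f}x")     # 41.01x
print(f"{speedup(cold_total, hot_total):.2f}x")   # 1.49x
```

The gap between the TTFT speedup and the total speedup reflects that decode time for the generated tokens is unaffected by prefix reuse and dominates the total.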

Example public command shape:

python benchmarks/prefix_cache_benchmark.py \
  --model mlx-community/Qwen3-0.6B-4bit \
  --target-prefix-tokens 100000 \
  --max-tokens 10000 \
  --prefill-step-size 1024

Qwen3.5 recurrent validation

A real-weight Qwen3.5 recurrent-cache check was run with mlx-community/Qwen3.5-9B-MLX-4bit.

Observed behavior:

Exact checkpoint lookup reused the full prompt: exact_cached_tokens=530, exact_rest_tokens=0.
Forked longer-hit lookup fell back to the shorter recurrent checkpoint: fork_cached_tokens=516, fork_expected_cached_tokens=516, and matched the cold forked prompt.
Prefix checkpoint + suffix replay used the intended LRU path: manual split replay and LRU replay produced identical generated tokens.
Qwen3.5 recurrent state is sensitive to the prefill chunk boundaries used to build the prefix checkpoint. A prefix snapshot built in 256-token chunks produced different later greedy tokens than a cold path that crossed the checkpoint boundary inside a larger prefill chunk. When the prefix snapshot and replay used the same prefix chunking boundary, the split and cold paths matched.

This is why the PR keeps ArraysCache checkpoint-only and does not implement a naive ArraysCache.trim() that clears recurrent state. Future recurrent tests should compare exact checkpoint/fork safety and boundary-aligned replay, and should not require bitwise-identical greedy output across different recurrent prefill chunk boundaries.
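To illustrate why trimming and checkpointing are not interchangeable for recurrent state, here is a toy comparison. It is a sketch under simplifying assumptions: `ToyKVCache` and `ToyRecurrentCache` are hypothetical classes, the scalar recurrence stands in for a real recurrent layer, and the example does not model the chunk-boundary sensitivity reported above.

```python
import copy


class ToyKVCache:
    """Per-token entries: dropping the tail leaves a valid shorter prefix."""

    def __init__(self):
        self.entries = []

    def update(self, tokens):
        self.entries.extend(tokens)

    def trim(self, n):
        n = min(n, len(self.entries))
        if n:
            del self.entries[-n:]
        return n


class ToyRecurrentCache:
    """Running summary: there are no per-token entries to drop."""

    def __init__(self):
        self.state = 0.0
        self.offset = 0

    def update(self, tokens):
        for t in tokens:
            self.state = 0.5 * self.state + t
            self.offset += 1

    def checkpoint(self):
        return copy.deepcopy(self)


# KV-style cache: trimming two tokens recovers the state for the prefix.
kv = ToyKVCache()
kv.update([1, 2, 3, 4, 5])
kv.trim(2)
assert kv.entries == [1, 2, 3]

# Recurrent cache: snapshot at the prefix boundary, then continue.
rec = ToyRecurrentCache()
rec.update([1, 2, 3])
snap = rec.checkpoint()
rec.update([4, 5])

# Fork from the checkpoint and replay a different suffix.
fork = copy.deepcopy(snap)
fork.update([9])

# Cold reference run over the forked prompt.
cold = ToyRecurrentCache()
cold.update([1, 2, 3, 9])
assert fork.state == cold.state and fork.offset == cold.offset

# A naive "trim" that clears the state loses the prefix contribution.
cleared = ToyRecurrentCache()
cleared.update([9])
assert cleared.state != cold.state
```

In this toy model, reuse is only correct when the fork starts from a snapshot taken exactly at the shared prefix boundary, which mirrors the checkpoint-only policy described above.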

Notes

This PR intentionally keeps recurrent ArraysCache restore conservative: it supports exact checkpoint reuse through the prompt-cache trie, but does not pretend arbitrary trim is safe for recurrent state. That leaves room for an explicit checkpoint/replay extension in a later recurrent-cache-specific change.

Fix prompt-cache serialization and server byte-limit wiring while adding conservative restore capability coverage for hybrid KV, rotating, recurrent, and chunked cache states.

Surface lookup, deepcopy, and restore timings from prompt-cache fetches so benchmarks can separate cache-management overhead from model prefill and decode time.
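One way to split fetch overhead into lookup, deepcopy, and restore phases is to bracket each phase with `time.perf_counter()`. The sketch below is illustrative only: `timed_fetch`, the dictionary-backed store, and the timing key names are hypothetical and are not taken from the PR's actual fields.

```python
import copy
import time


def timed_fetch(store, key, restore):
    """Fetch a snapshot and report per-phase wall-clock timings in seconds."""
    timings = {}

    t0 = time.perf_counter()
    snapshot = store.get(key)            # phase 1: find the stored snapshot
    t1 = time.perf_counter()
    timings["lookup_s"] = t1 - t0
    if snapshot is None:
        return None, timings             # miss: only the lookup was paid

    working = copy.deepcopy(snapshot)    # phase 2: copy so the stored entry stays intact
    t2 = time.perf_counter()
    timings["deepcopy_s"] = t2 - t1

    restored = restore(working)          # phase 3: adapt the copy to the request
    timings["restore_s"] = time.perf_counter() - t2
    return restored, timings


store = {"prefix": [1, 2, 3]}
restored, timings = timed_fetch(store, "prefix", lambda cache: cache[:2])
print(restored, sorted(timings))
```

Reporting the phases separately makes it possible to tell whether a slow hot path is caused by cache management or by the model's own prefill and decode work.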

xxxkkw added 6 commits May 27, 2026 19:43

Add hybrid prefix cache restore coverage

21a4cb9

Add prefix cache benchmark

9b0b4e5

Add prefix cache session helper

b3617df

Document single-path prompt checkpoint boundary

01b7c89

Log cache types that disable batching

f4d3e41

Report prefix cache fetch timings

813917e

Add hybrid prefix cache restore coverage #1314
xxxkkw wants to merge 6 commits into ml-explore:main from xxxkkw:hybrid-prefix-cache-contract
