Add hybrid prefix cache restore coverage #1314

Open
xxxkkw wants to merge 6 commits into ml-explore:main from xxxkkw:hybrid-prefix-cache-contract

@xxxkkw xxxkkw commented May 27, 2026

Summary

  • Add conservative cache restore capability primitives for prefix-cache reuse, including retained range/result metadata and recursive CacheList restore support.
  • Preserve safe boundaries for hybrid caches: ArraysCache remains checkpoint-only by default, saturated rotating caches only restore at safe retained boundaries, and LRUPromptCache falls back to shorter safe checkpoints when longer candidates cannot be restored.
  • Fix related cache/server gaps: BatchRotatingKVCache.rotated bool metadata round-trip, ChunkedKVCache offset metadata after front trimming, and server --prompt-cache-bytes wiring.
  • Add a reusable prefix-cache benchmark that reports cold, exact-prefix hot, nearest-prefix hot reuse, and prompt-cache fetch overhead split into lookup/deepcopy/restore timings.
  • Expose a small library-level PrefixCacheSession wrapper around LRUPromptCache for non-server callers.
  • Document the single-request prompt checkpoint boundary without fabricating segment checkpoints before prefill boundary callbacks exist.
  • Log cache types that disable server batching when their prompt-cache implementation does not implement merge().
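The fallback behavior described in the summary (use a shorter safe checkpoint when a longer candidate cannot be restored) can be illustrated with a minimal sketch. This is not the PR's implementation and not the LRUPromptCache API: `Checkpoint`, `restorable`, and `longest_safe_prefix` are hypothetical names, and a plain dictionary stands in for the prompt-cache trie.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Checkpoint:
    """Hypothetical snapshot covering the first `length` prompt tokens."""
    length: int
    restorable: bool  # stand-in for a real "is restore safe here?" check


def longest_safe_prefix(
    checkpoints: Dict[Tuple[int, ...], Checkpoint],
    prompt: Tuple[int, ...],
) -> Optional[Checkpoint]:
    """Return the longest stored prefix of `prompt` that can be restored."""
    # Walk from the longest candidate prefix down to the shortest.
    for end in range(len(prompt), 0, -1):
        candidate = checkpoints.get(prompt[:end])
        if candidate is not None and candidate.restorable:
            return candidate
    return None


store = {
    (1, 2, 3): Checkpoint(length=3, restorable=True),
    # Longer match exists but cannot be restored (e.g. an unsafe boundary).
    (1, 2, 3, 4, 5): Checkpoint(length=5, restorable=False),
}
prompt = (1, 2, 3, 4, 5, 6)
hit = longest_safe_prefix(store, prompt)
cached = hit.length if hit else 0
print(cached, len(prompt) - cached)  # cached tokens, tokens left to replay
```

In this toy case the 5-token candidate is skipped and the 3-token checkpoint is used, so three tokens are replayed instead of one. A real cache would presumably decide restorability from the cache type and retained range rather than from a stored flag.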

Environment

  • OS: macOS 26.4.1 arm64
  • Hardware: MacBook Pro (MacBookPro18,2), Apple M1 Max, 32 GB memory
  • Python: local project virtual environment
  • Model used for real-weight benchmarks: mlx-community/Qwen3-0.6B-4bit
  • Model used for recurrent validation: mlx-community/Qwen3.5-9B-MLX-4bit
  • Latest PR head tested: 813917e
  • Long benchmark table collected on f4d3e41; the later 813917e commit only adds prompt-cache fetch timing fields and targeted tests.

Validation

  • python -m pytest tests/test_prompt_cache.py tests/test_prefix_cache_correctness.py tests/test_server.py -q
    • 70 passed
  • python -m pytest tests/test_generate.py -q
    • 24 passed
  • python -m unittest tests/test_prefix_cache_correctness.py tests/test_prompt_cache.py
    • 50 passed after adding prompt-cache fetch timing fields
  • python -m py_compile benchmarks/prefix_cache_benchmark.py
    • passed

Real-weight prefix-cache benchmarks

All benchmark rows use mlx-community/Qwen3-0.6B-4bit, prefill-only snapshots, and public model IDs. Local paths are omitted from this PR description.

The latest rerun disabled EOS stopping so that each run actually reaches its output cap. The all-in-one 100K cold+hot matrix completed cold prefill but then exhausted Metal memory while retaining multiple very large snapshots, so the 100K hot cases were rerun one per process. Each isolated 100K hot run completed the full 10K output cap.

| Target prefix | Output cap | Case | Prompt tokens | Cached | Replayed | Generated | TTFT | Total | TTFT speedup | Total speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2K | 1024 | cold full prefill | 2094 | 0 | 2094 | 1024 | 0.7284s | 5.3377s | 1.00x | 1.00x |
| 2K | 1024 | exact-prefix hot reuse | 2094 | 2093 | 1 | 1024 | 0.1180s | 4.6918s | 6.17x | 1.14x |
| 2K | 1024 | nearest-prefix hot reuse | 2094 | 2071 | 23 | 1024 | 0.1510s | 4.7546s | 4.83x | 1.12x |
| 20K | 2048 | cold full prefill | 20076 | 0 | 20076 | 2048 | 14.4672s | 42.1227s | 1.00x | 1.00x |
| 20K | 2048 | exact-prefix hot reuse | 20076 | 20075 | 1 | 2048 | 0.3528s | 28.2190s | 41.01x | 1.49x |
| 20K | 2048 | nearest-prefix hot reuse | 20076 | 20053 | 23 | 2048 | 1.1719s | 27.7468s | 12.35x | 1.52x |
| 100K | 10000 | cold full prefill | 100057 | 0 | 100057 | 10000 | 269.1506s | 1033.9337s | 1.00x | 1.00x |
| 100K | 10000 | exact-prefix hot reuse | 100057 | 100056 | 1 | 10000 | 13.3140s | 604.1526s | 20.21x | 1.71x |
| 100K | 10000 | nearest-prefix hot reuse | 100057 | 100034 | 23 | 10000 | 17.0624s | 587.7549s | 15.77x | 1.76x |
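For readers checking the speedup columns, the short sketch below recomputes them for the 20K rows, assuming each speedup is the cold time divided by the corresponding hot time. That definition is an assumption inferred from the numbers, and recomputing from the rounded timings shown here may differ from the reported values in the last digit.

```python
# (case, TTFT seconds, total seconds) copied from the 20K rows above.
rows = [
    ("cold full prefill", 14.4672, 42.1227),
    ("exact-prefix hot reuse", 0.3528, 28.2190),
    ("nearest-prefix hot reuse", 1.1719, 27.7468),
]

# The cold row is the baseline for both ratios.
cold_ttft, cold_total = rows[0][1], rows[0][2]

speedups = {
    name: (cold_ttft / ttft, cold_total / total)
    for name, ttft, total in rows
}

for name, (ttft_speedup, total_speedup) in speedups.items():
    print(f"{name}: TTFT {ttft_speedup:.2f}x, total {total_speedup:.2f}x")
```

The gap between the TTFT and total speedups reflects that decode time for the generated tokens is unaffected by prefix reuse, so it dominates the total once prefill is skipped.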

Example public command shape:

python benchmarks/prefix_cache_benchmark.py \
  --model mlx-community/Qwen3-0.6B-4bit \
  --target-prefix-tokens 100000 \
  --max-tokens 10000 \
  --prefill-step-size 1024

Qwen3.5 recurrent validation

A real-weight Qwen3.5 recurrent-cache check was run with mlx-community/Qwen3.5-9B-MLX-4bit.

Observed behavior:

  • Exact checkpoint lookup reused the full prompt: exact_cached_tokens=530, exact_rest_tokens=0.
  • Forked longer-hit lookup fell back to the shorter recurrent checkpoint: fork_cached_tokens=516, fork_expected_cached_tokens=516, and matched the cold forked prompt.
  • Prefix checkpoint + suffix replay used the intended LRU path: manual split replay and LRU replay produced identical generated tokens.
  • Qwen3.5 recurrent state is sensitive to the prefill chunk boundaries used to build the prefix checkpoint. A prefix snapshot built in 256-token chunks produced different later greedy tokens than a cold path that crossed the checkpoint boundary inside a larger prefill chunk. When the prefix snapshot and replay used the same prefix chunking boundary, the split and cold paths matched.

This is why the PR keeps ArraysCache checkpoint-only and does not implement a naive ArraysCache.trim() that clears recurrent state. Future recurrent tests should compare exact checkpoint/fork safety and boundary-aligned replay, and should not require bitwise-identical greedy output across different recurrent prefill chunk boundaries.
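As a rough illustration of the checkpoint-only idea, the toy class below keeps a running recurrent value that each token overwrites, so it can only return to offsets where a snapshot was explicitly saved. It is a sketch under simplifying assumptions, not ArraysCache: `CheckpointOnlyState` and its methods are hypothetical names, and the scalar update rule is invented for the example.

```python
class CheckpointOnlyState:
    """Toy recurrent state that can be restored only at saved checkpoints."""

    def __init__(self):
        self.offset = 0
        self.state = 0.0
        self._checkpoints = {}  # offset -> saved state

    def update(self, tokens):
        for token in tokens:
            # Each step folds the token into the state and discards the
            # previous value, so earlier offsets cannot be recovered later.
            self.state = 0.5 * self.state + token
            self.offset += 1

    def checkpoint(self):
        """Save the state at the current offset."""
        self._checkpoints[self.offset] = self.state

    def restore(self, offset):
        """Return True and rewind only if `offset` was checkpointed."""
        if offset not in self._checkpoints:
            return False  # no naive trim: refuse instead of clearing state
        self.state = self._checkpoints[offset]
        self.offset = offset
        return True


cache = CheckpointOnlyState()
cache.update([1, 2, 3])
cache.checkpoint()            # snapshot at offset 3
cache.update([4, 5])
ok_exact = cache.restore(3)   # exact checkpoint: allowed
ok_other = cache.restore(2)   # arbitrary offset: refused
print(ok_exact, ok_other, cache.offset, cache.state)
```

Refusing the restore lets a caller fall back to a shorter checkpoint or a cold prefill, which matches the conservative stance described above. The toy does not model the chunk-boundary sensitivity observed with real weights.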

Notes

This PR intentionally keeps recurrent ArraysCache restore conservative: it supports exact checkpoint reuse through the prompt-cache trie, but does not pretend arbitrary trim is safe for recurrent state. That leaves room for an explicit checkpoint/replay extension in a later recurrent-cache-specific change.

xxxkkw added 6 commits May 27, 2026 19:43
Fix prompt-cache serialization and server byte-limit wiring while adding conservative restore capability coverage for hybrid KV, rotating, recurrent, and chunked cache states.
Surface lookup, deepcopy, and restore timings from prompt-cache fetches so benchmarks can separate cache-management overhead from model prefill and decode time.