Add hybrid prefix cache restore coverage #1314
Open
xxxkkw wants to merge 6 commits into
Fix prompt-cache serialization and server byte-limit wiring while adding conservative restore capability coverage for hybrid KV, rotating, recurrent, and chunked cache states.
Surface lookup, deepcopy, and restore timings from prompt-cache fetches so benchmarks can separate cache-management overhead from model prefill and decode time.
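As an illustration of how the three timings might be separated, here is a minimal sketch. It does not use the mlx-lm API: `timed_fetch`, the dict-based `store`, the `restore` callback, and the `*_s` field names are all hypothetical stand-ins chosen for this example.

```python
import copy
import time


def timed_fetch(store, key, restore):
    """Fetch a snapshot and report per-phase wall-clock timings in seconds.

    `store` is a plain dict and `restore` is any callable; both are
    hypothetical stand-ins, not the real prompt-cache types.
    """
    timings = {}

    # Lookup phase: find the stored snapshot for this key.
    t0 = time.perf_counter()
    snapshot = store.get(key)
    timings["lookup_s"] = time.perf_counter() - t0
    if snapshot is None:
        return None, timings

    # Deepcopy phase: the caller gets a private copy so that generation
    # cannot mutate the stored snapshot.
    t0 = time.perf_counter()
    private = copy.deepcopy(snapshot)
    timings["deepcopy_s"] = time.perf_counter() - t0

    # Restore phase: turn the copied snapshot back into a usable cache.
    t0 = time.perf_counter()
    restored = restore(private)
    timings["restore_s"] = time.perf_counter() - t0

    return restored, timings


store = {(1, 2, 3): {"offset": 3, "keys": [0.1, 0.2, 0.3]}}
cache, timings = timed_fetch(store, (1, 2, 3), restore=lambda s: s)
```

A benchmark could then subtract the sum of these fields from end-to-end latency to estimate the time spent in model prefill and decode.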
Summary
- `CacheList` restore support.
- `ArraysCache` remains checkpoint-only by default, saturated rotating caches only restore at safe retained boundaries, and `LRUPromptCache` falls back to shorter safe checkpoints when longer candidates cannot be restored.
- `BatchRotatingKVCache.rotated` bool metadata round-trip, `ChunkedKVCache` offset metadata after front trimming, and server `--prompt-cache-bytes` wiring.
- `PrefixCacheSession` wrapper around `LRUPromptCache` for non-server callers.
- `merge()`.

Environment
- `mlx-community/Qwen3-0.6B-4bit`
- `mlx-community/Qwen3.5-9B-MLX-4bit`
- `813917e` `f4d3e41`; the later `813917e` commit only adds prompt-cache fetch timing fields and targeted tests.

Validation
- `python -m pytest tests/test_prompt_cache.py tests/test_prefix_cache_correctness.py tests/test_server.py -q`
- `python -m pytest tests/test_generate.py -q`
- `python -m unittest tests/test_prefix_cache_correctness.py tests/test_prompt_cache.py`
- `python -m py_compile benchmarks/prefix_cache_benchmark.py`

Real-weight prefix-cache benchmarks
All benchmark rows use `mlx-community/Qwen3-0.6B-4bit`, prefill-only snapshots, and public model IDs. Local paths are omitted from this PR description.

The latest rerun disabled EOS stopping so the output cap is actually reached. The 100K hot cases were run one per process after the all-in-one 100K cold+hot matrix completed cold prefill and then exhausted Metal memory while retaining multiple very large snapshots. Each isolated 100K hot run completed the full 10K output cap.
Example public command shape:
Qwen3.5 recurrent validation
A real-weight Qwen3.5 recurrent-cache check was run with `mlx-community/Qwen3.5-9B-MLX-4bit`.

Observed behavior:
- `exact_cached_tokens=530`, `exact_rest_tokens=0`.
- `fork_cached_tokens=516`, `fork_expected_cached_tokens=516`, and matched the cold forked prompt.

This is why the PR keeps `ArraysCache` checkpoint-only and does not implement a naive `ArraysCache.trim()` that clears recurrent state. Future recurrent tests should compare exact checkpoint/fork safety and boundary-aligned replay, and should not require bitwise-identical greedy output across different recurrent prefill chunk boundaries.

Notes
This PR intentionally keeps recurrent `ArraysCache` restore conservative: it supports exact checkpoint reuse through the prompt-cache trie, but does not pretend arbitrary trim is safe for recurrent state. That leaves room for an explicit checkpoint/replay extension in a later recurrent-cache-specific change.
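
The checkpoint-only policy, together with the fallback to shorter safe checkpoints, can be sketched roughly as follows. This is an illustrative model only, not the `LRUPromptCache` implementation: a dict keyed by token tuples stands in for the trie, and `find_restorable_prefix`, `can_restore`, and the `safe` flag are hypothetical names introduced for this example.

```python
def find_restorable_prefix(tokens, checkpoints, can_restore):
    """Return (cached_length, snapshot) for the longest safe checkpoint.

    Only prefixes stored as exact checkpoints are candidates; nothing is
    trimmed to fit. Candidates that `can_restore` rejects are skipped, so
    the search falls back to shorter checkpoints.
    """
    for end in range(len(tokens), 0, -1):
        snapshot = checkpoints.get(tuple(tokens[:end]))
        if snapshot is not None and can_restore(snapshot):
            return end, snapshot
    return 0, None  # no usable checkpoint: prefill from scratch


# Hypothetical snapshots: the longer one is marked unsafe to restore.
checkpoints = {
    (1, 2): {"safe": True},
    (1, 2, 3, 4): {"safe": False},
}
prompt = [1, 2, 3, 4, 5]

cached, snapshot = find_restorable_prefix(
    prompt, checkpoints, can_restore=lambda s: s["safe"]
)
rest = prompt[cached:]  # tokens that still need prefill
print(cached, rest)     # prints: 2 [3, 4, 5]
```

In this model a recurrent snapshot is usable only at the exact token boundary where it was taken, which is why the lookup never shortens a longer snapshot to match the prompt.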