feat(kvstore): support mamba l2 cache transfers #162
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
}
return Draining{BuildWriteBackPairs(write_diff), std::move(device_node_ref), std::move(host_node_ref)};
if (need_mamba_writeback) {
Currently, we only call write_back when a request finishes or is retracted, and we only write back the last Mamba slot of that request. Should we also drain Mamba slots during the prefill stage to improve the cache hit rate?
}
hybrid_prefix_cache_->AttachHostMamba(terminal, std::move(host_slot), terminal->MambaDeviceDepthTokens());
pages_to_transfer.push_back(
    {CacheTransferKind::Mamba, terminal->MambaSlotIndex(), terminal->MambaHostSlotIndex()});
Will the mamba_host_slot associated with a retracted request be protected from eviction?
def write(self, executor, op: _TransferOp, prepared) -> None:
    executor._copy_mamba_slots(
        executor.mamba_pool,
QQ: in KVTransferBackend there are draft_pool write/load operations. Will Mamba be possible with a draft pool?
The current draft version of Qwen3.5 does not include Mamba layers; however, I cannot confirm whether a Mamba pool will be introduced in the future or for other models.
auto slot = hybrid_prefix_cache_->AllocateDeviceMamba();
if (slot == nullptr) return {};
hybrid_prefix_cache_->LoadBackMamba(node, std::move(slot));
match_result.mamba_cow_src_index = node->MambaSlotIndex();
mamba_loadback_diff is a vector, and for now its length is always 1, but overwriting mamba_cow_src_index inside the for-loop seems odd. Should it be lifted to a vector too?
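To illustrate the concern, here is a minimal Python sketch (not the PR's C++ code). `MatchResult`, `load_back`, and the list-valued field `mamba_cow_src_indices` are hypothetical names used only for this example; only `mamba_cow_src_index` comes from the diff above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MatchResult:
    # Scalar field, as in the diff: each loop iteration replaces the value.
    mamba_cow_src_index: Optional[int] = None
    # Hypothetical list-valued alternative: one entry per loaded-back node.
    mamba_cow_src_indices: List[int] = field(default_factory=list)


def load_back(slot_indices):
    """Walk the load-back list and record the copy-on-write source slots."""
    result = MatchResult()
    for slot_index in slot_indices:
        result.mamba_cow_src_index = slot_index          # overwritten each time
        result.mamba_cow_src_indices.append(slot_index)  # accumulated
    return result


single = load_back([7])    # today's case: the list has exactly one entry
multi = load_back([7, 9])  # a longer list keeps only the last scalar value
```

With one entry both fields agree; with more than one, the scalar silently keeps only the last slot, which is the behavior the comment questions.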
conv_dtype=self.conv_dtype,
ssm_dtype=self.ssm_dtype,
mamba_layer_ids=self.mamba_layer_ids,
device="cpu",
Should the memory of the Mamba host pool be pinned?
LGTM overall. Since #146 was just merged, there are conflicts in some files; please resolve them so this can be merged 🚀🙌🏻
Summary
This PR adds L2 cache support for Mamba cache alongside the existing KV cache L2 path.
The core change is to generalize cache movement into CacheTransferUnit, which carries the cache kind (KV or Mamba) plus source and destination slots/pages. WriteBackOperation and LoadBackOperation can now carry mixed KV/Mamba transfer units without adding Mamba-specific scheduler branches everywhere.
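The shape of a transfer unit described above can be sketched roughly as follows. This is an illustrative Python model, not the PR's C++ definition: the field names `src` and `dst` and the helper `group_by_kind` are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class CacheTransferKind(Enum):
    KV = "kv"
    MAMBA = "mamba"


@dataclass(frozen=True)
class CacheTransferUnit:
    kind: CacheTransferKind
    src: int  # source page (KV) or slot (Mamba) index; field name assumed
    dst: int  # destination page or slot index; field name assumed


def group_by_kind(units):
    """Bucket a mixed list of units so each cache kind can be copied in one batch."""
    grouped = defaultdict(list)
    for unit in units:
        grouped[unit.kind].append((unit.src, unit.dst))
    return dict(grouped)


# A write-back operation carrying both KV pages and one Mamba slot.
write_back = [
    CacheTransferUnit(CacheTransferKind.KV, 10, 3),
    CacheTransferUnit(CacheTransferKind.KV, 11, 4),
    CacheTransferUnit(CacheTransferKind.MAMBA, 2, 0),
]
grouped = group_by_kind(write_back)
```

Because the kind travels with each unit, the caller can pass one mixed list and dispatch per kind at the copy site instead of branching on Mamba earlier in the scheduler.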
Key behavior:
Important design points:
Test Plan
CUDA_VISIBLE_DEVICES=4,5,6,7 \
ts serve Qwen/Qwen3.5-122B-A10B \
  --tp 4 \
  --max-num-seqs 16 \
  --max-total-tokens 160000 \
  --max-model-len 80000 \
  --chunked-prefill-size 128 \
  --max-prefill-tokens 128 \
  --block-size 64 \
  --max-mamba-cache-size 64 \
  --mamba-track-interval 64 \
  --kvstore-ratio 1.0 \
  --kvstore-io-backend kernel \
  --disable-prefill-graph \
  --enforce-eager \
  --disable-overlap-schedule \
  --enable-cache-report \
  --host 127.0.0.1 \
  --port 8000

evalscope eval \
  --model Qwen/Qwen3.5-122B-A10B \
  --api-url http://127.0.0.1:8000/v1 \
  --api-key EMPTY_TOKEN \
  --datasets aime25 \
  --eval-batch-size 16 \
  --generation-config '{"max_tokens":100000}'

Mamba loadback/writeback can be seen. The result: