Problem
MaruStorage uses a fixed chunk_size_bytes (default 1 MB) to allocate pages, but the actual per-token KV data stored per page is much smaller. For example, with Llama-3.1-8B (non-MLA, 32 layers, 8 KV heads, 128 head_dim, bf16):
- K per token (all layers): 64 KB
- V per token (all layers): 64 KB
- K+V concatenated per chunk: 128 KB
- chunk_size_bytes: 1 MB
This results in 87.5% internal fragmentation — each 1 MB page stores only 128 KB of useful data.
With a 100 GB pool (chunk_size=1MB), only 102,400 pages are available. Since each token consumes one page (page_size=1), the pool is exhausted after ~102K tokens (about 6 requests of 16K tokens), even though the actual KV data occupies only ~12.5 GB (102,400 × 128 KB).
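The arithmetic above can be reproduced with a short script. This is only a sketch: the model geometry is taken from the figures in this issue, the constant names are illustrative, and binary units (1 KB = 1024 bytes) are assumed throughout.

```python
# KV geometry for Llama-3.1-8B as stated in this issue (non-MLA, bf16).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
DTYPE_BYTES = 2  # bf16

# Binary units are an assumption of this sketch.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Bytes of K (or V) for one token across all layers.
k_bytes_per_token = NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES
# K and V have the same size, and both are stored in one chunk.
kv_bytes_per_token = 2 * k_bytes_per_token

chunk_size_bytes = 1 * MIB   # current fixed default
pool_size_bytes = 100 * GIB  # maru_pool_size

# Share of each page that holds no KV data.
fragmentation = 1 - kv_bytes_per_token / chunk_size_bytes
# Pages available with the fixed chunk size (one token per page at page_size=1).
num_pages = pool_size_bytes // chunk_size_bytes
# Bytes of real KV data when every page is in use.
useful_bytes = num_pages * kv_bytes_per_token
# Pages available if the chunk size matched the per-token K+V size.
pages_if_right_sized = pool_size_bytes // kv_bytes_per_token

print(f"K per token:      {k_bytes_per_token // KIB} KB")
print(f"K+V per token:    {kv_bytes_per_token // KIB} KB")
print(f"fragmentation:    {fragmentation:.1%}")
print(f"pages (1 MB):     {num_pages:,}")
print(f"useful KV data:   {useful_bytes / GIB:.1f} GB")
print(f"pages (128 KB):   {pages_if_right_sized:,}")
```

With a chunk size equal to the per-token K+V size, the same 100 GB pool would hold 819,200 pages, eight times as many tokens as with the 1 MB default.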
Observed log
[2026-03-27 09:09:07,397] maru WARNING: Pool exhausted: no free pages available
[2026-03-27 09:09:08,309] maru INFO: Added owned region 1063: pages=102400, chunk_size=1048576
[2026-03-27 09:09:08,309] maru INFO: Expanded: new store region 1063 (pool_id=4294967295)
[2026-03-27 09:09:08,309] maru WARNING: Pool exhausted: no free pages available
GPU token usage is only 3% at this point.
Proposed solution
sglang's mem_pool_host (HostKVCache) already exposes the necessary APIs:
mem_pool_host.get_size_per_token() # total KV bytes per token
mem_pool_host.get_ksize_per_token() # K-only bytes per token (page_first layouts)
mem_pool_host.page_size # tokens per page
Other sglang storage backends (e.g. HF3FS) already derive bytes_per_page dynamically in backend_factory.py:
if layout in ["page_first", "page_first_direct"]:
    bytes_per_page = mem_pool_host.get_ksize_per_token() * mem_pool_host.page_size
MaruStorage should auto-calculate chunk_size_bytes in register_mem_pool_host() instead of using a fixed default. For non-MLA models, K and V are concatenated per key in batch_set_v1, so the effective chunk size should be get_size_per_token() * page_size (K+V combined).
The fixed chunk_size_bytes config can remain as an optional override, but the default should be dynamically derived.
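A minimal sketch of the proposed derivation, assuming the non-MLA case described above. `FakeHostKVCache` and `derive_chunk_size_bytes` are hypothetical names used only for illustration and are not part of sglang or MaruStorage. The stand-in class mimics just the three `mem_pool_host` members listed earlier, and rejecting an override that is too small is an assumption of this sketch, not existing behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeHostKVCache:
    """Hypothetical stand-in for sglang's HostKVCache (illustration only)."""

    size_per_token: int   # total K+V bytes per token
    ksize_per_token: int  # K-only bytes per token
    page_size: int        # tokens per page

    def get_size_per_token(self) -> int:
        return self.size_per_token

    def get_ksize_per_token(self) -> int:
        return self.ksize_per_token


def derive_chunk_size_bytes(mem_pool_host, override: Optional[int] = None) -> int:
    """Return the chunk size for a non-MLA model.

    K and V are stored together under one key, so a chunk must hold
    get_size_per_token() * page_size bytes. An explicit override wins,
    but (as an assumption of this sketch) it must be large enough.
    """
    required = mem_pool_host.get_size_per_token() * mem_pool_host.page_size
    if override is None:
        return required
    if override < required:
        raise ValueError(
            f"chunk_size_bytes={override} is smaller than the {required} "
            "bytes needed per page"
        )
    return override


# Llama-3.1-8B figures from this issue: 64 KB of K, 128 KB of K+V, page_size=1.
host = FakeHostKVCache(size_per_token=128 * 1024, ksize_per_token=64 * 1024, page_size=1)

print(derive_chunk_size_bytes(host))                    # 131072
print(derive_chunk_size_bytes(host, override=1 << 20))  # 1048576
```

In an actual change this logic would presumably live in register_mem_pool_host(), with the configured chunk_size_bytes passed as the override only when the user sets it explicitly.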
Environment
- Model: meta-llama/Llama-3.1-8B-Instruct
- sglang with HiCache (page_first_direct layout, page_size=1)
- maru_pool_size: 100G, chunk_size_bytes: 1MB (default)