maru_sglang: chunk_size should be dynamically derived from per-token KV size #34

@kihwan-XCENA

Problem

MaruStorage uses a fixed chunk_size_bytes (default 1 MB) to allocate pages, but the actual per-token KV data stored per page is much smaller. For example, with Llama-3.1-8B (non-MLA, 32 layers, 8 KV heads, 128 head_dim, bf16):

  • K per token (all layers): 64 KB
  • V per token (all layers): 64 KB
  • K+V concatenated per chunk: 128 KB
  • chunk_size_bytes: 1 MB

This results in 87.5% internal fragmentation — each 1 MB page stores only 128 KB of useful data.
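
The figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `kv_bytes_per_token` is illustrative and not part of sglang or maru, and 1 KB is taken as 1024 bytes.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of K plus V for one token across all layers (non-MLA)."""
    k_bytes = num_layers * num_kv_heads * head_dim * dtype_bytes
    return 2 * k_bytes  # K and V have the same shape


# Llama-3.1-8B values quoted in this issue:
# 32 layers, 8 KV heads, head_dim 128, bf16 (2 bytes per element).
per_token = kv_bytes_per_token(32, 8, 128, 2)

chunk_size_bytes = 1 << 20  # the 1 MB default
page_size = 1               # tokens per page, as in the Environment section

used = per_token * page_size
wasted_fraction = 1 - used / chunk_size_bytes

print(per_token // 1024, "KB per token")  # 128 KB per token
print(f"{wasted_fraction:.1%} wasted")     # 87.5% wasted
```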

With a 100 GB pool and chunk_size_bytes = 1 MB, only 102,400 pages are available. Because page_size=1 means each token consumes one page, the pool is exhausted after ~102K tokens (about 6 requests of 16K tokens), even though the actual KV data occupies only ~12.5 GB (102,400 × 128 KB).
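
As a rough check of the pool capacity, the sketch below assumes binary units (consistent with pages=102400 and chunk_size=1048576 in the log further down) and takes "16K tokens" to mean 16,384. The variable names are illustrative only.

```python
GIB = 1 << 30
MIB = 1 << 20

pool_bytes = 100 * GIB
chunk_size_bytes = 1 * MIB
kv_bytes_per_page = 128 * 1024  # K+V per token, with page_size=1

# Pages available with the fixed 1 MB chunk size.
num_pages = pool_bytes // chunk_size_bytes
# Bytes of real KV data the pool holds when every page is in use.
useful_bytes = num_pages * kv_bytes_per_page
# How many 16K-token requests fit before the pool is exhausted.
requests_16k = num_pages / (16 * 1024)
# Pages the same pool would hold if chunks were sized to the KV data.
num_pages_tight = pool_bytes // kv_bytes_per_page

print(num_pages)            # 102400
print(useful_bytes / GIB)   # 12.5
print(requests_16k)         # 6.25
print(num_pages_tight)      # 819200
```

Under these assumptions, sizing the chunk to the KV data would give 8x as many pages from the same 100 GB pool.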

Observed log

[2026-03-27 09:09:07,397] maru WARNING: Pool exhausted: no free pages available
[2026-03-27 09:09:08,309] maru INFO: Added owned region 1063: pages=102400, chunk_size=1048576
[2026-03-27 09:09:08,309] maru INFO: Expanded: new store region 1063 (pool_id=4294967295)
[2026-03-27 09:09:08,309] maru WARNING: Pool exhausted: no free pages available

GPU token usage is only 3% at this point.

Proposed solution

sglang's mem_pool_host (HostKVCache) already exposes the necessary APIs:

mem_pool_host.get_size_per_token()    # total KV bytes per token
mem_pool_host.get_ksize_per_token()   # K-only bytes per token (page_first layouts)
mem_pool_host.page_size               # tokens per page

Other sglang storage backends (e.g. HF3FS) already derive bytes_per_page dynamically in backend_factory.py:

if layout in ["page_first", "page_first_direct"]:
    bytes_per_page = mem_pool_host.get_ksize_per_token() * mem_pool_host.page_size

MaruStorage should auto-calculate chunk_size_bytes in register_mem_pool_host() instead of using a fixed default. For non-MLA models, K and V are concatenated per key in batch_set_v1, so the effective chunk size should be get_size_per_token() * page_size (K+V combined).

The fixed chunk_size_bytes config can remain as an optional override, but the default should be dynamically derived.
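
A minimal sketch of the derivation for the non-MLA case is shown below. It relies only on the three HostKVCache members quoted above. Everything else is hypothetical: the function `derive_chunk_size_bytes`, the `FakeHostPool` stand-in, and the choice to reject an override smaller than one page are illustrations, not existing maru or sglang code.

```python
from typing import Optional, Protocol


class HostPoolLike(Protocol):
    """Subset of the HostKVCache interface quoted in this issue."""
    page_size: int

    def get_size_per_token(self) -> int: ...


def derive_chunk_size_bytes(mem_pool_host: HostPoolLike,
                            override: Optional[int] = None) -> int:
    """Return the chunk size for a non-MLA model (K and V stored together)."""
    required = mem_pool_host.get_size_per_token() * mem_pool_host.page_size
    if override is None:
        return required
    # Assumption: an explicit override must still fit one full page of K+V.
    if override < required:
        raise ValueError(
            f"chunk_size_bytes={override} is smaller than one page ({required} bytes)"
        )
    return override


class FakeHostPool:
    """Stand-in for sglang's HostKVCache, for demonstration only."""

    def __init__(self, k_bytes_per_token: int, page_size: int) -> None:
        self._k = k_bytes_per_token
        self.page_size = page_size

    def get_ksize_per_token(self) -> int:
        return self._k

    def get_size_per_token(self) -> int:
        return 2 * self._k  # K + V for a non-MLA model


pool = FakeHostPool(k_bytes_per_token=64 * 1024, page_size=1)
print(derive_chunk_size_bytes(pool))                    # 131072
print(derive_chunk_size_bytes(pool, override=1 << 20))  # 1048576
```

In a real integration this logic would presumably run inside register_mem_pool_host(), with the configured chunk_size_bytes passed as the override only when the user set it explicitly.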

Environment

  • Model: meta-llama/Llama-3.1-8B-Instruct
  • sglang with HiCache (page_first_direct layout, page_size=1)
  • maru_pool_size: 100G, chunk_size_bytes: 1MB (default)

Metadata

Labels: enhancement
