maru_sglang: chunk_size should be dynamically derived from per-token KV size

## Problem

MaruStorage uses a fixed `chunk_size_bytes` (default 1 MB) to allocate pages, but the KV data actually stored in each page is much smaller. For example, with Llama-3.1-8B (non-MLA, 32 layers, 8 KV heads, 128 head_dim, bf16):

- K per token (all layers): 64 KB
- V per token (all layers): 64 KB
- K+V concatenated per chunk: **128 KB**
- chunk_size_bytes: **1 MB**

This results in **87.5% internal fragmentation** — each 1 MB page stores only 128 KB of useful data.
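
The figures above follow from `layers × kv_heads × head_dim × dtype_bytes` per tensor. The minimal sketch below reproduces them. It assumes a standard non-MLA layout where K and V have the same shape and uses binary units (1 KB = 1024 bytes). The helper names are hypothetical and not part of Maru or sglang.

```python
KIB = 1024
MIB = 1024 * KIB


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> tuple[int, int]:
    """Hypothetical helper: (K bytes, V bytes) per token across all layers,
    assuming a non-MLA model where K and V have identical shapes."""
    per_tensor = num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_tensor, per_tensor


def fragmentation(used_bytes: int, chunk_bytes: int) -> float:
    """Fraction of a chunk that is left unused."""
    return 1.0 - used_bytes / chunk_bytes


# Llama-3.1-8B: 32 layers, 8 KV heads, head_dim 128, bf16 (2 bytes)
k_bytes, v_bytes = kv_bytes_per_token(32, 8, 128, 2)
kv_bytes = k_bytes + v_bytes
print(k_bytes // KIB, v_bytes // KIB, kv_bytes // KIB)  # 64 64 128
print(f"{fragmentation(kv_bytes, 1 * MIB):.1%}")        # 87.5%
```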

With a 100 GB pool (`chunk_size=1MB`), only 102,400 pages are available. With `page_size=1`, each token consumes one page, so the pool is exhausted after ~102K tokens (~6 requests of 16K tokens), even though the KV data actually stored occupies only ~12.5 GB (one eighth of the pool).
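
The capacity arithmetic can be checked with a short sketch comparing the fixed 1 MB chunk against a chunk sized to one token's K+V data. It assumes binary units (100 GB = 100 × 2^30 bytes), `page_size=1`, and that "16K tokens" means 16,384 tokens. `capacity()` is a hypothetical helper for illustration only.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

POOL_BYTES = 100 * GIB
KV_BYTES_PER_TOKEN = 128 * KIB   # K+V, all layers (Llama-3.1-8B, bf16)
TOKENS_PER_REQUEST = 16 * 1024   # assumption: "16K" = 16,384 tokens


def capacity(chunk_bytes: int, tokens_per_page: int = 1) -> dict:
    """Hypothetical helper: how many pages/tokens/requests fit in the pool."""
    pages = POOL_BYTES // chunk_bytes
    tokens = pages * tokens_per_page
    return {
        "pages": pages,
        "tokens": tokens,
        "requests": tokens / TOKENS_PER_REQUEST,
        "useful_gib": tokens * KV_BYTES_PER_TOKEN / GIB,
    }


fixed = capacity(1 * MIB)               # current default
dynamic = capacity(KV_BYTES_PER_TOKEN)  # chunk = size_per_token * page_size
print(fixed)    # {'pages': 102400, 'tokens': 102400, 'requests': 6.25, 'useful_gib': 12.5}
print(dynamic)  # {'pages': 819200, 'tokens': 819200, 'requests': 50.0, 'useful_gib': 100.0}
```

Under these assumptions, sizing the chunk to the real per-token footprint gives 8× more pages from the same pool.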

### Observed log

```
[2026-03-27 09:09:07,397] maru WARNING: Pool exhausted: no free pages available
[2026-03-27 09:09:08,309] maru INFO: Added owned region 1063: pages=102400, chunk_size=1048576
[2026-03-27 09:09:08,309] maru INFO: Expanded: new store region 1063 (pool_id=4294967295)
[2026-03-27 09:09:08,309] maru WARNING: Pool exhausted: no free pages available
```

GPU token usage is only 3% at this point.

## Proposed solution

sglang's `mem_pool_host` (HostKVCache) already exposes the necessary APIs:

```python
mem_pool_host.get_size_per_token()    # total KV bytes per token
mem_pool_host.get_ksize_per_token()   # K-only bytes per token (page_first layouts)
mem_pool_host.page_size               # tokens per page
```

Other sglang storage backends (e.g. HF3FS) already derive `bytes_per_page` dynamically in `backend_factory.py`:

```python
if layout in ["page_first", "page_first_direct"]:
    bytes_per_page = mem_pool_host.get_ksize_per_token() * mem_pool_host.page_size
```

MaruStorage should auto-calculate `chunk_size_bytes` in `register_mem_pool_host()` instead of using a fixed default. For non-MLA models, K and V are concatenated per key in `batch_set_v1`, so the effective chunk size should be `get_size_per_token() * page_size` (K+V combined).

The fixed `chunk_size_bytes` config can remain as an optional override, but the default should be dynamically derived.
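
A minimal sketch of the proposed behaviour follows. `MaruStorageSketch`, `FakeHostPool`, and the override-validation rule are all hypothetical: this is not the real MaruStorage code. `FakeHostPool` only mimics the two `mem_pool_host` members quoted above, so the example runs without sglang. Rejecting an override smaller than one key's K+V data is an assumed design choice, since such a chunk could not hold a concatenated key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeHostPool:
    """Hypothetical stand-in for sglang's HostKVCache (test double only)."""
    size_per_token: int  # K+V bytes per token, all layers
    page_size: int       # tokens per page

    def get_size_per_token(self) -> int:
        return self.size_per_token


class MaruStorageSketch:
    """Hypothetical sketch, not the actual MaruStorage implementation."""

    def __init__(self, chunk_size_bytes: Optional[int] = None):
        # None -> derive from the host pool; an int -> explicit override.
        self._override = chunk_size_bytes
        self.chunk_size_bytes: Optional[int] = chunk_size_bytes

    def register_mem_pool_host(self, mem_pool_host) -> None:
        self.mem_pool_host = mem_pool_host
        # Non-MLA: K and V are concatenated per key in batch_set_v1, so a
        # chunk must hold the full K+V size, not the K-only size that the
        # HF3FS page_first path derives from get_ksize_per_token().
        derived = mem_pool_host.get_size_per_token() * mem_pool_host.page_size
        if self._override is None:
            self.chunk_size_bytes = derived
        elif self._override < derived:
            # Assumption: an undersized override is a configuration error.
            raise ValueError(
                f"chunk_size_bytes={self._override} < required {derived}"
            )


pool = FakeHostPool(size_per_token=128 * 1024, page_size=1)
storage = MaruStorageSketch()
storage.register_mem_pool_host(pool)
print(storage.chunk_size_bytes)  # 131072
```

Keeping `None` as the default makes the derived size the normal path, while an explicit value still behaves like today's fixed setting.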

## Environment

- Model: `meta-llama/Llama-3.1-8B-Instruct`
- sglang with HiCache (`page_first_direct` layout, `page_size=1`)
- maru_pool_size: 100G, chunk_size_bytes: 1MB (default)
