Support SHM-Backed L1 Memory Pool for Zero-Copy CPU Store/Retrieve Path #244

@hlin99

Label
Please label your issue with "new feature" and any other relevant labels so that it can be easily categorized under LMCache Onboarding

Is your feature request related to a problem? Please describe.
Currently, the LMCache L1 memory pool uses anonymous mmap, which makes the L1 memory invisible to worker processes. For CPU store/retrieve operations, data must be serialized (pickle) and transported over ZMQ between the worker and server, then deserialized and copied into the L1 slab. For very large prompts (e.g., 128k tokens × 70 KB/token ≈ 8.75 GiB), this results in multiple memory copies and serialization overhead, adding significant latency (~440 ms per request on DDR5-class systems).

The previous ring buffer optimization attempted to reduce copies but still required one copy from ring buffer into L1 slab, and the ring buffer itself consumed a large pre-allocated memory region — effectively doubling the memory reservation without eliminating all copies.

Describe the solution you'd like
Replace the L1 memory pool's anonymous mmap with a named shared memory segment (shm_open + mmap(MAP_SHARED)), making L1 slabs directly accessible to worker processes. Workers attach to the same shm pool once at startup, and receive offset/shape/dtype tuples for target memory slots at store/retrieve time.

  • For store: Workers gather KV data to CPU, request an L1 slot offset from the server, memcpy directly into the L1 shm slot, then notify the server to commit.
  • For retrieve: Workers request slot offsets for keys, construct a zero-copy tensor view backed by the shm offset, copy into their local paged KV cache, then notify the server to release the read lock.

This is the only CPU bounce path. The previous pickle-over-ZMQ and ring buffer paths are replaced entirely. Since both anonymous mmap and named shm are backed by the same kernel mechanism (mmap), the slab allocator, eviction controller, and lock management require zero changes.

Describe alternatives you've considered

  • Ring buffer zero-copy: Still requires one copy from ring buffer into L1 slab, and the ring buffer itself requires pre-allocating a large memory region — doubling memory usage without eliminating all copies.
  • Pickle-over-ZMQ (current path): Multiple serialization/deserialization steps plus memory copies. No longer needed since workers can directly access L1 via shm.
  • Keeping pickle as a fallback: Since both paths share the same L1 pool capacity and the same OOM behavior (silently skip keys, same as CUDA path), a pickle fallback provides no additional capability — it only adds code complexity.
  • GPU pinned memory: Not required for this feature; could be a future extension.

Additional context

Core Design: Why mmap and shm Are Fundamentally the Same

Both are backed by mmap under the hood; the only difference is visibility:

| | mmap(MAP_ANONYMOUS) | shm_open + mmap(MAP_SHARED) |
|---|---|---|
| Kernel operation | Maps anonymous pages into process virtual address space | Creates a file on /dev/shm (tmpfs) → maps into virtual address space |
| Physical page allocation | Lazy, allocated on first write | Lazy, allocated on first write |
| Cross-process visibility | ❌ Invisible to worker processes | ✅ Multiple processes can map the same named segment |

From the slab allocator's perspective, it sees a (base_ptr, size) contiguous address space. Slab allocation logic, eviction logic, and lock management — none of them need any changes.
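
To make the "allocator only sees a buffer" point concrete, here is a minimal sketch. BumpAllocator is a hypothetical stand-in for the real slab allocator, and the pool size and payload are arbitrary. It only illustrates that the same allocation code can run over an anonymous mapping and over a named segment.

```python
import mmap
from multiprocessing import shared_memory
from typing import Optional


class BumpAllocator:
    """Toy allocator (hypothetical): hands out consecutive offsets in a pool."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.next_offset = 0

    def allocate(self, nbytes: int) -> Optional[int]:
        """Return the offset of a new slot, or None when the pool is full."""
        if self.next_offset + nbytes > self.size:
            return None
        offset = self.next_offset
        self.next_offset += nbytes
        return offset


def write_via_allocator(buffer, size: int, payload: bytes) -> bytes:
    """Allocate a slot, write the payload into it, and read it back."""
    offset = BumpAllocator(size).allocate(len(payload))
    buffer[offset:offset + len(payload)] = payload
    return bytes(buffer[offset:offset + len(payload)])


def demo(pool_size: int = 4096):
    # Anonymous mapping: it has no name, so unrelated processes cannot attach.
    anon = mmap.mmap(-1, pool_size)
    # Named segment: other processes can attach to it by name.
    shm = shared_memory.SharedMemory(create=True, size=pool_size)
    try:
        from_anon = write_via_allocator(anon, pool_size, b"kv-chunk")
        from_shm = write_via_allocator(shm.buf, pool_size, b"kv-chunk")
    finally:
        anon.close()
        shm.close()
        shm.unlink()
    return from_anon, from_shm
```

Both calls go through the same write_via_allocator function; only the buffer passed in differs.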


1. L1 Memory Pool Changes (Server Side)

1.1 Memory Pool Source Change

# Current (pseudocode)
buffer = mmap(size, MAP_PRIVATE | MAP_ANONYMOUS)
allocator = SlabAllocator(buffer, size)

# New shm (pseudocode)
fd = shm_open("/lmcache_l1_pool", O_CREAT | O_RDWR)
ftruncate(fd, size)
buffer = mmap(size, MAP_SHARED, fd)
allocator = SlabAllocator(buffer, size)   # ← completely unchanged

1.2 Startup /dev/shm Capacity Check — Fail-Fast

import math
import shutil

def _check_shm_capacity(required_bytes: int) -> None:
    """Verify /dev/shm has sufficient space. Fail-fast if not."""
    shm_stat = shutil.disk_usage("/dev/shm")
    if shm_stat.free < required_bytes:
        # Suggest 2x the pool size, rounded up so small pools never print "0g".
        suggested_gib = max(1, math.ceil(required_bytes * 2 / 2**30))
        raise RuntimeError(
            f"Insufficient /dev/shm space: need {required_bytes / 2**30:.1f} GiB, "
            f"available {shm_stat.free / 2**30:.1f} GiB. "
            f"Use 'docker run --shm-size={suggested_gib}g' or "
            f"set Kubernetes emptyDir.medium=Memory to increase /dev/shm size."
        )

Important: Docker defaults to only 64MB for /dev/shm. Use --shm-size to enlarge. Kubernetes requires emptyDir.medium: Memory.

Since SHM is the only CPU bounce path, insufficient /dev/shm means the server cannot function for CPU store/retrieve. The server must fail-fast with a clear, actionable error message rather than silently degrading.

1.3 MemoryObj: New offset Field

@dataclass
class MemoryObj:
    tensor: torch.Tensor
    shm_offset: int          # ← NEW: offset relative to shm base_ptr
    shm_byte_length: int     # ← NEW: byte size of this obj in the pool
    # ... other fields unchanged

1.4 L1MemoryManager: Expose SHM Info

def get_shm_pool_info(self) -> dict:
    return {
        "shm_name": self._shm_name,       # e.g. "/lmcache_l1_pool"
        "pool_size": self._size_in_bytes,
    }

1.5 Cleanup Strategy

  • Server is the sole owner of the shm segment and is responsible for shm_unlink
  • Workers only attach; they never unlink
  • On startup, if a stale shm with the same name exists (leftover from a previous crash), the server proactively unlinks and recreates it
  • Recommended: add a systemd ExecStopPost or Docker entrypoint cleanup script as a safety net
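
The stale-segment rule above could be sketched as follows, assuming a POSIX system and Python's multiprocessing.shared_memory rather than raw shm_open. The function name create_fresh_pool is hypothetical. Note that SharedMemory takes the name without the leading slash.

```python
from multiprocessing import shared_memory


def create_fresh_pool(name: str, size: int) -> shared_memory.SharedMemory:
    """Create a named segment, replacing any stale one left by a crash."""
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        # A segment with this name survived a previous run. Unlink it and
        # create a new one so old contents and the old size cannot leak in.
        stale = shared_memory.SharedMemory(name=name, create=False)
        stale.close()
        stale.unlink()
        return shared_memory.SharedMemory(name=name, create=True, size=size)
```

This sketch does not guard against two servers starting at the same time with the same name; that case would need separate handling.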

2. Worker Side: Attach and Tensor View Construction

2.1 Attach Flow

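from multiprocessing import shared_memory

# Worker receives shm_name and pool_size from server during register_kv_caches.
# SharedMemory adds the leading "/" itself on POSIX, so strip it from the name.
# Caveat: before Python 3.13, attaching also registers the segment with the
# resource tracker, which may unlink it when the worker exits; Python 3.13+
# accepts track=False to avoid that.
shm = shared_memory.SharedMemory(name=shm_name.lstrip("/"), create=False)
if shm.size != expected_pool_size:
    shm.close()
    raise ValueError(f"SHM pool size mismatch: {shm.size} != {expected_pool_size}")

# Cache the handle for the worker's entire lifetime (attach only once)
self._l1_shm = shm
self._l1_buffer = shm.buf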

2.2 Tensor View Construction

def _make_tensor_view(self, offset: int, length: int, shape: list[int], dtype: str) -> torch.Tensor:
    torch_dtype = getattr(torch, dtype)
    buf_view = self._l1_buffer[offset:offset + length]
    return torch.frombuffer(buf_view, dtype=torch_dtype).view(*shape)

2.3 Critical Constraints

  • torch.frombuffer returns a zero-copy view backed by shared memory
  • The worker must finish all data consumption (copy into paged KV cache) before calling finish_read; after that, the server may evict the slot at any time
  • Workers must not hold long-lived references to shm tensor views
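
A torch-free sketch of the same constraints, using the standard library's memoryview.cast as a stand-in for torch.frombuffer (an assumption made only to keep the example dependency-free). The helper names are hypothetical. It shows that a write through the view lands directly in the pool, and that the view is consumed and released before the segment is closed.

```python
from multiprocessing import shared_memory


def make_view(buf: memoryview, offset: int, length: int, fmt: str) -> memoryview:
    """Typed view over buf[offset:offset+length]; no bytes are copied."""
    return buf[offset:offset + length].cast(fmt)


def demo() -> list:
    shm = shared_memory.SharedMemory(create=True, size=64)
    try:
        view = make_view(shm.buf, offset=16, length=16, fmt="i")  # native ints
        view[0] = 42                 # the write lands directly in the pool
        raw = bytes(shm.buf[16:20])  # the same bytes, read through the pool
        copied = view.tolist()       # consume (copy out) before releasing
        view.release()               # drop the view; do not keep it around
    finally:
        shm.close()
        shm.unlink()
    return [raw, copied]
```

Releasing the view before close() matters in this sketch: a SharedMemory segment cannot be closed while views exported from its buffer are still alive.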

3. Store Path (Two-Phase RPC)

3.1 Complete Flow

┌──────────┐                              ┌──────────┐
│  Worker  │                              │  Server  │
└────┬─────┘                              └────┬─────┘
     │                                         │
     │  1. prepare_store(keys, shape, dtype)    │
     │────────────────────────────────────────→│
     │                                         │  reserve_write() → slab allocate
     │                                         │  Keys that fail OOM are silently
     │                                         │  skipped (same as CUDA path)
     │                                         │  write_lock acquired for each
     │                                         │  successfully allocated key
     │  2. response: [(key, offset, shape)]    │
     │     (only successfully allocated keys)   │
     │←────────────────────────────────────────│
     │                                         │
     │  3. Worker memcpy directly into         │
     │     shm[offset] for each returned key   │
     │     tensor_view.copy_(cpu_chunk)        │
     │                                         │
     │  4. commit_store(keys)                  │
     │────────────────────────────────────────→│
     │                                         │  finish_write() → release write_lock
     │                                         │  Data now visible to readers
     │                                         │  StoreController notified →
     │                                         │  async L2 store triggered
     │  5. response: success                   │
     │←────────────────────────────────────────│

3.2 OOM Behavior (Consistent with CUDA Path)

When reserve_write returns OUT_OF_MEMORY for some keys, those keys are silently skipped — exactly the same as the existing CUDA GPU store path in server.py:

# Existing CUDA path behavior (server.py L324-L328):
for idx, obj_key in enumerate(obj_keys):
    if obj_key in reserved_dict:
        memory_obj = reserved_dict[obj_key]
    else:
        continue    # ← OOM keys silently skipped

The shm path follows the same pattern:

  • No retry, no fallback, no error reported to the worker for skipped keys
  • The background eviction controller will eventually free space
  • Future store requests may succeed for those keys
  • OOM keys are not stored to L2 either (L2 store is triggered by finish_write, which only runs for successfully allocated keys)
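
The two-phase store with silently skipped keys can be simulated in a single process. Everything here is a hypothetical stand-in: ToyServer, Slot, and worker_store are not LMCache classes, a bytearray plays the role of the shm pool, and a bump pointer replaces the slab allocator. The sketch shows why the worker should pair chunks with returned slots by key.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    key: str
    offset: int
    length: int


@dataclass
class ToyServer:
    pool: bytearray
    next_offset: int = 0
    pending: dict = field(default_factory=dict)    # write-locked, not readable
    committed: dict = field(default_factory=dict)  # visible to readers

    def prepare_store(self, sizes: dict) -> list:
        slots = []
        for key, length in sizes.items():
            if self.next_offset + length > len(self.pool):
                continue  # OOM: key skipped, no error reported to the worker
            slot = Slot(key, self.next_offset, length)
            self.next_offset += length
            self.pending[key] = slot
            slots.append(slot)
        return slots

    def commit_store(self, keys: list) -> None:
        for key in keys:
            self.committed[key] = self.pending.pop(key)


def worker_store(server: ToyServer, chunks: dict) -> list:
    """chunks maps key -> bytes. Returns the keys that were actually stored."""
    slots = server.prepare_store({k: len(v) for k, v in chunks.items()})
    view = memoryview(server.pool)
    for slot in slots:
        # Look the chunk up by key: a skipped key makes positions unreliable.
        view[slot.offset:slot.offset + slot.length] = chunks[slot.key]
    view.release()
    stored = [slot.key for slot in slots]
    if stored:
        server.commit_store(stored)
    return stored
```

With an 8-byte pool and chunks of 4, 16, and 4 bytes, the middle key is skipped and the third chunk still lands in its own slot.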

3.3 Server Handler (Pseudocode)

def prepare_store(self, keys, instance_id, shape, dtype):
    obj_keys = ipc_key_to_object_keys(...)
    layout_desc = get_layout_desc(...)

    # reserve_write returns only successfully allocated keys
    # OOM keys are filtered out (same as CUDA path)
    reserved_dict = self.storage_manager.reserve_write(obj_keys, layout_desc, "new")

    slots = []
    for obj_key, memory_obj in reserved_dict.items():
        slots.append(ShmSlotMetadata(
            key=str(obj_key),
            shm_name=self._shm_pool_name,
            offset=memory_obj.shm_offset,
            length=memory_obj.shm_byte_length,
            shape=list(memory_obj.tensor.shape),
            dtype=str(memory_obj.tensor.dtype).removeprefix("torch."),
        ))
    return PrepareStoreResponse(slots=slots)

def commit_store(self, keys):
    self.storage_manager.finish_write(keys)
    return True

3.4 Worker Side (Pseudocode)

def submit_store_request(self, ...):
    device_synchronize(self._device_type)
    cpu_chunks = gather_chunks_to_cpu(self.kv_caches, block_ids, ...)

    # Ask server for L1 shm slots
    response = send_rpc(RequestType.PREPARE_STORE, [key, instance_id, shape, dtype])

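    # OOM keys are absent from the response, so pair chunks with slots by key.
    # A positional zip would shift every chunk after a skipped key into the
    # wrong slot. chunk_keys (assumed name) lists each chunk's key in the same
    # string form as slot.key.
    chunk_by_key = dict(zip(chunk_keys, cpu_chunks))
    for slot in response.slots:
        tensor_view = self._make_tensor_view(slot.offset, slot.length, slot.shape, slot.dtype)
        tensor_view.copy_(chunk_by_key[slot.key])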

    if response.slots:
        send_rpc(RequestType.COMMIT_STORE, [slot.key for slot in response.slots])

4. Retrieve Path (Two-Phase RPC, Symmetric to Store)

4.1 Complete Flow

┌──────────┐                              ┌──────────┐
│  Worker  │                              │  Server  │
└────┬─────┘                              └────┬─────┘
     │                                         │
     │  1. prepare_retrieve(keys)              │
     │────────────────────────────────────────→│
     │                                         │  Check L1 hit
     │                                         │  If L1 miss → L2 prefetch → wait ready
     │                                         │  Manually acquire read_lock
     │                                         │  (NOT via context manager)
     │  2. response: [(key, offset, shape)]    │
     │     read_lock held for returned keys    │
     │←────────────────────────────────────────│
     │                                         │
     │  3. Worker reads directly from          │
     │     shm[offset] for each returned key   │
     │     tensor_view = frombuffer(...)       │
     │     scatter_to_kv(tensor_view)          │
     │                                         │
     │  4. finish_read(keys)                   │
     │────────────────────────────────────────→│
     │                                         │  finish_read() → release read_lock
     │                                         │  Eviction controller may now
     │                                         │  reclaim these slots
     │  5. response: success                   │
     │←────────────────────────────────────────│

4.2 Server Handler (Pseudocode)

def prepare_retrieve(self, keys, instance_id):
    obj_keys = ipc_key_to_object_keys(...)

    # SHM path: manually acquire read locks WITHOUT using the context manager,
    # because the lock must be held until the worker calls finish_read().
    #
    # With the `with` statement, read_lock is auto-released on block exit —
    # that would free the slot before the worker has read the data.
    read_results = self.storage_manager.unsafe_read_prefetched(obj_keys)
    if not read_results or len(read_results) != len(obj_keys):
        # Some keys missing — release any locks we did acquire
        if read_results:
            self.storage_manager.finish_read_prefetched(
                [k for k in obj_keys if k in read_results]
            )
        return PrepareRetrieveResponse(success=False, slots=[])

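    slots = []
    # read_results is treated as a mapping keyed by object key (as the
    # membership test above implies), so iterate items() and fill `key`.
    for obj_key, memory_obj in read_results.items():
        slots.append(ShmSlotMetadata(
            key=str(obj_key),
            shm_name=self._shm_pool_name,
            offset=memory_obj.shm_offset,
            length=memory_obj.shm_byte_length,
            shape=list(memory_obj.tensor.shape),
            dtype=str(memory_obj.tensor.dtype).removeprefix("torch."),
        ))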
    # read_locks are held here and will NOT be released until
    # the worker explicitly calls finish_read() after consuming the data.
    return PrepareRetrieveResponse(success=True, slots=slots)

def finish_read(self, keys):
    # Worker has finished reading from shm — now safe to release read_locks.
    # After this, eviction controller may reclaim these slots.
    self.storage_manager.finish_read_prefetched(keys)

Why not use with read_prefetched_results()?

The existing context manager (read_prefetched_results) is designed for the
CUDA path where the server reads the data itself within the with block and
then the lock is released on exit. In the shm path, data consumption happens
in a different process (the worker) at a later time. Using the context
manager would release the read_lock before the worker starts reading,
allowing eviction to reclaim the slot and causing data corruption.
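
The difference between the two locking styles can be shown with a toy reader count. ReadLock and the two prepare_* functions are hypothetical and stand in for the real lock and handlers. The sketch only demonstrates when the lock is released relative to the point where the handler returns.

```python
from contextlib import contextmanager


class ReadLock:
    """Toy reader count (hypothetical); a slot is evictable when it is zero."""

    def __init__(self) -> None:
        self.readers = 0

    def acquire(self) -> None:
        self.readers += 1

    def release(self) -> None:
        self.readers -= 1

    @property
    def evictable(self) -> bool:
        return self.readers == 0


@contextmanager
def read_scoped(lock: ReadLock):
    """Context-manager style: the lock is released when the block exits."""
    lock.acquire()
    try:
        yield
    finally:
        lock.release()


def prepare_with_context_manager(lock: ReadLock) -> bool:
    with read_scoped(lock):
        pass  # the server would build its response here
    # The handler returns with the lock already released.
    return lock.evictable


def prepare_manually(lock: ReadLock) -> bool:
    lock.acquire()  # stays held after the handler returns
    return lock.evictable
```

In the manual variant the slot stays protected until a later finish_read-style call invokes lock.release().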

4.3 Worker Side (Pseudocode)

def get_finished(self, ...):
    response = rpc_result

    if response.success:
        for slot, block_ids in zip(response.slots, ...):
            tensor_view = self._make_tensor_view(slot.offset, slot.length, slot.shape, slot.dtype)
            scatter_cpu_chunks_to_kv(self.kv_caches, block_ids, [tensor_view], ...)
        send_rpc(RequestType.FINISH_READ, [keys])

5. Lock and Eviction (No Changes Needed — Built-in Protection)

| Scenario | Protection Mechanism |
|---|---|
| Worker writing to shm slot | write_lock prevents eviction and concurrent reads |
| Worker reading from shm slot | read_lock prevents eviction and concurrent writes |
| L1 memory full | Eviction controller selects unlocked objs via LRU |
| Worker crashes without finish_write/read | TTLLock auto-releases (write_ttl=600s, read_ttl=300s) |
| reserve_write OOM | Keys silently skipped (consistent with CUDA path); background eviction frees space over time |
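
The crash-recovery row relies on locks that expire by themselves. A minimal sketch of that idea follows. ToyTTLLock is hypothetical and is not the TTLLock implementation in LMCache; the injectable clock exists only so the expiry can be exercised without sleeping.

```python
import time
from typing import Callable, Optional


class ToyTTLLock:
    """Hypothetical lock that counts as released once its TTL has passed."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Optional[float] = None

    def lock(self) -> None:
        """Take the lock and start (or restart) its TTL."""
        self._expires_at = self._clock() + self._ttl

    def unlock(self) -> None:
        """Normal release, e.g. after finish_read or finish_write."""
        self._expires_at = None

    def is_locked(self) -> bool:
        """False after unlock(), or once the TTL has elapsed without one."""
        return self._expires_at is not None and self._clock() < self._expires_at
```

With a 300-second TTL, a lock taken by a worker that then crashes reads as unlocked from the 300-second mark onward, at which point the slot becomes eligible for eviction.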

reserve_write Internal Flow (Recap)

# Inside L1Manager.reserve_write():
for key in keys:
    if key exists and mode == "new": skip
    if key locked: skip

err, allocated_objs = memory_manager.allocate(layout_desc, count)
if err == OUT_OF_MEMORY:
    # Keys marked as OUT_OF_MEMORY, silently skipped by caller
    # Background eviction controller runs every 1s at watermark
    return {key: (OUT_OF_MEMORY, None) for key in need_to_allocate}

entry.write_lock.lock()
return (SUCCESS, memory_obj)

6. Protocol Changes

6.1 New RequestTypes

class RequestType(enum.Enum):
    # ... existing ...
    PREPARE_STORE = enum.auto()
    COMMIT_STORE = enum.auto()
    PREPARE_RETRIEVE = enum.auto()
    FINISH_READ = enum.auto()

6.2 Metadata Structures

class ShmSlotMetadata(msgspec.Struct):
    key: str
    shm_name: str
    offset: int
    length: int
    shape: list[int]
    dtype: str

class PrepareStoreResponse(msgspec.Struct):
    slots: list[ShmSlotMetadata]     # Only successfully allocated keys

class PrepareRetrieveResponse(msgspec.Struct):
    success: bool
    slots: list[ShmSlotMetadata]
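
Because the worker turns these fields directly into a memory view, it may be worth validating them first. The following is a sketch under assumptions: SlotInfo mirrors a subset of the fields above using a plain dataclass instead of msgspec, validate_slot is a hypothetical helper, and the ITEMSIZE table covers only a few dtype names.

```python
import math
from dataclasses import dataclass

# Element sizes in bytes for a few torch dtype names (subset, for illustration).
ITEMSIZE = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


@dataclass
class SlotInfo:
    offset: int
    length: int
    shape: list
    dtype: str


def validate_slot(slot: SlotInfo, pool_size: int) -> None:
    """Raise ValueError if the slot cannot be mapped safely onto the pool."""
    if slot.dtype not in ITEMSIZE:
        raise ValueError(f"unsupported dtype: {slot.dtype}")
    if slot.offset < 0 or slot.length < 0:
        raise ValueError("offset and length must be non-negative")
    if slot.offset + slot.length > pool_size:
        raise ValueError("slot extends past the end of the pool")
    expected = math.prod(slot.shape) * ITEMSIZE[slot.dtype]
    if expected != slot.length:
        raise ValueError(f"length {slot.length} != shape/dtype size {expected}")
```

A worker could call validate_slot(slot, pool_size) before building the view, so that a malformed response fails with a clear error rather than an out-of-bounds access.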

6.3 No Path Selection — SHM Is the Only CPU Bounce Path

There is no configuration toggle or runtime path selection. The L1 memory pool is always backed by named shared memory. The previous pickle-over-ZMQ and ring buffer paths for CPU store/retrieve are fully replaced.

The CUDA GPU path (direct memcpy_async between GPU and L1 within the server process) is completely unaffected and continues to work exactly as before.


7. File Change Summary

| File | Changes |
|---|---|
| lmcache/v1/distributed/memory_manager.py | Change L1MemoryManager.__init__ to use shm_open + mmap(MAP_SHARED) instead of anonymous mmap; add /dev/shm capacity check (fail-fast); expose shm_name via get_shm_pool_info(); add shm_unlink in close() |
| lmcache/v1/memory_management.py | Add shm_offset and shm_byte_length fields to MemoryObj; populate offset during slab allocation |
| lmcache/v1/multiprocess/server.py | Add prepare_store / commit_store / prepare_retrieve / finish_read handlers; return shm_name + pool_size in register_kv_cache response |
| lmcache/v1/multiprocess/protocols/base.py | Add 4 new RequestType entries |
| lmcache/v1/multiprocess/protocols/engine.py | Add 4 new protocol definitions; add ShmSlotMetadata, PrepareStoreResponse, PrepareRetrieveResponse |
| lmcache/integration/vllm/vllm_multi_process_adapter.py | Worker attach shm during register_kv_caches; implement _make_tensor_view; two-phase RPC store/retrieve replacing pickle path |
| lmcache/v1/multiprocess/cpu_bounce_context.py | Remove ring buffer logic; replace with shm tensor view gather/scatter |

Files that do NOT need changes: Slab allocator internals, eviction controller, lock management (TTLLock), CUDA GPU path, StoreController (L1→L2), PrefetchController (L2→L1).


8. Error Handling and Edge Cases

| Scenario | Handling |
|---|---|
| /dev/shm insufficient at startup | Fail-fast with RuntimeError and actionable message (Docker --shm-size, K8s emptyDir) |
| Stale shm from previous crash | shm_unlink then recreate on startup |
| Worker attach fails (shm not found) | Wait for server ready or throw FileNotFoundError with clear message |
| Worker attach size mismatch | Throw ValueError to prevent offset out-of-bounds |
| Worker crashes without finish_write | TTLLock timeout (600s) auto-releases write_lock |
| Worker crashes without finish_read | TTLLock timeout (300s) auto-releases read_lock |
| Server crashes | Worker RPC timeout; cleanup script handles shm_unlink |
| reserve_write OOM | Keys silently skipped, consistent with CUDA path; background eviction frees space; OOM keys not stored to L2 |
| shm deleted externally at runtime | Server detects error on next access; requires restart |
| Docker environment | Documentation and error messages must clearly state --shm-size requirement |

9. Test Plan

Unit Tests

  • shm create / attach / unlink lifecycle
  • Startup fail-fast when /dev/shm is insufficient
  • Offset allocation and tensor view construction consistency
  • Batch allocation + batch deallocation
  • OOM scenario: verify keys are silently skipped (not retried, not stored to L2)
  • TTLLock timeout auto-release
  • Stale shm cleanup on restart

Integration Tests

  • Multiple workers concurrently attach to the same shm pool
  • Complete store two-phase RPC flow: prepare_store → worker memcpy → commit_store
  • Complete retrieve two-phase RPC flow: prepare_retrieve → worker read → finish_read
  • Store with partial OOM: verify only allocated keys are stored and committed
  • Retrieve with missing keys: verify partial failure handling and lock cleanup
  • Worker crash → lock auto-release → slot becomes evictable
  • End-to-end: store via shm → L2 async store triggered → evict from L1 → retrieve from L2 via prefetch → read via shm

Performance Tests

  • 128k token long-prompt store/retrieve latency comparison against previous pickle path
  • Throughput comparison under sustained load
  • Multi-worker concurrent stress test

Regression Tests

  • CUDA GPU path full regression (must remain completely unaffected)

10. Performance Estimation

| Parameter | Value |
|---|---|
| 128k tokens × 70KB/token | ~8.75 GiB |
| DDR5 single-socket bandwidth | ~40 GB/s |
| One copy latency (shm path) | ~220 ms |
| Two+ copies latency (previous pickle path) | ~440 ms |
| Two ZMQ IPC RPC round-trips (prepare + commit) | ~0.1 ms |
| Net savings per request | ~220 ms |
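
The arithmetic behind these figures can be reproduced with a short calculation. It assumes "128k" means 128 × 1024 tokens and "70KB" means 70 × 1024 bytes, which gives exactly 8.75 GiB. The ~220 ms figure corresponds to reading the bandwidth as 40 GiB/s; with a decimal 40 GB/s the same copy comes out closer to 235 ms. The function name is hypothetical.

```python
def copy_latency_ms(
    tokens: int,
    bytes_per_token: int,
    copies: int,
    bandwidth_bytes_per_s: float,
) -> float:
    """Time to move the payload `copies` times at the given memory bandwidth."""
    total_bytes = tokens * bytes_per_token * copies
    return total_bytes / bandwidth_bytes_per_s * 1000.0


TOKENS = 128 * 1024           # assumed meaning of "128k tokens"
BYTES_PER_TOKEN = 70 * 1024   # assumed meaning of "70KB/token"

payload_gib = TOKENS * BYTES_PER_TOKEN / 2**30

# Bandwidth read as 40 GiB/s (matches the rounded estimates).
one_copy_ms = copy_latency_ms(TOKENS, BYTES_PER_TOKEN, 1, 40 * 2**30)
two_copies_ms = copy_latency_ms(TOKENS, BYTES_PER_TOKEN, 2, 40 * 2**30)

# Bandwidth read as decimal 40 GB/s.
one_copy_ms_decimal = copy_latency_ms(TOKENS, BYTES_PER_TOKEN, 1, 40e9)
```

Under either reading, removing one full copy saves roughly one copy's worth of time, and the two extra RPC round-trips (~0.1 ms) are negligible by comparison.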

11. Future Extensions (Not In Scope)

  • CUDA pinned memory: cudaHostRegister on shm buffer for faster GPU DMA (requires evaluating ulimit -l limits)
  • HugePage alignment: Use 2MB HugePages for shm mmap to reduce TLB misses
  • NUMA affinity: Bind shm to specific NUMA node in multi-socket environments
  • Slab compaction: Handle fragmentation after long-running operation
