Label
Please label your issue with "new feature" and any other relevant labels so that it can be easily categorized under LMCache Onboarding.
Is your feature request related to a problem? Please describe.
Currently, the LMCache L1 memory pool uses anonymous mmap, which makes the L1 memory invisible to worker processes. For CPU store/retrieve operations, data must be serialized (pickle) and transported over ZMQ between the worker and server, then deserialized and copied into the L1 slab. For very large prompts (e.g., 128k tokens × 70KB/token ≈ 8.75 GiB), this results in multiple memory copies and serialization overhead, leading to significant latency (~440ms per request on DDR5-class systems).
The previous ring buffer optimization attempted to reduce copies but still required one copy from ring buffer into L1 slab, and the ring buffer itself consumed a large pre-allocated memory region — effectively doubling the memory reservation without eliminating all copies.
Describe the solution you'd like
Replace the L1 memory pool's anonymous mmap with a named shared memory segment (shm_open + mmap(MAP_SHARED)), making L1 slabs directly accessible to worker processes. Workers attach to the same shm pool once at startup, and receive offset/shape/dtype tuples for target memory slots at store/retrieve time.
- For store: Workers gather KV data to CPU, request an L1 slot offset from the server, memcpy directly into the L1 shm slot, then notify the server to commit.
- For retrieve: Workers request slot offsets for keys, construct a zero-copy tensor view backed by the shm offset, copy into their local paged KV cache, then notify the server to release the read lock.
This is the only CPU bounce path. The previous pickle-over-ZMQ and ring buffer paths are replaced entirely. Since both anonymous mmap and named shm are backed by the same kernel mechanism (mmap), the slab allocator, eviction controller, and lock management require zero changes.
Describe alternatives you've considered
- Ring buffer zero-copy: Still requires one copy from ring buffer into L1 slab, and the ring buffer itself requires pre-allocating a large memory region — doubling memory usage without eliminating all copies.
- Pickle-over-ZMQ (current path): Multiple serialization/deserialization steps plus memory copies. No longer needed since workers can directly access L1 via shm.
- Keeping pickle as a fallback: Since both paths share the same L1 pool capacity and the same OOM behavior (silently skip keys, same as CUDA path), a pickle fallback provides no additional capability — it only adds code complexity.
- GPU pinned memory: Not required for this feature; could be a future extension.
Additional context
Core Design: Why mmap and shm Are Fundamentally the Same
Both are backed by mmap under the hood; the only difference is visibility:
| | mmap(MAP_ANONYMOUS) | shm_open + mmap(MAP_SHARED) |
|---|---|---|
| Kernel operation | Maps anonymous pages into process virtual address space | Creates a file on /dev/shm (tmpfs) → maps into virtual address space |
| Physical page allocation | Lazy, allocated on first write | Lazy, allocated on first write |
| Cross-process visibility | ❌ | ✅ Multiple processes can map the same named segment |
From the slab allocator's perspective, it sees a (base_ptr, size) contiguous address space. Slab allocation logic, eviction logic, and lock management — none of them need any changes.
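The visibility difference can be illustrated with the Python standard library alone. The following is a minimal sketch, assuming a POSIX system; the pool size is illustrative, and the second handle opened in the same process merely stands in for a worker attaching by name.

```python
import mmap
from multiprocessing import shared_memory

POOL_SIZE = 4096  # illustrative size, not an LMCache default

# Anonymous private mapping: reachable only through this handle in this process.
anon = mmap.mmap(-1, POOL_SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
anon[0:5] = b"hello"
anon_bytes = anon[0:5]

# Named segment: anything that knows the name can map the same pages.
owner = shared_memory.SharedMemory(create=True, size=POOL_SIZE)
peer = shared_memory.SharedMemory(name=owner.name)  # stands in for a worker attach

owner.buf[0:5] = b"hello"
seen_by_peer = bytes(peer.buf[0:5])  # written via one handle, read via the other

# Cleanup: every handle closes, only the owner unlinks.
peer.close()
owner.close()
owner.unlink()
anon.close()
```

Both mappings are addressed as a flat byte range, which is the property the slab allocator relies on; only the named segment can be reached from a second handle.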
1. L1 Memory Pool Changes (Server Side)
1.1 Memory Pool Source Change
# Current (pseudocode)
buffer = mmap(size, MAP_PRIVATE | MAP_ANONYMOUS)
allocator = SlabAllocator(buffer, size)
# New shm (pseudocode)
fd = shm_open("/lmcache_l1_pool", O_CREAT | O_RDWR)
ftruncate(fd, size)
buffer = mmap(size, MAP_SHARED, fd)
allocator = SlabAllocator(buffer, size) # ← completely unchanged
1.2 Startup /dev/shm Capacity Check — Fail-Fast
import math
import shutil

def _check_shm_capacity(required_bytes: int) -> None:
    """Verify /dev/shm has sufficient space. Fail-fast if not."""
    shm_stat = shutil.disk_usage("/dev/shm")
    if shm_stat.free < required_bytes:
        # Suggest 2x headroom, rounded up so small pools never produce "0g"
        suggested_gib = max(1, math.ceil(required_bytes * 2 / 2**30))
        raise RuntimeError(
            f"Insufficient /dev/shm space: need {required_bytes / 2**30:.1f} GiB, "
            f"available {shm_stat.free / 2**30:.1f} GiB. "
            f"Use 'docker run --shm-size={suggested_gib}g' or "
            f"set Kubernetes emptyDir.medium=Memory to increase /dev/shm size."
        )
Important: Docker defaults to only 64MB for /dev/shm; use --shm-size to enlarge it. On Kubernetes, mount an emptyDir volume with medium: Memory at /dev/shm.
Since SHM is the only CPU bounce path, insufficient /dev/shm means the server cannot function for CPU store/retrieve. The server must fail-fast with a clear, actionable error message rather than silently degrading.
1.3 MemoryObj: New offset Field
@dataclass
class MemoryObj:
    tensor: torch.Tensor
    shm_offset: int  # ← NEW: offset relative to shm base_ptr
    shm_byte_length: int  # ← NEW: byte size of this obj in the pool
    # ... other fields unchanged
1.4 L1MemoryManager: Expose SHM Info
def get_shm_pool_info(self) -> dict:
    return {
        "shm_name": self._shm_name,  # e.g. "/lmcache_l1_pool"
        "pool_size": self._size_in_bytes,
    }
1.5 Cleanup Strategy
- Server is the sole owner of the shm segment and is responsible for shm_unlink
- Workers only attach; they never unlink
- On startup, if a stale shm with the same name exists (leftover from a previous crash), the server proactively unlinks and recreates it
- Recommended: add a systemd ExecStopPost or Docker entrypoint cleanup script as a safety net
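The stale-segment rule above can be sketched with multiprocessing.shared_memory. This is an illustration rather than the proposed server code: create_fresh_pool and the demo segment name are hypothetical, and the sketch assumes a single server owns the name, so an existing segment can safely be treated as a leftover.

```python
import os
from multiprocessing import shared_memory


def create_fresh_pool(name: str, size: int) -> shared_memory.SharedMemory:
    """Create a named segment, replacing a stale one left by a crash."""
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        # The name is taken: assume it is a leftover, unlink it, and recreate.
        stale = shared_memory.SharedMemory(name=name, create=False)
        stale.close()
        stale.unlink()
        return shared_memory.SharedMemory(name=name, create=True, size=size)


# Demo: simulate a leftover segment, then start "the server" again.
demo_name = f"lmcache_demo_pool_{os.getpid()}"  # hypothetical name
leftover = shared_memory.SharedMemory(name=demo_name, create=True, size=1024)
leftover.buf[0] = 7

pool = create_fresh_pool(demo_name, 4096)
fresh_first_byte = pool.buf[0]   # new segment starts zero-filled
fresh_size_ok = pool.size >= 4096

leftover.close()
pool.close()
pool.unlink()
```

Unlinking only removes the name; a process that still has the old segment mapped keeps its pages until it closes them, which is why the recreated pool starts clean.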
2. Worker Side: Attach and Tensor View Construction
2.1 Attach Flow
from multiprocessing import shared_memory

# Worker receives shm_name and pool_size from server during register_kv_caches.
# SharedMemory prepends the leading "/" itself, so strip it from the POSIX name.
shm = shared_memory.SharedMemory(name=shm_name.lstrip("/"), create=False)
if shm.size != pool_size:
    # ValueError rather than assert, so the check survives python -O
    raise ValueError(f"SHM pool size mismatch: {shm.size} != {pool_size}")
# Cache the handle for the worker's entire lifetime (attach only once)
self._l1_shm = shm
self._l1_buffer = shm.buf
2.2 Tensor View Construction
def _make_tensor_view(self, offset: int, length: int, shape: list[int], dtype: str) -> torch.Tensor:
    torch_dtype = getattr(torch, dtype)
    buf_view = self._l1_buffer[offset:offset + length]
    return torch.frombuffer(buf_view, dtype=torch_dtype).view(*shape)
2.3 Critical Constraints
- torch.frombuffer returns a zero-copy view backed by shared memory
- The worker must finish all data consumption (copy into paged KV cache) before calling finish_read; after that, the server may evict the slot at any time
- Workers must not hold long-lived references to shm tensor views
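Why these constraints matter can be shown without torch: a view into the pool aliases the slot, so it reflects whatever the next owner writes there. The sketch below uses a plain memoryview as a stand-in for the tensor view; the offset and length are hypothetical values for one slot.

```python
from multiprocessing import shared_memory

pool = shared_memory.SharedMemory(create=True, size=64)
offset, length = 16, 4  # hypothetical slot handed out by the server

pool.buf[offset:offset + length] = b"kv-1"

slot_view = pool.buf[offset:offset + length]  # zero-copy: aliases the pool
private_copy = bytes(slot_view)               # data consumed before finish_read

# Simulate the server evicting the slot and reusing it for another key.
pool.buf[offset:offset + length] = b"kv-2"

view_after_reuse = bytes(slot_view)           # the view now shows the new data

# Exported views must be released before the mapping can be closed.
slot_view.release()
pool.close()
pool.unlink()
```

The copy taken before the slot was reused still holds the original bytes, while the long-lived view silently changed underneath its holder.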
3. Store Path (Two-Phase RPC)
3.1 Complete Flow
┌──────────┐                              ┌──────────┐
│  Worker  │                              │  Server  │
└────┬─────┘                              └────┬─────┘
     │                                         │
     │ 1. prepare_store(keys, shape, dtype)    │
     │────────────────────────────────────────→│
     │                                         │ reserve_write() → slab allocate
     │                                         │ Keys that fail OOM are silently
     │                                         │ skipped (same as CUDA path)
     │                                         │ write_lock acquired for each
     │                                         │ successfully allocated key
     │ 2. response: [(key, offset, shape)]     │
     │    (only successfully allocated keys)   │
     │←────────────────────────────────────────│
     │                                         │
     │ 3. Worker memcpy directly into          │
     │    shm[offset] for each returned key    │
     │    tensor_view.copy_(cpu_chunk)         │
     │                                         │
     │ 4. commit_store(keys)                   │
     │────────────────────────────────────────→│
     │                                         │ finish_write() → release write_lock
     │                                         │ Data now visible to readers
     │                                         │ StoreController notified →
     │                                         │ async L2 store triggered
     │ 5. response: success                    │
     │←────────────────────────────────────────│
3.2 OOM Behavior (Consistent with CUDA Path)
When reserve_write returns OUT_OF_MEMORY for some keys, those keys are silently skipped — exactly the same as the existing CUDA GPU store path in server.py:
# Existing CUDA path behavior (server.py L324-L328):
for idx, obj_key in enumerate(obj_keys):
    if obj_key in reserved_dict:
        memory_obj = reserved_dict[obj_key]
    else:
        continue  # ← OOM keys silently skipped
The shm path follows the same pattern:
- No retry, no fallback, no error reported to the worker for skipped keys
- The background eviction controller will eventually free space
- Future store requests may succeed for those keys
- OOM keys are not stored to L2 either (L2 store is triggered by finish_write, which only runs for successfully allocated keys)
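One consequence of silent skipping is that the response can be shorter than the request, so the worker has to match chunks to slots by key rather than by position. A small sketch of the difference follows; pair_chunks_with_slots and the (key, offset) tuples are hypothetical simplifications of the real slot metadata.

```python
from typing import Dict, List, Tuple


def pair_chunks_with_slots(
    keys: List[str],
    chunks: List[bytes],
    granted: List[Tuple[str, int]],
) -> List[Tuple[bytes, int]]:
    """Pair each granted (key, offset) slot with the chunk for the same key."""
    chunk_by_key: Dict[str, bytes] = dict(zip(keys, chunks))
    return [(chunk_by_key[key], offset) for key, offset in granted]


keys = ["k0", "k1", "k2"]
chunks = [b"c0", b"c1", b"c2"]
granted = [("k0", 0), ("k2", 4096)]  # "k1" was skipped on OOM

by_key = pair_chunks_with_slots(keys, chunks, granted)

# Positional pairing writes k1's chunk into k2's slot once a key is skipped.
by_position = [(chunk, off) for chunk, (_, off) in zip(chunks, granted)]
```

With one key skipped, the keyed pairing sends c2 to offset 4096, while positional pairing would send c1 there.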
3.3 Server Handler (Pseudocode)
def prepare_store(self, keys, instance_id, shape, dtype):
    obj_keys = ipc_key_to_object_keys(...)
    layout_desc = get_layout_desc(...)
    # reserve_write returns only successfully allocated keys
    # OOM keys are filtered out (same as CUDA path)
    reserved_dict = self.storage_manager.reserve_write(obj_keys, layout_desc, "new")
    slots = []
    for obj_key, memory_obj in reserved_dict.items():
        slots.append(ShmSlotMetadata(
            key=str(obj_key),
            shm_name=self._shm_pool_name,
            offset=memory_obj.shm_offset,
            length=memory_obj.shm_byte_length,
            shape=list(memory_obj.tensor.shape),
            dtype=str(memory_obj.tensor.dtype).removeprefix("torch."),
        ))
    return PrepareStoreResponse(slots=slots)

def commit_store(self, keys):
    self.storage_manager.finish_write(keys)
    return True
3.4 Worker Side (Pseudocode)
def submit_store_request(self, ...):
    device_synchronize(self._device_type)
    cpu_chunks = gather_chunks_to_cpu(self.kv_caches, block_ids, ...)
    # Ask server for L1 shm slots
    response = send_rpc(RequestType.PREPARE_STORE, [key, instance_id, shape, dtype])
    # Only write keys that were successfully allocated (OOM keys not in response).
    # Match chunks to slots by key: pairing by position would misalign as soon
    # as one key is skipped. chunk_keys (one key per chunk, in slot.key form)
    # is an assumed value, not an existing variable.
    chunk_by_key = dict(zip(chunk_keys, cpu_chunks))
    for slot in response.slots:
        tensor_view = self._make_tensor_view(slot.offset, slot.length, slot.shape, slot.dtype)
        tensor_view.copy_(chunk_by_key[slot.key])
    if response.slots:
        send_rpc(RequestType.COMMIT_STORE, [slot.key for slot in response.slots])
4. Retrieve Path (Two-Phase RPC, Symmetric to Store)
4.1 Complete Flow
┌──────────┐                              ┌──────────┐
│  Worker  │                              │  Server  │
└────┬─────┘                              └────┬─────┘
     │                                         │
     │ 1. prepare_retrieve(keys)               │
     │────────────────────────────────────────→│
     │                                         │ Check L1 hit
     │                                         │ If L1 miss → L2 prefetch → wait ready
     │                                         │ Manually acquire read_lock
     │                                         │ (NOT via context manager)
     │ 2. response: [(key, offset, shape)]     │
     │    read_lock held for returned keys     │
     │←────────────────────────────────────────│
     │                                         │
     │ 3. Worker reads directly from           │
     │    shm[offset] for each returned key    │
     │    tensor_view = frombuffer(...)        │
     │    scatter_to_kv(tensor_view)           │
     │                                         │
     │ 4. finish_read(keys)                    │
     │────────────────────────────────────────→│
     │                                         │ finish_read() → release read_lock
     │                                         │ Eviction controller may now
     │                                         │ reclaim these slots
     │ 5. response: success                    │
     │←────────────────────────────────────────│
4.2 Server Handler (Pseudocode)
def prepare_retrieve(self, keys, instance_id):
    obj_keys = ipc_key_to_object_keys(...)
    # SHM path: manually acquire read locks WITHOUT using the context manager,
    # because the lock must be held until the worker calls finish_read().
    #
    # With the `with` statement, read_lock is auto-released on block exit,
    # which would free the slot before the worker has read the data.
    read_results = self.storage_manager.unsafe_read_prefetched(obj_keys)
    if not read_results or len(read_results) != len(obj_keys):
        # Some keys missing: release any locks we did acquire
        if read_results:
            self.storage_manager.finish_read_prefetched(
                [k for k in obj_keys if k in read_results]
            )
        return PrepareRetrieveResponse(success=False, slots=[])
    slots = []
    # Assumes read_results maps obj_key → MemoryObj, as the membership test
    # above implies
    for obj_key in obj_keys:
        memory_obj = read_results[obj_key]
        slots.append(ShmSlotMetadata(
            key=str(obj_key),  # required field of ShmSlotMetadata
            shm_name=self._shm_pool_name,
            offset=memory_obj.shm_offset,
            length=memory_obj.shm_byte_length,
            shape=list(memory_obj.tensor.shape),
            dtype=str(memory_obj.tensor.dtype).removeprefix("torch."),
        ))
    # read_locks are held here and will NOT be released until
    # the worker explicitly calls finish_read() after consuming the data.
    return PrepareRetrieveResponse(success=True, slots=slots)

def finish_read(self, keys):
    # Worker has finished reading from shm, so it is now safe to release read_locks.
    # After this, eviction controller may reclaim these slots.
    self.storage_manager.finish_read_prefetched(keys)
Why not use with read_prefetched_results()?
The existing context manager (read_prefetched_results) is designed for the
CUDA path where the server reads the data itself within the with block and
then the lock is released on exit. In the shm path, data consumption happens
in a different process (the worker) at a later time. Using the context
manager would release the read_lock before the worker starts reading,
allowing eviction to reclaim the slot and causing data corruption.
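The lifetime mismatch can be shown with a toy reader count. Everything in this sketch (ToyReadLock, read_scoped, the two handlers) is hypothetical and only mimics the shape of the two approaches; it is not the LMCache lock implementation.

```python
from contextlib import contextmanager


class ToyReadLock:
    """Counts active readers; a slot is evictable when the count is zero."""

    def __init__(self) -> None:
        self.readers = 0

    def acquire(self) -> None:
        self.readers += 1

    def release(self) -> None:
        self.readers -= 1


@contextmanager
def read_scoped(lock: ToyReadLock):
    lock.acquire()
    try:
        yield
    finally:
        lock.release()


lock = ToyReadLock()


def handler_with_context_manager() -> dict:
    with read_scoped(lock):
        response = {"offset": 0}
    return response  # lock is already released when the RPC reply leaves


def handler_manual() -> dict:
    lock.acquire()   # held across the RPC boundary
    return {"offset": 0}


def finish_read() -> None:
    lock.release()


handler_with_context_manager()
readers_after_scoped_reply = lock.readers   # worker has not read yet

handler_manual()
readers_while_worker_reads = lock.readers   # slot still protected
finish_read()
readers_after_finish_read = lock.readers
```

With the scoped version the count is back to zero before the worker touches the slot; with manual acquisition it stays at one until finish_read arrives.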
4.3 Worker Side (Pseudocode)
def get_finished(self, ...):
    response = rpc_result
    if response.success:
        for slot, block_ids in zip(response.slots, ...):
            tensor_view = self._make_tensor_view(slot.offset, slot.length, slot.shape, slot.dtype)
            scatter_cpu_chunks_to_kv(self.kv_caches, block_ids, [tensor_view], ...)
        send_rpc(RequestType.FINISH_READ, [keys])
5. Lock and Eviction (No Changes Needed — Built-in Protection)
| Scenario | Protection Mechanism |
|---|---|
| Worker writing to shm slot | write_lock prevents eviction and concurrent reads |
| Worker reading from shm slot | read_lock prevents eviction and concurrent writes |
| L1 memory full | Eviction controller selects unlocked objs via LRU |
| Worker crashes without finish_write/read | TTLLock auto-releases (write_ttl=600s, read_ttl=300s) |
| reserve_write OOM | Keys silently skipped (consistent with CUDA path); background eviction frees space over time |
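The crash-recovery row relies on locks that expire on their own. The following toy illustrates that idea under stated assumptions: ToyTTLLock is hypothetical and not the existing TTLLock class, and the injected clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Optional


class ToyTTLLock:
    """Hypothetical lock whose hold lapses ttl seconds after lock()."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._expires_at: Optional[float] = None

    def lock(self) -> None:
        self._expires_at = self._clock() + self._ttl

    def unlock(self) -> None:
        self._expires_at = None

    def is_locked(self) -> bool:
        return self._expires_at is not None and self._clock() < self._expires_at


# Fake clock so the demo does not have to sleep for 300 seconds.
now = [0.0]
read_lock = ToyTTLLock(ttl=300.0, clock=lambda: now[0])

read_lock.lock()                        # worker starts reading, then crashes
held_at_start = read_lock.is_locked()
now[0] = 299.0
held_before_ttl = read_lock.is_locked()
now[0] = 300.0
held_after_ttl = read_lock.is_locked()  # slot becomes evictable again
```

A timeout-based release also implies that the TTL should comfortably exceed the worst-case copy time for a slot, since a slow but live worker is indistinguishable from a crashed one.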
reserve_write Internal Flow (Recap)
# Inside L1Manager.reserve_write():
for key in keys:
    if key exists and mode == "new": skip
    if key locked: skip
err, allocated_objs = memory_manager.allocate(layout_desc, count)
if err == OUT_OF_MEMORY:
    # Keys marked as OUT_OF_MEMORY, silently skipped by caller
    # Background eviction controller runs every 1s at watermark
    return {key: (OUT_OF_MEMORY, None) for key in need_to_allocate}
entry.write_lock.lock()
return (SUCCESS, memory_obj)
6. Protocol Changes
6.1 New RequestTypes
class RequestType(enum.Enum):
    # ... existing ...
    PREPARE_STORE = enum.auto()
    COMMIT_STORE = enum.auto()
    PREPARE_RETRIEVE = enum.auto()
    FINISH_READ = enum.auto()
6.2 Metadata Structures
class ShmSlotMetadata(msgspec.Struct):
    key: str
    shm_name: str
    offset: int
    length: int
    shape: list[int]
    dtype: str

class PrepareStoreResponse(msgspec.Struct):
    slots: list[ShmSlotMetadata]  # Only successfully allocated keys

class PrepareRetrieveResponse(msgspec.Struct):
    success: bool
    slots: list[ShmSlotMetadata]
6.3 No Path Selection — SHM Is the Only CPU Bounce Path
There is no configuration toggle or runtime path selection. The L1 memory pool is always backed by named shared memory. The previous pickle-over-ZMQ and ring buffer paths for CPU store/retrieve are fully replaced.
The CUDA GPU path (direct memcpy_async between GPU and L1 within the server process) is completely unaffected and continues to work exactly as before.
7. File Change Summary
| File | Changes |
|---|---|
| lmcache/v1/distributed/memory_manager.py | Change L1MemoryManager.__init__ to use shm_open + mmap(MAP_SHARED) instead of anonymous mmap; add /dev/shm capacity check (fail-fast); expose shm_name via get_shm_pool_info(); add shm_unlink in close() |
| lmcache/v1/memory_management.py | Add shm_offset and shm_byte_length fields to MemoryObj; populate offset during slab allocation |
| lmcache/v1/multiprocess/server.py | Add prepare_store / commit_store / prepare_retrieve / finish_read handlers; return shm_name + pool_size in register_kv_cache response |
| lmcache/v1/multiprocess/protocols/base.py | Add 4 new RequestType entries |
| lmcache/v1/multiprocess/protocols/engine.py | Add 4 new protocol definitions; add ShmSlotMetadata, PrepareStoreResponse, PrepareRetrieveResponse |
| lmcache/integration/vllm/vllm_multi_process_adapter.py | Worker attach shm during register_kv_caches; implement _make_tensor_view; two-phase RPC store/retrieve replacing pickle path |
| lmcache/v1/multiprocess/cpu_bounce_context.py | Remove ring buffer logic; replace with shm tensor view gather/scatter |
Files that do NOT need changes: Slab allocator internals, eviction controller, lock management (TTLLock), CUDA GPU path, StoreController (L1→L2), PrefetchController (L2→L1).
8. Error Handling and Edge Cases
| Scenario | Handling |
|---|---|
| /dev/shm insufficient at startup | Fail-fast with RuntimeError and actionable message (Docker --shm-size, K8s emptyDir) |
| Stale shm from previous crash | shm_unlink then recreate on startup |
| Worker attach fails (shm not found) | Wait for server ready or throw FileNotFoundError with clear message |
| Worker attach size mismatch | Throw ValueError to prevent offset out-of-bounds |
| Worker crashes without finish_write | TTLLock timeout (600s) auto-releases write_lock |
| Worker crashes without finish_read | TTLLock timeout (300s) auto-releases read_lock |
| Server crashes | Worker RPC timeout; cleanup script handles shm_unlink |
| reserve_write OOM | Keys silently skipped, consistent with CUDA path; background eviction frees space; OOM keys not stored to L2 |
| shm deleted externally at runtime | Server detects error on next access; requires restart |
| Docker environment | Documentation and error messages must clearly state --shm-size requirement |
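The two worker-attach rows could be combined into one helper along the following lines. This is a sketch under assumptions: attach_with_retry, its timeout defaults, and the demo segment names are hypothetical. The size check uses "smaller than expected" rather than strict equality because some platforms round segment sizes up to a page multiple.

```python
import time
from multiprocessing import shared_memory


def attach_with_retry(name: str, expected_size: int,
                      timeout_s: float = 5.0, poll_s: float = 0.05):
    """Attach to an existing segment, waiting briefly for the server to create it."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            shm = shared_memory.SharedMemory(name=name, create=False)
            break
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                raise FileNotFoundError(
                    f"shm segment {name!r} not found after {timeout_s}s; "
                    "is the server running?"
                ) from None
            time.sleep(poll_s)
    if shm.size < expected_size:
        actual = shm.size
        shm.close()
        raise ValueError(f"shm pool too small: {actual} < {expected_size}")
    return shm


owner = shared_memory.SharedMemory(create=True, size=4096)  # stands in for the server

worker = attach_with_retry(owner.name, expected_size=4096)
attached_ok = worker.size >= 4096
worker.close()

try:
    attach_with_retry("lmcache_demo_missing_segment", 4096, timeout_s=0.1)
    missing_raised = False
except FileNotFoundError:
    missing_raised = True

try:
    attach_with_retry(owner.name, expected_size=1 << 20)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

owner.close()
owner.unlink()
```

Failing on a too-small pool at attach time keeps later offsets from pointing past the end of the mapping.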
9. Test Plan
Unit Tests
- shm create / attach / unlink lifecycle
- Startup fail-fast when /dev/shm is insufficient
- Offset allocation and tensor view construction consistency
- Batch allocation + batch deallocation
- OOM scenario: verify keys are silently skipped (not retried, not stored to L2)
- TTLLock timeout auto-release
- Stale shm cleanup on restart
Integration Tests
- Multiple workers concurrently attach to the same shm pool
- Complete store two-phase RPC flow: prepare_store → worker memcpy → commit_store
- Complete retrieve two-phase RPC flow: prepare_retrieve → worker read → finish_read
- Store with partial OOM: verify only allocated keys are stored and committed
- Retrieve with missing keys: verify partial failure handling and lock cleanup
- Worker crash → lock auto-release → slot becomes evictable
- End-to-end: store via shm → L2 async store triggered → evict from L1 → retrieve from L2 via prefetch → read via shm
Performance Tests
- 128k token long-prompt store/retrieve latency comparison against previous pickle path
- Throughput comparison under sustained load
- Multi-worker concurrent stress test
Regression Tests
- CUDA GPU path full regression (must remain completely unaffected)
10. Performance Estimation
| Parameter | Value |
|---|---|
| 128k tokens × 70KB/token | ~8.75 GiB |
| DDR5 single-socket bandwidth | ~40 GB/s |
| One copy latency (shm path) | ~220 ms |
| Two+ copies latency (previous pickle path) | ~440 ms |
| Two ZMQ IPC RPC round-trips (prepare + commit) | ~0.1 ms |
| Net savings per request | ~220 ms |
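The estimate can be reproduced with a few lines of arithmetic. The sketch below assumes binary units for the payload (128 × 1024 tokens, 70 × 1024 bytes per token) and a decimal 40 GB/s for bandwidth, and it models the pickle path as exactly two copies, which the table treats as a lower bound.

```python
TOKENS = 128 * 1024
BYTES_PER_TOKEN = 70 * 1024
BANDWIDTH_BYTES_PER_S = 40e9   # ~40 GB/s, the table's assumption
RPC_OVERHEAD_MS = 0.1          # two ZMQ round-trips, from the table

payload_bytes = TOKENS * BYTES_PER_TOKEN
payload_gib = payload_bytes / 2**30

one_copy_ms = payload_bytes / BANDWIDTH_BYTES_PER_S * 1e3
two_copy_ms = 2 * one_copy_ms
net_savings_ms = two_copy_ms - (one_copy_ms + RPC_OVERHEAD_MS)

print(f"payload: {payload_gib:.2f} GiB")
print(f"one copy: {one_copy_ms:.0f} ms, two copies: {two_copy_ms:.0f} ms")
print(f"net savings: {net_savings_ms:.0f} ms")
```

Under these assumptions one copy takes roughly 235 ms; reading 8.75 as decimal GB instead gives about 219 ms, so the table's ~220 ms sits at the lower end of that range. In both readings the RPC overhead is negligible next to the copy it removes.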
11. Future Extensions (Not In Scope)
- CUDA pinned memory: cudaHostRegister on shm buffer for faster GPU DMA (requires evaluating ulimit -l limits)
- HugePage alignment: Use 2MB HugePages for shm mmap to reduce TLB misses
- NUMA affinity: Bind shm to specific NUMA node in multi-socket environments
- Slab compaction: Handle fragmentation after long-running operation