Support SHM-Backed L1 Memory Pool for Zero-Copy CPU Store/Retrieve Path

**Label**
Please label your issue with "new feature" and any other relevant labels so that it can be easily categorized under [LMCache Onboarding](https://github.com/LMCache/LMCache/issues/1882)

**Is your feature request related to a problem? Please describe.**
Currently, the LMCache L1 memory pool uses anonymous mmap, which makes the L1 memory invisible to worker processes. For CPU store/retrieve operations, data must be serialized (pickle) and transported over ZMQ between the worker and server, then deserialized and copied into the L1 slab. For very large prompts (e.g., 128k tokens × 70 KB/token ≈ 8.75 GiB), this results in multiple memory copies plus serialization overhead, leading to significant latency (an estimated ~440 ms per request on DDR5-class systems).

The previous ring buffer optimization attempted to reduce copies but still required one copy from ring buffer into L1 slab, and the ring buffer itself consumed a large pre-allocated memory region — effectively doubling the memory reservation without eliminating all copies.

**Describe the solution you'd like**
Replace the L1 memory pool's anonymous mmap with a named shared memory segment (`shm_open` + `mmap(MAP_SHARED)`), making L1 slabs directly accessible to worker processes. Workers attach to the same shm pool once at startup, and receive offset/shape/dtype tuples for target memory slots at store/retrieve time.

- For store: Workers gather KV data to CPU, request an L1 slot offset from the server, `memcpy` directly into the L1 shm slot, then notify the server to commit.
- For retrieve: Workers request slot offsets for keys, construct a zero-copy tensor view backed by the shm offset, copy into their local paged KV cache, then notify the server to release the read lock.

This is the **only** CPU bounce path. The previous pickle-over-ZMQ and ring buffer paths are replaced entirely. Since both anonymous mmap and named shm are backed by the same kernel mechanism (`mmap`), the slab allocator, eviction controller, and lock management require **zero changes**.

**Describe alternatives you've considered**
- **Ring buffer zero-copy**: Still requires one copy from ring buffer into L1 slab, and the ring buffer itself requires pre-allocating a large memory region — doubling memory usage without eliminating all copies.
- **Pickle-over-ZMQ (current path)**: Multiple serialization/deserialization steps plus memory copies. No longer needed since workers can directly access L1 via shm.
- **Keeping pickle as a fallback**: Since both paths share the same L1 pool capacity and the same OOM behavior (silently skip keys, same as CUDA path), a pickle fallback provides no additional capability — it only adds code complexity.
- **GPU pinned memory**: Not required for this feature; could be a future extension.

**Additional context**

## Core Design: Why mmap and shm Are Fundamentally the Same

Both are backed by `mmap` under the hood; the only difference is visibility:

| | `mmap(MAP_ANONYMOUS)` | `shm_open` + `mmap(MAP_SHARED)` |
|---|---|---|
| Kernel operation | Maps anonymous pages into process virtual address space | Creates a file on `/dev/shm` (tmpfs) → maps into virtual address space |
| Physical page allocation | Lazy, allocated on first write | Lazy, allocated on first write |
| Cross-process visibility | ❌ | ✅ Multiple processes can map the same named segment |

**From the slab allocator's perspective, it sees a `(base_ptr, size)` contiguous address space.** Slab allocation logic, eviction logic, and lock management — **none of them need any changes**.
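
As a minimal sketch of that claim, the following runs one allocation sequence against both backings. `ToySlabAllocator` and `exercise` are hypothetical names for illustration only (they are not LMCache's allocator), and `multiprocessing.shared_memory` stands in for `shm_open` + `mmap(MAP_SHARED)`.

```python
import mmap
from multiprocessing import shared_memory
from typing import List, Optional


class ToySlabAllocator:
    """Hypothetical fixed-size slab allocator; it only tracks byte offsets."""

    def __init__(self, buffer, size: int, slab_size: int) -> None:
        self.buffer = buffer  # any writable buffer: an mmap object or a memoryview
        self.slab_size = slab_size
        self.free_offsets: List[int] = list(range(0, size - slab_size + 1, slab_size))

    def allocate(self) -> Optional[int]:
        """Return the offset of a free slab, or None when the pool is full."""
        return self.free_offsets.pop(0) if self.free_offsets else None

    def write(self, offset: int, payload: bytes) -> None:
        assert len(payload) <= self.slab_size
        self.buffer[offset:offset + len(payload)] = payload

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.buffer[offset:offset + length])


def exercise(buffer, size: int) -> list:
    """Run the same allocation sequence against whichever backing is passed in."""
    alloc = ToySlabAllocator(buffer, size, slab_size=1024)
    offsets = [alloc.allocate() for _ in range(5)]  # the fifth request finds no free slab
    alloc.write(offsets[1], b"kv-chunk")
    return offsets + [alloc.read(offsets[1], 8)]


POOL_SIZE = 4096
anon = mmap.mmap(-1, POOL_SIZE)                                 # anonymous mapping
shm = shared_memory.SharedMemory(create=True, size=POOL_SIZE)   # named segment

anon_result = exercise(anon, POOL_SIZE)
shm_result = exercise(shm.buf, POOL_SIZE)

anon.close()
shm.close()
shm.unlink()
```

Both runs produce the same offsets and read back the same bytes, because the allocator never asks where the buffer came from.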

---

## 1. L1 Memory Pool Changes (Server Side)

### 1.1 Memory Pool Source Change

```python
# Current (pseudocode)
buffer = mmap(size, MAP_PRIVATE | MAP_ANONYMOUS)
allocator = SlabAllocator(buffer, size)

# New shm (pseudocode)
fd = shm_open("/lmcache_l1_pool", O_CREAT | O_RDWR)
ftruncate(fd, size)
buffer = mmap(size, MAP_SHARED, fd)
allocator = SlabAllocator(buffer, size)   # ← completely unchanged
```
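
For readers who want to try the new mapping from Python, here is a runnable sketch under stated assumptions: on Linux with glibc, `shm_open("/name")` amounts to opening `/dev/shm/name`, so the sketch uses `os.open` on that path. `create_pool`, `attach_pool`, and the demo file name are hypothetical, and the fallback to the temp directory exists only so the sketch runs where `/dev/shm` is absent.

```python
import mmap
import os
import tempfile

# Assumption: /dev/shm is the tmpfs behind shm_open on this system.
SHM_DIR = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
POOL_PATH = os.path.join(SHM_DIR, f"lmcache_l1_pool_demo_{os.getpid()}")  # hypothetical name
POOL_SIZE = 1 << 20  # 1 MiB for the demo


def create_pool(path: str, size: int) -> mmap.mmap:
    """Server side: create the backing file, size it, and map it shared."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, size)
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def attach_pool(path: str, size: int) -> mmap.mmap:
    """Worker side: map the existing segment without creating it."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED)
    finally:
        os.close(fd)


server_view = create_pool(POOL_PATH, POOL_SIZE)
worker_view = attach_pool(POOL_PATH, POOL_SIZE)

worker_view[4096:4096 + 5] = b"hello"        # worker writes at an offset
seen_by_server = server_view[4096:4096 + 5]  # server reads the same bytes

worker_view.close()
server_view.close()
os.unlink(POOL_PATH)  # plays the role of shm_unlink for this demo file
```

In this sketch the two mappings stand in for two processes; a write through one is visible through the other without any copy between them.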

### 1.2 Startup `/dev/shm` Capacity Check — Fail-Fast

```python
import math
import shutil

def _check_shm_capacity(required_bytes: int) -> None:
    """Verify /dev/shm has sufficient space. Fail-fast if not."""
    shm_stat = shutil.disk_usage("/dev/shm")
    if shm_stat.free < required_bytes:
        # Suggest 2x headroom, rounded up so small pools never yield "--shm-size=0g".
        suggested_gib = math.ceil(required_bytes * 2 / 2**30)
        raise RuntimeError(
            f"Insufficient /dev/shm space: need {required_bytes / 2**30:.1f} GiB, "
            f"available {shm_stat.free / 2**30:.1f} GiB. "
            f"Use 'docker run --shm-size={suggested_gib}g' or mount a Kubernetes "
            f"emptyDir with medium: Memory at /dev/shm to increase its size."
        )
```

> **Important**: Docker defaults to only 64 MB for `/dev/shm`; use `--shm-size` to enlarge it. On Kubernetes, mount an `emptyDir` volume with `medium: Memory` at `/dev/shm`.
>
> Since SHM is the only CPU bounce path, insufficient `/dev/shm` means the server **cannot function for CPU store/retrieve**. The server must fail-fast with a clear, actionable error message rather than silently degrading.

### 1.3 MemoryObj: New `shm_offset` and `shm_byte_length` Fields

```python
@dataclass
class MemoryObj:
    tensor: torch.Tensor
    shm_offset: int          # ← NEW: offset relative to shm base_ptr
    shm_byte_length: int     # ← NEW: byte size of this obj in the pool
    # ... other fields unchanged
```

### 1.4 L1MemoryManager: Expose SHM Info

```python
def get_shm_pool_info(self) -> dict:
    return {
        "shm_name": self._shm_name,       # e.g. "/lmcache_l1_pool"
        "pool_size": self._size_in_bytes,
    }
```

### 1.5 Cleanup Strategy

- Server is the **sole owner** of the shm segment and is responsible for `shm_unlink`
- Workers only attach; they never unlink
- On startup, if a stale shm with the same name exists (leftover from a previous crash), the server proactively unlinks and recreates it
- Recommended: add a systemd `ExecStopPost` or Docker entrypoint cleanup script as a safety net
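
The ownership rules above can be sketched as follows. This is a hedged illustration, not the proposed implementation: `create_owned_pool`, `destroy_pool`, and `segment_exists` are hypothetical helpers, `multiprocessing.shared_memory` stands in for raw `shm_open`/`shm_unlink`, and the sketch assumes only one server ever uses a given name (otherwise unlinking a "stale" segment could pull the name out from under a live server).

```python
import os
from multiprocessing import shared_memory


def create_owned_pool(name: str, size: int) -> shared_memory.SharedMemory:
    """Create the pool as sole owner, removing a stale segment if one exists."""
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        # Leftover from a previous crash: attach only to unlink it, then recreate.
        stale = shared_memory.SharedMemory(name=name, create=False)
        stale.close()
        stale.unlink()
        return shared_memory.SharedMemory(name=name, create=True, size=size)


def destroy_pool(pool: shared_memory.SharedMemory) -> None:
    """Owner-only teardown: unmap, then remove the name."""
    pool.close()
    pool.unlink()


def segment_exists(name: str) -> bool:
    """Probe for the name without keeping a mapping open."""
    try:
        probe = shared_memory.SharedMemory(name=name, create=False)
    except FileNotFoundError:
        return False
    probe.close()
    return True


name = f"lmcache_demo_{os.getpid()}"  # hypothetical segment name

# Simulate a crashed server: the segment is closed but never unlinked.
crashed = shared_memory.SharedMemory(name=name, create=True, size=4096)
crashed.buf[0] = 0xFF
crashed.close()
stale_present = segment_exists(name)

pool = create_owned_pool(name, 8192)
first_byte = pool.buf[0]   # a recreated segment starts zero-filled
pool_size = pool.size
destroy_pool(pool)
present_after_destroy = segment_exists(name)
```

The recreated segment does not inherit the crashed server's contents or size, which is the property the startup cleanup relies on.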

---

## 2. Worker Side: Attach and Tensor View Construction

### 2.1 Attach Flow

```python
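from multiprocessing import shared_memory

# Worker receives shm_name and pool_size from server during register_kv_caches.
# SharedMemory adds the leading "/" itself on POSIX, so strip it from the
# server-side name ("/lmcache_l1_pool" -> "lmcache_l1_pool").
shm = shared_memory.SharedMemory(name=shm_name.lstrip("/"), create=False)
if shm.size != expected_pool_size:
    raise ValueError(
        f"SHM pool size mismatch: attached {shm.size} B, expected {expected_pool_size} B"
    )
# Caveat: before Python 3.13 (which adds track=False), attaching also registers
# the segment with the resource tracker, which may unlink it when this worker exits.

# Cache the handle for the worker's entire lifetime (attach only once)
self._l1_shm = shm
self._l1_buffer = shm.buf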
```

### 2.2 Tensor View Construction

```python
def _make_tensor_view(self, offset: int, length: int, shape: list[int], dtype: str) -> torch.Tensor:
    torch_dtype = getattr(torch, dtype)
    buf_view = self._l1_buffer[offset:offset + length]
    return torch.frombuffer(buf_view, dtype=torch_dtype).view(*shape)
```

### 2.3 Critical Constraints

- `torch.frombuffer` returns a **zero-copy view** backed by shared memory
- The worker **must** finish all data consumption (copy into paged KV cache) before calling `finish_read`; after that, the server may evict the slot at any time
- Workers must **not** hold long-lived references to shm tensor views
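
Since the worker trusts offsets sent over RPC, a bounds check before building the view is cheap insurance. The sketch below uses only the standard library: `memoryview.cast` stands in for `torch.frombuffer(...).view(...)`, and `checked_slot_view` is a hypothetical helper, not part of the proposal.

```python
import struct
from math import prod
from multiprocessing import shared_memory
from typing import Sequence


def checked_slot_view(pool: memoryview, offset: int, length: int,
                      shape: Sequence[int], itemsize: int) -> memoryview:
    """Return a zero-copy byte view of one slot after validating its bounds."""
    if offset < 0 or length < 0 or offset + length > len(pool):
        raise ValueError(
            f"slot [{offset}, {offset + length}) exceeds pool of {len(pool)} bytes"
        )
    if prod(shape) * itemsize != length:
        raise ValueError(
            f"shape {list(shape)} x {itemsize} B does not match length {length}"
        )
    return pool[offset:offset + length]


shm = shared_memory.SharedMemory(create=True, size=4096)

slot = checked_slot_view(shm.buf, offset=1024, length=24, shape=[2, 3], itemsize=4)
floats = slot.cast("f")                       # flat float32 view over the same bytes
floats[5] = 1.5                               # write through the view
grid = slot.cast("f", shape=[2, 3]).tolist()  # same bytes seen as a 2x3 array
raw_tail = bytes(shm.buf[1044:1048])          # last element as stored in the pool
expected_tail = struct.pack("f", 1.5)

try:
    checked_slot_view(shm.buf, offset=1 << 20, length=24, shape=[2, 3], itemsize=4)
    rejected = False
except ValueError:
    rejected = True

# Views must be dropped before the mapping can be closed, which mirrors the
# rule that workers should not hold long-lived references to slot views.
floats.release()
slot.release()
shm.close()
shm.unlink()
```

The write through `floats` lands directly in the pool bytes, which is the zero-copy behavior the constraints above depend on.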

---

## 3. Store Path (Two-Phase RPC)

### 3.1 Complete Flow

```
┌──────────┐                              ┌──────────┐
│  Worker  │                              │  Server  │
└────┬─────┘                              └────┬─────┘
     │                                         │
     │  1. prepare_store(keys, shape, dtype)    │
     │────────────────────────────────────────→│
     │                                         │  reserve_write() → slab allocate
     │                                         │  Keys that fail OOM are silently
     │                                         │  skipped (same as CUDA path)
     │                                         │  write_lock acquired for each
     │                                         │  successfully allocated key
     │  2. response: [(key, offset, shape)]    │
     │     (only successfully allocated keys)   │
     │←────────────────────────────────────────│
     │                                         │
     │  3. Worker memcpy directly into         │
     │     shm[offset] for each returned key   │
     │     tensor_view.copy_(cpu_chunk)        │
     │                                         │
     │  4. commit_store(keys)                  │
     │────────────────────────────────────────→│
     │                                         │  finish_write() → release write_lock
     │                                         │  Data now visible to readers
     │                                         │  StoreController notified →
     │                                         │  async L2 store triggered
     │  5. response: success                   │
     │←────────────────────────────────────────│
```

### 3.2 OOM Behavior (Consistent with CUDA Path)

When `reserve_write` returns `OUT_OF_MEMORY` for some keys, those keys are **silently skipped** — exactly the same as the existing CUDA GPU store path in `server.py`:

```python
# Existing CUDA path behavior (server.py L324-L328):
for idx, obj_key in enumerate(obj_keys):
    if obj_key in reserved_dict:
        memory_obj = reserved_dict[obj_key]
    else:
        continue    # ← OOM keys silently skipped
```

The shm path follows the same pattern:
- No retry, no fallback, no error reported to the worker for skipped keys
- The background eviction controller will eventually free space
- Future store requests may succeed for those keys
- OOM keys are **not** stored to L2 either (L2 store is triggered by `finish_write`, which only runs for successfully allocated keys)
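
One consequence for the worker is that the response can be shorter than the request, so chunks have to be paired with slots by key. A minimal sketch, with `Slot` and `plan_writes` as hypothetical stand-ins for the slot metadata and the worker's write loop:

```python
from typing import Dict, List, NamedTuple, Tuple


class Slot(NamedTuple):
    """Hypothetical stand-in for the slot metadata returned by prepare_store."""
    key: str
    offset: int
    length: int


def plan_writes(chunks_by_key: Dict[str, bytes],
                slots: List[Slot]) -> List[Tuple[str, int, bytes]]:
    """Pair each returned slot with its chunk by key; skipped keys drop out."""
    plan = []
    for slot in slots:
        chunk = chunks_by_key[slot.key]
        if len(chunk) != slot.length:
            raise ValueError(
                f"chunk for {slot.key} is {len(chunk)} B, slot is {slot.length} B"
            )
        plan.append((slot.key, slot.offset, chunk))
    return plan


chunks = {"k0": b"aaaa", "k1": b"bbbb", "k2": b"cccc"}
# Pretend the server hit OUT_OF_MEMORY for k1 and returned only k0 and k2.
slots = [Slot("k0", 0, 4), Slot("k2", 4096, 4)]

plan = plan_writes(chunks, slots)
committed_keys = [key for key, _, _ in plan]

# Positional pairing would hand k1's chunk to k2's slot.
positional = [chunk for chunk, _ in zip(chunks.values(), slots)]
```

Only `committed_keys` should be sent in `commit_store`; the skipped key is simply never written or committed.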

### 3.3 Server Handler (Pseudocode)

```python
def prepare_store(self, keys, instance_id, shape, dtype):
    obj_keys = ipc_key_to_object_keys(...)
    layout_desc = get_layout_desc(...)

    # reserve_write returns only successfully allocated keys
    # OOM keys are filtered out (same as CUDA path)
    reserved_dict = self.storage_manager.reserve_write(obj_keys, layout_desc, "new")

    slots = []
    for obj_key, memory_obj in reserved_dict.items():
        slots.append(ShmSlotMetadata(
            key=str(obj_key),
            shm_name=self._shm_pool_name,
            offset=memory_obj.shm_offset,
            length=memory_obj.shm_byte_length,
            shape=list(memory_obj.tensor.shape),
            dtype=str(memory_obj.tensor.dtype).removeprefix("torch."),
        ))
    return PrepareStoreResponse(slots=slots)

def commit_store(self, keys):
    self.storage_manager.finish_write(keys)
    return True
```

### 3.4 Worker Side (Pseudocode)

```python
def submit_store_request(self, ...):
    device_synchronize(self._device_type)
    cpu_chunks = gather_chunks_to_cpu(self.kv_caches, block_ids, ...)

    # Ask server for L1 shm slots
    response = send_rpc(RequestType.PREPARE_STORE, [key, instance_id, shape, dtype])

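    # Only write keys that were successfully allocated (OOM keys are absent from
    # the response). Match by key rather than position: zipping cpu_chunks with
    # response.slots would misalign once any key is skipped.
    # `obj_key_strs` is assumed to hold str(obj_key) for each chunk, in order.
    chunk_by_key = dict(zip(obj_key_strs, cpu_chunks))
    for slot in response.slots:
        tensor_view = self._make_tensor_view(slot.offset, slot.length, slot.shape, slot.dtype)
        tensor_view.copy_(chunk_by_key[slot.key])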

    if response.slots:
        send_rpc(RequestType.COMMIT_STORE, [slot.key for slot in response.slots])
```

---

## 4. Retrieve Path (Two-Phase RPC, Symmetric to Store)

### 4.1 Complete Flow

```
┌──────────┐                              ┌──────────┐
│  Worker  │                              │  Server  │
└────┬─────┘                              └────┬─────┘
     │                                         │
     │  1. prepare_retrieve(keys)              │
     │────────────────────────────────────────→│
     │                                         │  Check L1 hit
     │                                         │  If L1 miss → prefetch from L2 → wait ready
     │                                         │  Manually acquire read_lock
     │                                         │  (NOT via context manager)
     │  2. response: [(key, offset, shape)]    │
     │     read_lock held for returned keys    │
     │←────────────────────────────────────────│
     │                                         │
     │  3. Worker reads directly from          │
     │     shm[offset] for each returned key   │
     │     tensor_view = frombuffer(...)       │
     │     scatter_to_kv(tensor_view)          │
     │                                         │
     │  4. finish_read(keys)                   │
     │────────────────────────────────────────→│
     │                                         │  finish_read() → release read_lock
     │                                         │  Eviction controller may now
     │                                         │  reclaim these slots
     │  5. response: success                   │
     │←────────────────────────────────────────│
```

### 4.2 Server Handler (Pseudocode)

```python
def prepare_retrieve(self, keys, instance_id):
    obj_keys = ipc_key_to_object_keys(...)

    # SHM path: manually acquire read locks WITHOUT using the context manager,
    # because the lock must be held until the worker calls finish_read().
    #
    # With the `with` statement, read_lock is auto-released on block exit —
    # that would free the slot before the worker has read the data.
    # Assumes unsafe_read_prefetched returns {obj_key: memory_obj} for the keys
    # it locked; a missing key means that key could not be read.
    read_results = self.storage_manager.unsafe_read_prefetched(obj_keys)
    if not read_results or len(read_results) != len(obj_keys):
        # Some keys missing: release any locks we did acquire
        if read_results:
            self.storage_manager.finish_read_prefetched(
                [k for k in obj_keys if k in read_results]
            )
        return PrepareRetrieveResponse(success=False, slots=[])

    slots = []
    for obj_key in obj_keys:  # keep request order so the worker can pair slots with blocks
        memory_obj = read_results[obj_key]
        slots.append(ShmSlotMetadata(
            key=str(obj_key),
            shm_name=self._shm_pool_name,
            offset=memory_obj.shm_offset,
            length=memory_obj.shm_byte_length,
            shape=list(memory_obj.tensor.shape),
            dtype=str(memory_obj.tensor.dtype).removeprefix("torch."),
        ))
    # read_locks are held here and will NOT be released until
    # the worker explicitly calls finish_read() after consuming the data.
    return PrepareRetrieveResponse(success=True, slots=slots)

def finish_read(self, keys):
    # Worker has finished reading from shm — now safe to release read_locks.
    # After this, eviction controller may reclaim these slots.
    self.storage_manager.finish_read_prefetched(keys)
```

> **Why not use `with read_prefetched_results()`?**
>
> The existing context manager (`read_prefetched_results`) is designed for the
> CUDA path where the server reads the data itself within the `with` block and
> then the lock is released on exit. In the shm path, data consumption happens
> in a **different process** (the worker) at a later time. Using the context
> manager would release the read_lock before the worker starts reading,
> allowing eviction to reclaim the slot and causing data corruption.
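
The difference between the two lock lifetimes can be shown with a toy model. Everything here is hypothetical (`ToyReadLock`, `scoped_read`, and the simplified handlers); it only illustrates when the lock is released relative to the response reaching the worker.

```python
from contextlib import contextmanager


class ToyReadLock:
    """Hypothetical reader counter; a slot is evictable only at zero readers."""

    def __init__(self) -> None:
        self.readers = 0

    def acquire(self) -> None:
        self.readers += 1

    def release(self) -> None:
        assert self.readers > 0, "release without matching acquire"
        self.readers -= 1

    @property
    def evictable(self) -> bool:
        return self.readers == 0


@contextmanager
def scoped_read(lock: ToyReadLock):
    """Context-manager style: the lock is dropped as soon as the block exits."""
    lock.acquire()
    try:
        yield
    finally:
        lock.release()


def prepare_retrieve_scoped(lock: ToyReadLock) -> dict:
    with scoped_read(lock):
        response = {"offset": 0}
    return response  # lock already released before the worker sees the offset


def prepare_retrieve_manual(lock: ToyReadLock) -> dict:
    lock.acquire()
    return {"offset": 0}  # lock stays held across the RPC boundary


def finish_read(lock: ToyReadLock) -> None:
    lock.release()


lock = ToyReadLock()

prepare_retrieve_scoped(lock)
evictable_after_scoped = lock.evictable    # slot unprotected while worker would read

prepare_retrieve_manual(lock)
evictable_while_reading = lock.evictable   # slot pinned until finish_read
finish_read(lock)
evictable_after_finish = lock.evictable
```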

### 4.3 Worker Side (Pseudocode)

```python
def get_finished(self, ...):
    response = rpc_result

    if response.success:
        for slot, block_ids in zip(response.slots, ...):
            tensor_view = self._make_tensor_view(slot.offset, slot.length, slot.shape, slot.dtype)
            scatter_cpu_chunks_to_kv(self.kv_caches, block_ids, [tensor_view], ...)
        send_rpc(RequestType.FINISH_READ, [keys])
```

---

## 5. Lock and Eviction (No Changes Needed — Built-in Protection)

| Scenario | Protection Mechanism |
|----------|---------------------|
| Worker writing to shm slot | write_lock prevents eviction and concurrent reads |
| Worker reading from shm slot | read_lock prevents eviction and concurrent writes |
| L1 memory full | Eviction controller selects unlocked objs via LRU |
| Worker crashes without finish_write/read | TTLLock auto-releases (write_ttl=600s, read_ttl=300s) |
| `reserve_write` OOM | Keys silently skipped (consistent with CUDA path); background eviction frees space over time |
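
To make the crash row concrete, here is a toy lock with an expiry deadline. `ToyTTLLock` is a hypothetical illustration and makes no claim about how LMCache's `TTLLock` is implemented; the injected clock is there only so the behavior can be checked without sleeping.

```python
import time
from typing import Callable, Optional


class ToyTTLLock:
    """Hypothetical lock whose hold expires after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self.expires_at: Optional[float] = None

    def lock(self) -> None:
        self.expires_at = self.clock() + self.ttl

    def unlock(self) -> None:
        self.expires_at = None

    def is_locked(self) -> bool:
        """A hold counts only until its deadline; a crashed holder times out."""
        return self.expires_at is not None and self.clock() < self.expires_at


now = [0.0]  # fake clock, advanced by hand
write_lock = ToyTTLLock(ttl=600.0, clock=lambda: now[0])

write_lock.lock()                 # worker reserves a slot, then crashes before commit
now[0] = 599.0
locked_before_ttl = write_lock.is_locked()
now[0] = 600.0
locked_after_ttl = write_lock.is_locked()
```

A worker that is slow rather than crashed would still be touching the slot after expiry, so the TTL has to comfortably exceed the worst-case copy time.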

### reserve_write Internal Flow (Recap)

```python
# Inside L1Manager.reserve_write():
for key in keys:
    if key exists and mode == "new": skip
    if key locked: skip

err, allocated_objs = memory_manager.allocate(layout_desc, count)
if err == OUT_OF_MEMORY:
    # Keys marked as OUT_OF_MEMORY, silently skipped by caller
    # Background eviction controller runs every 1s at watermark
    return {key: (OUT_OF_MEMORY, None) for key in need_to_allocate}

entry.write_lock.lock()
return (SUCCESS, memory_obj)
```

---

## 6. Protocol Changes

### 6.1 New RequestTypes

```python
class RequestType(enum.Enum):
    # ... existing ...
    PREPARE_STORE = enum.auto()
    COMMIT_STORE = enum.auto()
    PREPARE_RETRIEVE = enum.auto()
    FINISH_READ = enum.auto()
```

### 6.2 Metadata Structures

```python
class ShmSlotMetadata(msgspec.Struct):
    key: str
    shm_name: str
    offset: int
    length: int
    shape: list[int]
    dtype: str

class PrepareStoreResponse(msgspec.Struct):
    slots: list[ShmSlotMetadata]     # Only successfully allocated keys

class PrepareRetrieveResponse(msgspec.Struct):
    success: bool
    slots: list[ShmSlotMetadata]
```

### 6.3 No Path Selection — SHM Is the Only CPU Bounce Path

There is no configuration toggle or runtime path selection. The L1 memory pool is always backed by named shared memory. The previous pickle-over-ZMQ and ring buffer paths for CPU store/retrieve are fully replaced.

The CUDA GPU path (direct `memcpy_async` between GPU and L1 within the server process) is **completely unaffected** and continues to work exactly as before.

---

## 7. File Change Summary

| File | Changes |
|------|---------|
| `lmcache/v1/distributed/memory_manager.py` | Change `L1MemoryManager.__init__` to use `shm_open` + `mmap(MAP_SHARED)` instead of anonymous mmap; add `/dev/shm` capacity check (fail-fast); expose `shm_name` via `get_shm_pool_info()`; add `shm_unlink` in `close()` |
| `lmcache/v1/memory_management.py` | Add `shm_offset` and `shm_byte_length` fields to `MemoryObj`; populate offset during slab allocation |
| `lmcache/v1/multiprocess/server.py` | Add `prepare_store` / `commit_store` / `prepare_retrieve` / `finish_read` handlers; return shm_name + pool_size in `register_kv_cache` response |
| `lmcache/v1/multiprocess/protocols/base.py` | Add 4 new RequestType entries |
| `lmcache/v1/multiprocess/protocols/engine.py` | Add 4 new protocol definitions; add `ShmSlotMetadata`, `PrepareStoreResponse`, `PrepareRetrieveResponse` |
| `lmcache/integration/vllm/vllm_multi_process_adapter.py` | Worker attach shm during `register_kv_caches`; implement `_make_tensor_view`; two-phase RPC store/retrieve replacing pickle path |
| `lmcache/v1/multiprocess/cpu_bounce_context.py` | Remove ring buffer logic; replace with shm tensor view gather/scatter |

**Files that do NOT need changes**: Slab allocator internals, eviction controller, lock management (TTLLock), CUDA GPU path, StoreController (L1→L2), PrefetchController (L2→L1).

---

## 8. Error Handling and Edge Cases

| Scenario | Handling |
|----------|----------|
| `/dev/shm` insufficient at startup | **Fail-fast** with RuntimeError and actionable message (Docker `--shm-size`, K8s `emptyDir`) |
| Stale shm from previous crash | `shm_unlink` then recreate on startup |
| Worker attach fails (shm not found) | Wait for server ready or raise `FileNotFoundError` with clear message |
| Worker attach size mismatch | Raise `ValueError` to prevent offset out-of-bounds |
| Worker crashes without `finish_write` | TTLLock timeout (600s) auto-releases write_lock |
| Worker crashes without `finish_read` | TTLLock timeout (300s) auto-releases read_lock |
| Server crashes | Worker RPC timeout; cleanup script handles `shm_unlink` |
| `reserve_write` OOM | Keys silently skipped, consistent with CUDA path; background eviction frees space; OOM keys not stored to L2 |
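| shm deleted externally at runtime | Existing mappings stay valid after an unlink, so attached processes keep working, but new workers fail to attach; requires server restart to recreate the segment |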
| Docker environment | Documentation and error messages must clearly state `--shm-size` requirement |

---

## 9. Test Plan

**Unit Tests**
- shm create / attach / unlink lifecycle
- Startup fail-fast when `/dev/shm` is insufficient
- Offset allocation and tensor view construction consistency
- Batch allocation + batch deallocation
- OOM scenario: verify keys are silently skipped (not retried, not stored to L2)
- TTLLock timeout auto-release
- Stale shm cleanup on restart

**Integration Tests**
- Multiple workers concurrently attach to the same shm pool
- Complete store two-phase RPC flow: `prepare_store` → worker memcpy → `commit_store`
- Complete retrieve two-phase RPC flow: `prepare_retrieve` → worker read → `finish_read`
- Store with partial OOM: verify only allocated keys are stored and committed
- Retrieve with missing keys: verify partial failure handling and lock cleanup
- Worker crash → lock auto-release → slot becomes evictable
- End-to-end: store via shm → L2 async store triggered → evict from L1 → retrieve from L2 via prefetch → read via shm

**Performance Tests**
- 128k token long-prompt store/retrieve latency comparison against previous pickle path
- Throughput comparison under sustained load
- Multi-worker concurrent stress test

**Regression Tests**
- CUDA GPU path full regression (must remain completely unaffected)

---

## 10. Performance Estimation

| Parameter | Value |
|-----------|-------|
| 128k tokens × 70KB/token | ~8.75 GiB |
| DDR5 single-socket bandwidth | ~40 GB/s |
| One copy latency (shm path) | ~220 ms |
| Two+ copies latency (previous pickle path) | ~440 ms |
| Two ZMQ IPC RPC round-trips (prepare + commit) | ~0.1 ms |
| **Net savings per request** | **~220 ms** |
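
The arithmetic behind the table can be reproduced as below. The sketch assumes "128k" means 131,072 tokens and "70KB" means 70 KiB (the reading that gives exactly 8.75 GiB), takes the ~40 GB/s figure as decimal gigabytes per second, and models the pickle path as exactly two copies.

```python
KIB = 1024
GIB = 1024 ** 3

tokens = 128 * 1024                  # "128k" taken as 131,072 tokens
bytes_per_token = 70 * KIB           # "70KB" taken as 70 KiB
payload_bytes = tokens * bytes_per_token
payload_gib = payload_bytes / GIB

bandwidth_bytes_per_s = 40e9         # ~40 GB/s, decimal units

one_copy_ms = payload_bytes / bandwidth_bytes_per_s * 1e3
two_copy_ms = 2 * one_copy_ms
rpc_ms = 0.1                         # two ZMQ IPC round-trips, from the table
net_savings_ms = two_copy_ms - one_copy_ms - rpc_ms

print(f"payload      : {payload_gib:.2f} GiB")
print(f"one copy     : {one_copy_ms:.0f} ms")
print(f"two copies   : {two_copy_ms:.0f} ms")
print(f"net savings  : {net_savings_ms:.0f} ms")
```

With strict units a single copy comes out near 235 ms; the table's ~220 ms corresponds to dividing 8.75 by 40 directly, that is, treating GiB and GB as interchangeable. Either way the saving is roughly one full copy per request.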

---

## 11. Future Extensions (Not In Scope)

- **CUDA pinned memory**: `cudaHostRegister` on shm buffer for faster GPU DMA (requires evaluating `ulimit -l` limits)
- **HugePage alignment**: Use 2MB HugePages for shm mmap to reduce TLB misses
- **NUMA affinity**: Bind shm to specific NUMA node in multi-socket environments
- **Slab compaction**: Handle fragmentation after long-running operation
