GitHub - jooho-XCENA/maru: High-Performance KV Cache Storage Engine on CXL Shared Memory for LLM Inference

Maru: High-Performance KV Cache Storage Engine on CXL Shared Memory

Maru is a high-performance KV cache storage engine built on CXL shared memory, designed for LLM inference scenarios where multiple instances need to share a KV cache with minimal latency.

Existing KV cache sharing solutions assume that sharing means transferring: copying data across the network, byte by byte. As models get larger and contexts get longer, that assumption becomes a structural bottleneck. Maru rejects the premise entirely: don't move data, share the memory. Instances read and write KV cache data directly in CXL shared memory, and only lightweight metadata (tens of bytes) travels between components.

Figure: KV cache sharing without Maru (left) and with Maru (right). With Maru there are no copies, just direct access to CXL shared memory.
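
To put the gap between payload and metadata in numbers, the sketch below estimates KV cache size with the usual per-token formula (a K and a V tensor per layer, each of size KV heads × head dimension × bytes per element). The model shape and the 64-byte metadata record are illustrative assumptions, not Maru measurements.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape (assumption): 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)

context_tokens = 32 * 1024
payload = per_token * context_tokens

METADATA_BYTES = 64  # assumed size of one metadata record ("tens of bytes")

print(f"KV per token: {per_token // 1024} KiB")
print(f"KV for {context_tokens} tokens: {payload / 2**30:.1f} GiB")
print(f"payload / metadata ratio: {payload // METADATA_BYTES:,}x")
```

With these assumed numbers a single token occupies 128 KiB and a 32K-token context occupies 4 GiB, which is the payload a transfer-based design has to move for each consumer.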


Why Maru?

Zero-Copy Sharing: Transfer-based systems, whether CPU-mediated or GPU-direct, require the receiver to allocate staging buffers and move data across an interconnect. Maru eliminates this entire path: every instance reads from the same shared memory region directly. No buffer allocation, no data copy, no serialization.
Scales with Context Length and Concurrency: Network-based sharing degrades as contexts grow and more consumers hit the same KV. Maru never fans out KV payloads, so scaling is bounded by shared-memory bandwidth, not network transfer.
Higher Hardware Utilization: Instead of duplicating KV caches per instance, all instances draw from a shared CXL pool. Less duplication means more usable memory and higher effective cache capacity.
Lower System Energy: Eliminating bulk data transfer cuts NIC and CPU power draw. Shorter data paths also reduce GPU idle time per request.
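
The fan-out and duplication arguments above can be made concrete with a deliberately simple cost model. The function names, the 1 GiB payload, and the 64-byte metadata record are assumptions for illustration. The model ignores protocol overhead and says nothing about measured Maru performance.

```python
GIB = 1 << 30


def transfer_model(kv_bytes: int, consumers: int) -> dict:
    """Transfer-based sharing: each consumer receives and holds its own copy."""
    return {
        "bytes_moved": kv_bytes * consumers,
        # One copy at the producer plus one per consumer.
        "bytes_resident": kv_bytes * (consumers + 1),
    }


def shared_model(kv_bytes: int, consumers: int, metadata_bytes: int = 64) -> dict:
    """Shared-memory sharing: one resident copy, only metadata reaches consumers."""
    return {
        "bytes_moved": metadata_bytes * consumers,
        "bytes_resident": kv_bytes,
    }


for n in (1, 4, 16):
    t = transfer_model(GIB, n)
    s = shared_model(GIB, n)
    print(f"{n:>2} consumers: transfer moves {t['bytes_moved'] / GIB:.0f} GiB, "
          f"shared moves {s['bytes_moved']} B; "
          f"resident {t['bytes_resident'] / GIB:.0f} GiB "
          f"vs {s['bytes_resident'] / GIB:.0f} GiB")
```

In the shared model the consumers still read the payload, so the cost moves to shared-memory bandwidth rather than disappearing, which matches the scaling bound stated above.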

Overview

flowchart TB
    subgraph S1["Server 1"]
        direction TB
        I1(["LLM Instance"])
        H1{{"MaruHandler"}}
        I1 --- H1
    end
    subgraph S2["Server 2"]
        direction TB
        I2(["LLM Instance"])
        H2{{"MaruHandler"}}
        I2 --- H2
    end
    subgraph S3["Server 3"]
        direction TB
        I3(["LLM Instance"])
        H3{{"MaruHandler"}}
        I3 --- H3
    end

    M["Maru Control Plane"]

    subgraph CXL["CXL Shared Memory"]
        KV["KV Cache"]
    end

    H1 & H2 & H3 <-.->|"store/retrieve"| M
    H1 & H2 & H3 <==>|"direct read/write"| CXL
    M -.->|"manage"| CXL

Control Plane (dashed arrows): KV metadata operations and region allocation.

Data Plane (thick arrows): direct, zero-copy access to CXL shared memory. The data path is identical regardless of control plane mode.
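
The split between the two planes can be illustrated with nothing but the Python standard library. In the sketch below an ordinary temporary file stands in for the CXL DAX device, two `mmap` mappings stand in for two LLM instances, and a plain dict stands in for the control plane. None of this is Maru's implementation; the names `demo_shared_pool`, `POOL_SIZE`, and `index` are hypothetical.

```python
import mmap
import tempfile

POOL_SIZE = 1 << 20  # 1 MiB stand-in pool


def demo_shared_pool() -> bytes:
    # An ordinary temporary file stands in for the CXL DAX device.
    with tempfile.TemporaryFile() as pool_file:
        pool_file.truncate(POOL_SIZE)
        fd = pool_file.fileno()

        # Two independent shared mappings play the role of two LLM instances.
        producer = mmap.mmap(fd, POOL_SIZE)
        consumer = mmap.mmap(fd, POOL_SIZE)
        try:
            # Data plane, producer side: write the payload in place.
            payload = b"kv-block-0" * 100
            offset = 4096
            producer[offset:offset + len(payload)] = payload

            # Control plane: only this small record would travel between components.
            index = {42: (offset, len(payload))}

            # Data plane, consumer side: slice a view over the same bytes.
            off, length = index[42]
            with memoryview(consumer) as whole:
                view = whole[off:off + length]
                head = bytes(view[:10])  # copy 10 bytes out just to inspect them
                view.release()
            return head
        finally:
            consumer.close()
            producer.close()


print(demo_shared_pool())
```

The consumer never receives the 1000-byte payload through the index. It receives an offset and a length, then reads the bytes through its own mapping of the same memory.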

Quick Start

Prerequisites

OS: Ubuntu 24.04 LTS+
Python: 3.12+
gcc: 13.3.0+, cmake: 3.28.3+
CXL DAX device (/dev/dax*) or emulation environment

sudo apt-get update
sudo apt-get install -y python3 python3-venv python3-pip git \
    build-essential cmake libnuma-dev
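
Before installing, it can help to confirm that the host meets the prerequisites listed above. The following is a minimal sketch, not part of Maru: `check_prerequisites` is a hypothetical helper that checks the Python version, looks for `gcc` and `cmake` on `PATH`, and looks for a `/dev/dax*` device. It does not verify the gcc or cmake versions, and it reports a missing DAX device even when an emulation environment is in use.

```python
import glob
import shutil
import sys


def check_prerequisites() -> dict:
    """Return a name -> bool map of basic host checks."""
    return {
        "python>=3.12": sys.version_info >= (3, 12),
        "gcc on PATH": shutil.which("gcc") is not None,
        "cmake on PATH": shutil.which("cmake") is not None,
        "dax device": bool(glob.glob("/dev/dax*")),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':<8}{name}")
```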

Installation

git clone https://github.com/xcena-dev/maru
cd maru

python3 -m venv .venv
source .venv/bin/activate
./install.sh

Verify the Maru Resource Manager daemon is running:

systemctl status maru-resourced

Start Services

# Start MaruServer (metadata server)
maru-server

# With custom host/port
maru-server --host 0.0.0.0 --port 5555
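
When the server is started from a script, clients may try to connect before it is listening. The sketch below polls the port until a TCP connection succeeds. It assumes that `maru-server` accepts TCP connections on the configured port, which the `tcp://` URL used in Basic Usage suggests. `wait_for_port` is a hypothetical helper, and a successful connect only shows that something is listening, not that the server is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # not listening yet; retry shortly
    return False


if __name__ == "__main__":
    ready = wait_for_port("localhost", 5555, timeout=5.0)
    print("server reachable" if ready else "server not reachable")
```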

Basic Usage

from maru import MaruConfig, MaruHandler

config = MaruConfig(
    server_url="tcp://localhost:5555",
    pool_size=1024 * 1024 * 100,  # 100 MiB
)

with MaruHandler(config) as handler:
    data = b"A" * (1024 * 1024)  # 1 MiB KV chunk

    # 1. Allocate a page in CXL shared memory
    handle = handler.alloc(size=len(data))

    # 2. Write directly to CXL memory (mmap, no intermediate buffer)
    handle.buf[:] = data

    # 3. Register the key: only metadata (key → region, offset) is sent
    handler.store(key=42, handle=handle)

    # Retrieve: the result exposes a memoryview (result.view) pointing into CXL memory
    result = handler.retrieve(key=42)
    assert result is not None
    assert bytes(result.view[:5]) == b"AAAAA"
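
KV chunks are usually numeric tensors rather than byte strings. The sketch below shows how a writable byte buffer can be viewed as float32 values and modified in place, using a `bytearray` as a stand-in for `handle.buf`. It assumes that `handle.buf` supports the Python buffer protocol, which the slice assignment above suggests but this README does not state. `as_float32` is a hypothetical helper, not a Maru API.

```python
def as_float32(buf) -> memoryview:
    """Return a float32 view over a writable byte buffer without copying it.

    The buffer length must be a multiple of 4 bytes.
    """
    return memoryview(buf).cast("f")


stand_in = bytearray(16)  # plays the role of handle.buf (assumption)

view = as_float32(stand_in)
view[0] = 1.5    # writes land directly in the underlying buffer
view[3] = -2.0

print(view.tolist())
view.release()
```

Because the view aliases the buffer, values written through it are visible to any other reader of the same memory without a separate copy step.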

LMCache Integration

Maru works as a drop-in remote storage backend for LMCache via the maru:// URL scheme. It supports both P2P KV cache sharing and disaggregated prefill scenarios.

# LMCache config
remote_url: "maru://localhost:5555"
extra_config:
  maru_pool_size: "4G"

For full configuration details, see the documentation.
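
The two examples express pool size differently: `MaruConfig` takes `pool_size` as an integer, while the LMCache config uses the string "4G". The helper below is a hypothetical sketch for converting between the two forms. It assumes binary units (1G = 1024³ bytes); how Maru's LMCache backend actually parses `maru_pool_size` is not stated here.

```python
import re

# Assumed binary multipliers; the real parser may differ.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text: str) -> int:
    """Convert a size string such as "4G" or "100M" to a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?)B?\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


print(parse_size("4G"))
print(parse_size("100M") == 1024 * 1024 * 100)
```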

