feat: fleet-wide L2 usage tracking and quota-based eviction #2

Open
aeon-x wants to merge 60 commits into dev from feat/l2-usage-tracking-eviction

aeon-x (Owner) commented Jun 10, 2026

Summary

  • Add per-cache_salt L2 usage accounting, quota management, and LRU eviction to the MP coordinator
  • MP servers report L2 store/lookup events via a batching L2EventListener; the coordinator aggregates usage, enforces quotas, and selects LRU keys to evict
  • Reuses QuotaManager and IsolatedLRUEvictionPolicy from the distributed layer instead of reimplementing them
  • Includes REST endpoints (/l2/quota, /l2/events, /l2/status), config flags (--coordinator-l2-event-reporting, --coordinator-l2-event-flush-interval), and a design doc
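
As a rough illustration of the batching idea in the second bullet, the sketch below buffers store/lookup events and hands them to a sender in batches. It is not the PR's `L2EventListener`: the names `Event`, `BatchingReporter`, `send`, and `max_batch` are hypothetical stand-ins, and the real listener is presumably flushed on a timer (`--coordinator-l2-event-flush-interval`) rather than only by buffer size.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the PR's UsageEvent schema may differ.
    event_type: str  # "store" or "lookup"
    cache_salt: str
    key: str
    size_bytes: int = 0


class BatchingReporter:
    """Buffers events and forwards them in batches (illustrative only)."""

    def __init__(self, send: Callable[[List[Event]], None], max_batch: int = 3):
        self._send = send
        self._max_batch = max_batch
        self._buf: List[Event] = []
        self._lock = threading.Lock()

    def record(self, event: Event) -> None:
        # Append under the lock; flush once the buffer reaches max_batch.
        with self._lock:
            self._buf.append(event)
            full = len(self._buf) >= self._max_batch
        if full:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer out under the lock, then send outside the lock
        # so a slow sender does not block callers of record().
        with self._lock:
            batch, self._buf = self._buf, []
        if batch:
            self._send(batch)
```

A periodic timer calling `flush()` would cover the interval-based case; that part is omitted here to keep the sketch short.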

Key changes

  • Coordinator L2 subsystem (lmcache/v1/mp_coordinator/l2/): L2UsageManager, L2EvictionManager (wraps IsolatedLRUEvictionPolicy for per-salt LRU), L2EventListener (batching reporter)
  • Quota management: Reuses QuotaManager from lmcache.v1.distributed.quota_manager (allowlist semantics — unregistered salts default to 0 limit)
  • Eviction logic: Aligned with L2EvictionController — watermark trigger (usage >= watermark * quota) and eviction by key count ratio (default 0.2)
  • REST API (lmcache/v1/mp_coordinator/http_apis/l2_api.py): quota CRUD, event ingestion, combined status queries; _default path sentinel maps to empty-string salt
  • Internal types: ObjectKey used throughout the coordinator; CacheKey only at API boundary for JSON serialization
  • Schemas (schemas.py): CacheKey, EventType enum, UsageEvent, quota/status models
  • MP server wiring (http_server.py): creates event listener and registers it on storage manager when l2_event_reporting is enabled
  • Interface update (L2AdapterListener.on_l2_keys_stored): now passes sizes alongside keys
  • Config (MPCoordinatorConfig): added trigger_watermark (default 1.0), changed eviction_ratio default from 0.5 to 0.2
  • Design doc (docs/design/v1/mp_coordinator/l2_usage_and_eviction.md)
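
The quota and eviction bullets above can be read as the following sketch. It is one interpretation of this description, not the code in `L2EvictionManager` or `QuotaManager`: `SaltState`, `quota_for`, `should_evict`, and `pick_victims` are hypothetical names, and rounding the victim count up is an assumption.

```python
import math
from collections import OrderedDict
from typing import Dict, List


class SaltState:
    """Per-salt key sizes in LRU order (hypothetical helper)."""

    def __init__(self) -> None:
        # Oldest entry first, most recently used last.
        self.lru: "OrderedDict[str, int]" = OrderedDict()

    @property
    def usage(self) -> int:
        return sum(self.lru.values())

    def on_store(self, key: str, size: int) -> None:
        self.lru[key] = size
        self.lru.move_to_end(key)

    def on_lookup(self, key: str) -> None:
        # A lookup refreshes recency for keys that are tracked.
        if key in self.lru:
            self.lru.move_to_end(key)


def quota_for(quotas: Dict[str, int], salt: str) -> int:
    # Allowlist semantics: an unregistered salt gets a limit of 0.
    return quotas.get(salt, 0)


def should_evict(usage: int, quota: int, watermark: float = 1.0) -> bool:
    # Watermark trigger as written in the description.
    return usage >= watermark * quota


def pick_victims(state: SaltState, ratio: float = 0.2) -> List[str]:
    # Evict a fraction of the keys by count, least recently used first.
    # Rounding up is an assumption made for this sketch.
    n = math.ceil(len(state.lru) * ratio)
    return list(state.lru)[:n]
```

With five stored keys and the default ratio of 0.2, this selects the single least recently used key once usage reaches the quota.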

Test plan

  • Unit tests for L2UsageManager, L2EvictionManager (LRU ordering, eviction ratio, watermark trigger, multi-salt independence, no-quota/zero-quota eviction)
  • Integration tests for L2 REST API (quota CRUD, event ingestion, status queries, validation, _default salt sentinel)
  • Manual: verify MP server registers events with coordinator when --coordinator-l2-event-reporting is set

🤖 Generated with Claude Code

maobaolong and others added 25 commits June 7, 2026 08:42
* feat: add POSIX SHM infra for CPU KV-cache IPC

- lmcache/v1/multiprocess/posix_shm.py: thin POSIX-SHM facade
  (shm_create_readwrite / shm_map_readwrite / shm_munmap / shm_unlink /
  shm_open_pool_as_mmap) routing through CPython's _posixshmem to
  avoid macOS EACCES and shutdown BufferError issues
- lmcache/v1/platform/cpu/shm.py: CpuShmTensorWrapper + migrate_to_shm_and_wrap
  for zero-copy CPU KV-cache IPC mirroring CUDA-IPC semantics
- lmcache/v1/platform/cpu/__init__.py: self-register cpu factory with
  platform registry
- tests/v1/multiprocess/test_posix_shm.py: unit tests for posix_shm
- tests/v1/platform/test_cpu_shm.py: unit tests for CpuShmTensorWrapper

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* address comment

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* address comment

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* assert zero storage_offset before SHM migration

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* add warning logs to swallowed exceptions in posix_shm

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…nector __str__ (LMCache#3577)

Normalize flat/nested block_ids in flat_block_ids and connector __str__

Older vLLM connectors emit a flat list[int] for the single non-hybrid
group, while newer ones use nested list[list[int]]. Make flat_block_ids
and the three LMCacheMPConnectorMetadata.__str__ paths tolerate both,
matching the normalization already done in expand_block_ids_to_views().
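
A minimal sketch of the flat/nested tolerance described here, assuming block IDs arrive as plain Python lists. `to_nested` and `flatten` are hypothetical helper names, not the actual `flat_block_ids` implementation:

```python
from typing import List, Sequence, Union

BlockIds = Union[Sequence[int], Sequence[Sequence[int]]]


def to_nested(block_ids: BlockIds) -> List[List[int]]:
    """Return block IDs as a list of groups, whichever shape came in."""
    # Flat input: a single non-hybrid group of ints.
    if len(block_ids) > 0 and isinstance(block_ids[0], int):
        return [list(block_ids)]
    # Nested input (or empty): copy each group as-is.
    return [list(group) for group in block_ids]


def flatten(block_ids: BlockIds) -> List[int]:
    """Flatten either shape into one list of block IDs."""
    return [b for group in to_nested(block_ids) for b in group]
```

An empty input is returned as an empty list of groups; whether the real code treats that as one empty group is not stated in the commit message.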

Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: feixiangpeng <155504520+feixiangpeng@users.noreply.github.com>
…gned KV reuse (LMCache#3582)

* [blend-v3] Token-level matching + per-token slot scatter for CB reuse

Match fingerprints at token stride (probe_stride=1) and scatter reused
KV with the per-token slot kernel (multi_layer_kv_transfer) instead of
matching/scattering at vLLM block granularity. This lets CacheBlend
reuse non-block-aligned matches, the common case for real workloads
where the shared body starts at an arbitrary token offset (a partial
vLLM block) rather than a chunk/block boundary.

- register_rope: probe_stride = 1 (find matches at any token offset)
- cb_unified_lookup: accept non-prefix matches at any cur_st (drop the
  chunk-alignment filter)
- cb_retrieve_pre_computed: per-token slot scatter of the full matched
  range. Partial vLLM blocks are written per slot, so matched and
  recomputed tokens sharing a block don't conflict. Removes the
  block-aligned drop checks and the now-dead whole-block scatter path.

Validated on prefix-suffix-tuner (non-block-aligned by construction):
~99% suffix hit, 3.91x TTFT vs full recompute, output matches the
full-recompute baseline. The slot kernel is bandwidth-bound and matches
the whole-block kernel's throughput (~700 GB/s), so no scatter overhead.

Signed-off-by: deng451e <838677410@qq.com>

* [blend-v3] Vectorize V3 matcher probe; drop obsolete probe stride

Token-level matching (probe_stride=1) had turned match_sub_sequence into
an O(tokens) pure-Python probe loop — ~5.7 ms at 32K context, ~7x the old
block-stride cost. Replace it with a vectorized direct-address probe
(numpy gather over all positions) plus a verify loop over only the
surviving hits; the table is sparse (TABLE_SIZE = 2^20 >> registered
chunks) so the hit set is tiny. This restores the base class's
vectorization that the V3 override had dropped, keeping full-hash
collision rejection.

Probe stride is now obsolete (we always scan every position), so the
_probe_stride field, ctor arg, and register_rope assignment are removed.

Matcher microbench (CPU, per lookup): 32K ctx 5.66 -> 0.83 ms (~7x),
20K 3.43 -> 0.52, 8K 1.39 -> 0.23 — back to the pre-token-scatter
block-stride baseline with full token-level matching. All 20
test_optimized_lookup_v3 tests pass.

Signed-off-by: deng451e <838677410@qq.com>

* update

Signed-off-by: deng451e <838677410@qq.com>

* update stale docstring

Signed-off-by: deng451e <838677410@qq.com>

---------

Signed-off-by: deng451e <838677410@qq.com>
…brid models (LMCache#3557)

Signed-off-by: ApostaC <yihua@tensormesh.ai>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Signed-off-by: ApostaC <yihua@tensormesh.ai>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…d_dim/CLA calculation, add i18n UI (LMCache#2834)

* examples(kv_cache_calculator): add Hunyuan & DeepSeek models, UI i18n, prefer local modelconfig

Signed-off-by: KimmoZAG <995496585@qq.com>

* fix(kv_cache_calculator): use prefix match for DeepSeek V3 variants; consolidate head_dim logic

Signed-off-by: KimmoZAG <995496585@qq.com>

---------

Signed-off-by: KimmoZAG <995496585@qq.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…ocess_adapter (LMCache#3478)

Signed-off-by: Yujie Liu <milan021007@163.com>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* feat(mp): add SHM-based NonGpuContext (server-side copy)  (LMCache#3346)

* feat(mp): add SHM-based NonGpuContext (server-side copy)

Porting upstream PR LMCache#3328 (commit 2/2)

Adapted to current branch:

- non_cuda_equivalents.py changes redirected to python_ops_fallback.py (renamed)

- test_cache.py changes redirected to bench/test_cache.py (relocated)

- skipped manual registration in cli/commands/__init__.py (now uses dynamic discovery)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Refactor

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* address gemini review on shm + stage_block_ids

- shm.py: hook munmap+shm_unlink via weakref.finalize so mmap and SHM
  segments are released when migrated tensors / to_tensor views are GC-ed
- shm.py: stop using id(tensor) as registry key; clear stale entries on
  finalize and use a monotonic counter for SHM names so id reuse can't
  trigger EEXIST in shm_open(O_EXCL)
- shm.py: use numel*element_size for the wrapped tensor byte count so
  views of larger storages are sized correctly
- cache_context.py: reject empty/None block_ids and bound-check against
  block_ids_buffer_ in stage_block_ids

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* shm: guard fd/mmap with try/finally on error paths

ensure shm_create_readwrite / shm_map_readwrite never leak the fd
or the mmap when ftruncate or mmap fails. also rename _nbytes to
nbytes so the test can read it without poking a private attribute.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* shm: validate cached entry via weakref to defeat id reuse

Cache id(tensor)->(weakref, name) instead of id->name. Lookups
verify ref() is the same tensor before reusing the cached SHM
name; a stale entry left behind by a GC'd tensor whose id has
since been recycled now reads as a miss instead of crashing the
next migration with EEXIST.

Adds inject_stale_cache_entry_for_test so the new regression
test can simulate id recycling without poking module-private
state.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* address comments

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Address comment

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…#3433)

Signed-off-by: zhengfeihe <hezhengfei1999@gmail.com>
… path (LMCache#3600)

* refactor: utilize multi_layer_block_kv_transfer ops for data transfer path

Consolidate the data transfer path by utilizing the `multi_layer_block_kv_transfer` operation.
This update allows a single op to support both handle and data paths simultaneously,
streamlining the underlying transfer logic.

Signed-off-by: Tony Lin <tony.lin@intel.com>

* more comments for clarity

Signed-off-by: Tony Lin <tony.lin@intel.com>

* refactor(test): skip blocks-first fused KV tests on non-CPU devices

- Apply module-level pytestmark to skip all tests in this file when
  torch_device_type is not 'cpu', as the blocks-first fused shape
  (Format 10) is currently CPU-only.
- Move pytestmark to the top of the file for better clarity and
  correct test execution control.

Signed-off-by: Tony Lin <tony.lin@intel.com>

* add GPUKVFormat.NL_X_NB_NH_BS_TWO_HS in python fallback path

Signed-off-by: Tony Lin <tony.lin@intel.com>

* properly handle cuda kernel's limitation

Signed-off-by: Tony Lin <tony.lin@intel.com>

* fix bug

Signed-off-by: Tony Lin <tony.lin@intel.com>

---------

Signed-off-by: Tony Lin <tony.lin@intel.com>
…3596)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Reduce ci cpu e2e test memory request

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* use python to compute kv cache bytes so float values work

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Javen Ke <javen@arcfra.com>
Signed-off-by: Shaoting-Feng <shaotingf@tensormesh.ai>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
try:
    _quota_store(request).set(cache_salt, limit_bytes)
except ValueError as exc:
    return JSONResponse(status_code=400, content={"error": str(exc)})
Signed-off-by: aeon-x <talexcao@gmail.com>
@aeon-x aeon-x force-pushed the feat/l2-usage-tracking-eviction branch from 480228a to efa6900 Compare June 10, 2026 23:46
chunxiaozheng and others added 2 commits June 10, 2026 23:52
* refactor: refactor query cli

Signed-off-by: idellzheng <idellzheng@tencent.com>

* refactor: refactor trace cli

Signed-off-by: idellzheng <idellzheng@tencent.com>

* bugfix

Signed-off-by: idellzheng <idellzheng@tencent.com>

---------

Signed-off-by: idellzheng <idellzheng@tencent.com>
…end (LMCache#2418)

* [Core] Add multipath KV-cache offloading support in LMCache NIXL backend


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Address feedback: add validate_nixl_path helper function and update NixlFilePool path handling


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Addresses PR feedback for documentation, unit tests, and formatting


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Use metadata.worker_id for path sharding instead of torch.cuda.current_device()

For CPU-buffer backends (POSIX, HF3FS), initialize_allocator does not call
torch.cuda.set_device(), so torch.cuda.current_device() may return 0 for
all workers, defeating multipath sharding. Replace with metadata.worker_id
which reliably distinguishes workers regardless of CUDA state.


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Use local_worker_id instead of worker_id for path sharding

In multi-node deployments, worker_id is the global rank which causes
inconsistent path distribution across nodes. local_worker_id is the
local GPU ID on the node, ensuring each node's GPUs map to paths
consistently (e.g. GPU 0 -> path0, GPU 1 -> path1 on every node).


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Fix code formatting (ruff format)


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Remove redundant assert in NixlFilePool.__init__

validate_nixl_path already checks for None path with a more
informative error message, making this assertion unnecessary.


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Add test_nixl_multipath.py to Buildkite unit-test ignore list

test_nixl_multipath.py imports NixlStorageConfig from
nixl_storage_backend.py, which has top-level nixl C extension imports.
When the nixl native bindings cannot fully load in the CI environment,
this causes an ImportError during pytest collection, and --maxfail=1
immediately aborts the entire test suite.

This matches the existing ignore for test_nixl_storage.py.


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Rebase NIXL multipath support to use PathSharder

- Remove validate_nixl_path method from NixlStorageConfig
- Update NixlFilePool to accept PathSharder instance
- Update createPool to use PathSharder with buffer_device
- Update tests to use PathSharder directly

This aligns with PR LMCache#2982 which centralized path sharding logic.


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Fix cursor bot issues: nixl_path validation and multipath sharding

1. Add validation to ensure nixl_path is not None
   - Add assert in from_cache_engine_config to validate nixl_path
   - Add assert in createPool as additional safeguard
   - Prevents TypeError when PathSharder receives None value

2. Fix CPU buffer device multipath sharding issue
   - Pass f'cuda:{metadata.worker_id}' to PathSharder instead of buffer_device
   - Ensures proper path selection based on worker_id for by_gpu sharding
   - Agent still uses correct buffer_device for memory allocation

These fixes resolve both high-severity issues identified by cursor bot.


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Add docstring and warning for nixl createPool path sharding

- Document createPool arguments/returns and path sharding behavior
- Warn when list paths contain commas (may affect sharding); PathSharder unchanged


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Resolve merge conflicts in nixl_storage_backend.py


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* fix: resolve buildkite pipeline merge conflict


Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>

* Fix: Restore missing use_hugepages extraction from merge conflict resolution

During the merge conflict resolution in commit 01da799, the use_hugepages
extraction line was accidentally deleted. This line is part of the huge pages
feature (commit a68bd0a) and is needed for the NIXL backend to properly
allocate CPU memory with hugepages support.

Changes:
- Restore use_hugepages: bool field in NixlStorageConfig dataclass
- Restore use_hugepages extraction: extra_config.get("nixl_use_hugepages", False)
- Remove unused import sys from test file (auto-fixed by ruff)

Signed-off-by: Ugur Kaynar <Ugur.kaynar@dell.com>

* [Core] Fix NIXL multipath PR: lint, tests, and dead code

Make CI green for the multipath KV-cache offloading change:

- createPool: remove the duplicate `elif backend in ("OBJ","AZURE_BLOB")`
  branch and the unreachable `return NixlFilePool(...)` left over from a
  merge; collapse back to the single OBJ/AZURE_BLOB/DOCA_MEMOS object-pool
  branch (no behavior change — OBJ/AZURE_BLOB still get b128=False).
- NixlDynamicStorageBackend: reject a list `nixl_path` at init. The dynamic
  backend uses self.path directly as a single directory, and path sharding
  across multiple paths is only implemented for static pools. This narrows
  self.path to str and fixes the three mypy str|list[str] arg-type errors
  by failing loud instead of silently mishandling a list.
- test_nixl_doca_memos: pass the new createPool path_sharding/dst_device
  args (ignored for object backends) to fix the missing-positional-arg
  failures.
- test_nixl_posix_backend_multipath: use the valid 5-element kv_shape
  torch.Size([4, 2, 256, 8, 128]) like the other run()-based tests; the
  previous [2048, 2048] shape crashed in metadata.get_shapes().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Core] NIXL: don't require nixl_path for non-file backends

The multipath change added an unconditional
`assert path is not None, "nixl_path cannot be None"` in
NixlStorageConfig.from_cache_engine_config, which broke object/CPU
backends that legitimately have no path (e.g. OBJ/DOCA_MEMOS) — this is
what caused the test_nixl_shared_pool.py failures on CI.

Remove the unconditional assert; the existing conditional check already
requires a path only for the file backends that need one:

    if backend in ("GDS", "GDS_MT", "POSIX", "HF3FS"):
        assert path is not None, f"nixl_path must be provided for {backend} backend"

This restores the pre-PR behavior (path optional for object backends).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

---------

Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
Signed-off-by: Ugur Kaynar <Ugur.kaynar@dell.com>
Signed-off-by: Emine Ugur Kaynar <Ugur.Kaynar@dell.com>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
abinggo and others added 21 commits June 11, 2026 02:26
…ave (fixes LMCache#3318) (LMCache#3325)

Signed-off-by: abinggo <107740309+abinggo@users.noreply.github.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
…gh) changes (LMCache#3274)

* rust: Add io_uring sync write path for checkpoint

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>

* enable io_uring from raw block plugin

The io_uring changes were omitted during the MP mode integration
This commit partially adds them back.
The request batching is only done for headers and payloads, since
for MP mode we need to order the requests as they may send together.
This will be fixed later.

Fixes: LMCache#3119

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* Add nvme helpers to enable io_uring command

This adds the required nvme helpers for getting namespace information,
lba size etc. to enable io_uring command support.

NVMe io_uring command support (io_uring_cmd) enables asynchronous,
low-latency passthrough of NVMe commands directly from user space,
bypassing file system and most of block layer overhead.
Introduced in Linux kernel 5.19, it allows using IORING_OP_URING_CMD
for raw NVMe commands, featuring big SQE (128 bytes) / CQE (32 bytes)
support for larger command structures.

Unlike the block device interface, it requires the NVMe namespace character
device (/dev/ngXnY).

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Daegyu Han <daegyu94.han@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* rust Add io_uring command read write support

NVMe io_uring command utilizes
 * big Submission Queue Entries 128 bytes, standard is 64 bytes
 * and, big Completion Queue Entries 32 bytes, standard is 16 bytes.

The NVMe command is embedded within the last 80 bytes of the submission
queue entry.

The io_uring worker thread has been rebased for better readability.

The nvme namespace character device doesn't support I/O
sizes greater than /sys/block/nvmeXnY/queue/max_hw_sectors_kb
Usually the block layer handles I/O split for any requests
larger than this limitation, which is not there for char devices

To handle this, add support for I/O splitting based on a user-specified
maximum data transfer limit, "rust_raw_block.max_data_transfer_size".
If not specified, commands are split automatically based on the queue
max_hw_sectors_kb limit.

Added a comprehensive test suite for uring_command.
Expanded the raw block l2_adapter tests.

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Daegyu Han <daegyu94.han@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* doc: Add missing io_uring and use_uring_cmd docs

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>

* fix (rawblock): callback for succeeded keys before raising error

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>

---------

Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: Daegyu Han <daegyu94.han@samsung.com>
…instead of inferring from cache_config.block_size (LMCache#3616)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Rui Zhang <zrfishnoodles@gmail.com>
Signed-off-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com>
Co-authored-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com>
* [cli] add quota management commands (set/get/list/delete)

Signed-off-by: idellzheng <idellzheng@tencent.com>

* checkstyle error fix

Signed-off-by: idellzheng <idellzheng@tencent.com>

---------

Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
…he#3404)

* feat: add Google Cloud Bigtable remote storage connector

- Integrate thread-safe gRPC AsyncPQExecutor for Point-Reads and Batched Mutations

- Implement numerical-precision partial chunk reshaping falling back to CPU recompute

- Incorporate 20MB MutateRow request payload thresholds and 10s TTLCache local shielding

- Include 100% portable dynamic gapic mock-isolated PyTest unit test coverage

- Append comprehensive user guide and integration documentation

- Add google-cloud-bigtable SDK to common requirements to resolve upstream GitHub Actions CI ImportErrors

- Implement graceful FileNotFoundError fallback for credentials_path to ensure 100% Buildkite K3 CI pipeline resilience

- Wrap all internal logger and warning strings to strictly comply with LMCache's 88-character ruff cap

Signed-off-by: An Nguyen <annenguyen@google.com>

* update docs & address comment

Signed-off-by: An Nguyen <annenguyen@google.com>

* add bigtable bench

Signed-off-by: An Nguyen <annenguyen@google.com>

* fix mypy type checking for bigtable connector mock namespace

Signed-off-by: An Nguyen <annenguyen@google.com>

* fix: make bigtable max_chunk_size_mb default value consistent and add unit tests

Signed-off-by: An Nguyen <annenguyen@google.com>

* test: Add Bigtable Emulator integration tests and fix remove_sync context issues

Signed-off-by: An Nguyen <annenguyen@google.com>

* feat(storage): cache TableAsync and optimize remove_sync to fire-and-forget in Bigtable connector

- Lazily initialize and cache TableAsync in BigtableConnector to prevent memory registry leaks.
- Optimize remove_sync to be fire-and-forget to avoid blocking caller thread on the critical path, yielding a ~20% throughput improvement.
- Update unit tests to poll-wait for background deletion task before asserting.

Signed-off-by: An Nguyen <annenguyen@google.com>

* test: skip/reuse Bigtable emulator in CI, resolve mock pollution, and fix style lints

Signed-off-by: An Nguyen <annenguyen@google.com>

---------

Signed-off-by: An Nguyen <annenguyen@google.com>
Signed-off-by: ApostaC <yihua@tensormesh.ai>
…mba models (LMCache#3645)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: aeon-x <talexcao@gmail.com>
@aeon-x aeon-x force-pushed the feat/l2-usage-tracking-eviction branch from 819ad30 to a8a21ff Compare June 11, 2026 23:26
aeon-x and others added 8 commits June 11, 2026 16:51
Signed-off-by: aeon-x <talexcao@gmail.com>
…notable speedup (LMCache#3591)

* Perf: optimize Python fallback block transfer for 3x speedup

- Optimize fallback block-id and D2H staging overhead
- Restructure per-layer transfer loops to iterate over objects first
  then layers

Signed-off-by: Tony Lin <tony.lin@intel.com>

* apply gemini's suggestion

Signed-off-by: Tony Lin <tony.lin@intel.com>

* optimize flash_infer block transfer paths in python fallback

Signed-off-by: Tony Lin <tony.lin@intel.com>

---------

Signed-off-by: Tony Lin <tony.lin@intel.com>
* ci: add cpu device e2e test

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci: alias vllm-cpu-nightly dist-info as vllm to fix CLI version lookup

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci: macOS job also installs vllm-cpu-nightly from PyPI

Drop the in-CI git+url build, drop the manual pip cache step (now
handled by setup-python's cache: pip), reuse the same dist-info alias
trick as ubuntu so importlib.metadata.version('vllm') works.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci: tag aliased vllm dist-info with +cpu so platform plugin activates

vllm.platforms.cpu_platform_plugin() decides whether the CPU platform
is available by checking 'cpu' in importlib.metadata.version('vllm').
Our build script strips the +cpu local label before upload (PyPI
rejects local versions), so the alias version was just a date string
without 'cpu', making the plugin return None and 'vllm serve' fail
with 'Failed to infer device type'. Re-tag the alias copy with +cpu;
the original dist-info is untouched.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci(buildkite): use unsafe-best-match for uv when installing vllm-cpu-nightly

uv's default first-index strategy locked setuptools to whatever the
pytorch CPU index serves (<=70.2.0), so vllm-cpu-nightly's pinned
setuptools==80.10.2 could not be satisfied. Tell uv to consider the
full cross-index version pool just for this install.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci(buildkite): alias vllm-cpu-nightly dist-info to vllm with +cpu tag

vllm CLI calls importlib.metadata.version('vllm'), but our wheel
registers as vllm-cpu-nightly so the lookup raises PackageNotFoundError
and 'vllm serve' dies. Same fix already applied in the GH Actions
cpu_device.yml — port it to the buildkite script.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci(buildkite): drop stale CpuCacheContext check in handle transport verify

handle (server-side copy) now goes through ShmTransferStrategy after
the non_gpu_transfer refactor; CpuCacheContext is no longer
instantiated on this path. Match on the actual log line
'Using shm non-GPU transfer strategy' instead.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci(buildkite): verify transport via worker 'Creating transfer context' line

Step 5.5 transport-mode verification was checking server-side strategy
strings, but for handle mode the worker enters HandleTransferContext
which goes through gpu_transfer.py, not non_gpu_transfer.py - so the
shm strategy line never shows up. Switch to grepping the worker's own
'Creating transfer context (device_type=*, mode=*)' log line, which
is the single source of truth for which TransferContext got created.
Also split the previously-conflated 'auto' and 'handle' branches: on
CPU, auto falls back to DataTransferContext, while handle stays as
HandleTransferContext.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci(buildkite): grep handle/auto transport verify in vllm log, not lmcache log

Worker is a child of vllm serve, so its 'Creating transfer context'
line goes to VLLM_LOG (vllm stdout), not LMCACHE_LOG (lmcache server
stdout). Step 5.5 was grepping the wrong file.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci(cpu): factor out shared install/download scripts

Pull the duplicated vLLM-CPU install + dist-info alias, lmcache CPU install and opt-125m download out of cpu_device.yml and run-cpu-e2e-validation.sh into three small scripts under .github/scripts/. Also collapse the ubuntu/macos jobs into a single matrix job, and trim a few overly long inline comments.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* ci(cpu): generalize download script; define wait_for_metric_change

- Rename download_opt125m.sh -> download_model.sh and accept the repo id as a positional arg (or via MODEL_ID), so the script is reusable for other models.
- Add the missing wait_for_metric_change helper that run-cpu-e2e-validation.sh has been calling but never defined; previously bash silently swallowed it via '|| true'.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* improve

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Address comment, remove SKIP_CACHE_HIT_VALIDATION env var

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Address comment, move scripts together

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Jinwoo Jeong <jwjeong@csl.korea.ac.kr>
Signed-off-by: aeon-x <talexcao@gmail.com>