feat: fleet-wide L2 usage tracking and quota-based eviction by aeon-x · Pull Request #2 · aeon-x/LMCache

aeon-x · 2026-06-10T23:39:00Z

Summary

Add per-cache_salt L2 usage accounting, quota management, and LRU eviction to the MP coordinator
MP servers report L2 store/lookup events via a batching L2EventListener; the coordinator aggregates usage, enforces quotas, and selects LRU keys to evict
Reuses QuotaManager and IsolatedLRUEvictionPolicy from the distributed layer instead of reimplementing them
Includes REST endpoints (/l2/quota, /l2/events, /l2/status), config flags (--coordinator-l2-event-reporting, --coordinator-l2-event-flush-interval), and a design doc
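
A minimal sketch of what a batching event reporter could look like, assuming events are buffered in memory and handed to a sender once a flush interval has elapsed. `UsageEvent`'s fields, `BatchingReporter`, and the `send` callback are hypothetical names for illustration; they are not the PR's `L2EventListener` API.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class UsageEvent:
    # Hypothetical event shape; the PR's real schema lives in schemas.py.
    event_type: str  # "store" or "lookup"
    cache_salt: str
    key: str
    size_bytes: int = 0


class BatchingReporter:
    """Buffers events and hands them to `send` in batches (illustrative)."""

    def __init__(
        self,
        send: Callable[[List[UsageEvent]], None],
        flush_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._flush_interval = flush_interval
        self._clock = clock
        self._lock = threading.Lock()
        self._pending: List[UsageEvent] = []
        self._last_flush = clock()

    def record(self, event: UsageEvent) -> None:
        # Append under the lock, then flush outside it if the interval elapsed.
        with self._lock:
            self._pending.append(event)
            due = self._clock() - self._last_flush >= self._flush_interval
        if due:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer out under the lock so `send` runs without holding it.
        with self._lock:
            batch, self._pending = self._pending, []
            self._last_flush = self._clock()
        if batch:
            self._send(batch)


# Demo with a fake clock so the behaviour is deterministic.
sent: List[List[UsageEvent]] = []
now = [0.0]
reporter = BatchingReporter(sent.append, flush_interval=5.0, clock=lambda: now[0])
reporter.record(UsageEvent("store", "tenant-a", "k1", 4096))
reporter.record(UsageEvent("lookup", "tenant-a", "k1"))
now[0] = 6.0  # interval has elapsed; the next record triggers a flush
reporter.record(UsageEvent("store", "tenant-a", "k2", 2048))
```

Flushing on the recording thread keeps the sketch short; a real reporter would more likely flush from a background thread or timer so that store and lookup paths never block on the network.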

Key changes

Coordinator L2 subsystem (lmcache/v1/mp_coordinator/l2/): L2UsageManager, L2EvictionManager (wraps IsolatedLRUEvictionPolicy for per-salt LRU), L2EventListener (batching reporter)
Quota management: Reuses QuotaManager from lmcache.v1.distributed.quota_manager (allowlist semantics — unregistered salts default to 0 limit)
Eviction logic: Aligned with L2EvictionController — watermark trigger (usage >= watermark * quota) and eviction by key count ratio (default 0.2)
REST API (lmcache/v1/mp_coordinator/http_apis/l2_api.py): quota CRUD, event ingestion, combined status queries; _default path sentinel maps to empty-string salt
Internal types: ObjectKey used throughout the coordinator; CacheKey only at API boundary for JSON serialization
Schemas (schemas.py): CacheKey, EventType enum, UsageEvent, quota/status models
MP server wiring (http_server.py): creates event listener and registers it on storage manager when l2_event_reporting is enabled
Interface update (L2AdapterListener.on_l2_keys_stored): now passes sizes alongside keys
Config (MPCoordinatorConfig): added trigger_watermark (default 1.0), changed eviction_ratio default from 0.5 to 0.2
Design doc (docs/design/v1/mp_coordinator/l2_usage_and_eviction.md)
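
The quota and eviction rules listed above can be sketched roughly as follows. This is an illustrative stand-in, not the PR's `L2EvictionManager` or `IsolatedLRUEvictionPolicy`: the class `PerSaltLRU`, the helper `salt_from_path`, the rounding rule (at least one key per eviction round), and the guard that skips salts with no stored data are all assumptions. Only the trigger condition (`usage >= watermark * quota`), the key-count ratio, the zero default for unregistered salts, and the `_default` sentinel come from the description.

```python
from collections import OrderedDict
from typing import Dict, List

DEFAULT_SALT_SENTINEL = "_default"  # path segment named in the PR description


def salt_from_path(segment: str) -> str:
    """Map the URL path sentinel to the empty-string salt."""
    return "" if segment == DEFAULT_SALT_SENTINEL else segment


class PerSaltLRU:
    """Per-salt usage tracking with LRU victim selection (illustrative)."""

    def __init__(self, trigger_watermark: float = 1.0,
                 eviction_ratio: float = 0.2) -> None:
        self.trigger_watermark = trigger_watermark
        self.eviction_ratio = eviction_ratio
        # Allowlist semantics: a salt missing from this dict has quota 0.
        self.quotas: Dict[str, int] = {}
        # salt -> keys ordered from least to most recently used, with sizes.
        self.keys: Dict[str, "OrderedDict[str, int]"] = {}

    def on_store(self, salt: str, key: str, size: int) -> None:
        lru = self.keys.setdefault(salt, OrderedDict())
        lru[key] = size
        lru.move_to_end(key)

    def on_lookup(self, salt: str, key: str) -> None:
        lru = self.keys.get(salt)
        if lru is not None and key in lru:
            lru.move_to_end(key)

    def usage(self, salt: str) -> int:
        return sum(self.keys.get(salt, {}).values())

    def should_evict(self, salt: str) -> bool:
        quota = self.quotas.get(salt, 0)
        used = self.usage(salt)
        # Assumption: nothing to do when the salt holds no data.
        return used > 0 and used >= self.trigger_watermark * quota

    def select_victims(self, salt: str) -> List[str]:
        if not self.should_evict(salt):
            return []
        lru = self.keys[salt]
        # Assumption: evict at least one key per round.
        count = max(1, int(len(lru) * self.eviction_ratio))
        return list(lru)[:count]


policy = PerSaltLRU()
policy.quotas["tenant-a"] = 100
for i in range(1, 6):
    policy.on_store("tenant-a", f"k{i}", 20)   # usage reaches the quota
policy.on_lookup("tenant-a", "k1")             # k1 becomes most recent
victims_a = policy.select_victims("tenant-a")

policy.on_store("tenant-b", "x", 1)            # tenant-b has no quota entry
victims_b = policy.select_victims("tenant-b")
```

Because salts are tracked in separate ordered maps, heavy traffic on one salt never reorders or evicts another salt's keys, which is the independence property the test plan checks.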

Test plan

Unit tests for L2UsageManager, L2EvictionManager (LRU ordering, eviction ratio, watermark trigger, multi-salt independence, no-quota/zero-quota eviction)
Integration tests for L2 REST API (quota CRUD, event ingestion, status queries, validation, _default salt sentinel)
Manual: verify MP server registers events with coordinator when --coordinator-l2-event-reporting is set

🤖 Generated with Claude Code

* feat: add POSIX SHM infra for CPU KV-cache IPC
  - lmcache/v1/multiprocess/posix_shm.py: thin POSIX-SHM facade (shm_create_readwrite / shm_map_readwrite / shm_munmap / shm_unlink / shm_open_pool_as_mmap) routing through CPython's _posixshmem to avoid macOS EACCES and shutdown BufferError issues
  - lmcache/v1/platform/cpu/shm.py: CpuShmTensorWrapper + migrate_to_shm_and_wrap for zero-copy CPU KV-cache IPC mirroring CUDA-IPC semantics
  - lmcache/v1/platform/cpu/__init__.py: self-register cpu factory with platform registry
  - tests/v1/multiprocess/test_posix_shm.py: unit tests for posix_shm
  - tests/v1/platform/test_cpu_shm.py: unit tests for CpuShmTensorWrapper
  Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* address comment
  Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* address comment
  Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* assert zero storage_offset before SHM migration
  Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* add warning logs to swallowed exceptions in posix_shm
  Signed-off-by: baoloongmao <baoloongmao@tencent.com>
---------
Signed-off-by: baoloongmao <baoloongmao@tencent.com>

…nector __str__ (LMCache#3577) Normalize flat/nested block_ids in flat_block_ids and connector __str__ Older vLLM connectors emit a flat list[int] for the single non-hybrid group, while newer ones use nested list[list[int]]. Make flat_block_ids and the three LMCacheMPConnectorMetadata.__str__ paths tolerate both, matching the normalization already done in expand_block_ids_to_views(). Signed-off-by: Tony Lin <tony.lin@intel.com>
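
As a rough illustration of the flat/nested normalization this commit describes, the sketch below accepts either shape and always returns the nested form. The helper names `normalize_block_ids` and `flatten_block_ids` are hypothetical and are not the functions changed in the commit.

```python
from typing import List, Sequence, Union

BlockIds = Union[Sequence[int], Sequence[Sequence[int]]]


def normalize_block_ids(block_ids: BlockIds) -> List[List[int]]:
    """Return block ids as one list per group, whichever shape came in."""
    if len(block_ids) == 0:
        return []
    if isinstance(block_ids[0], int):
        # Flat list[int]: treat it as the single non-hybrid group.
        return [list(block_ids)]
    # Already nested list[list[int]]: copy each group.
    return [list(group) for group in block_ids]


def flatten_block_ids(block_ids: BlockIds) -> List[int]:
    """Concatenate all groups into one flat list of block ids."""
    return [b for group in normalize_block_ids(block_ids) for b in group]


flat = normalize_block_ids([1, 2, 3])
nested = normalize_block_ids([[1, 2], [3]])
```

Normalizing once at the boundary means downstream code only has to handle the nested shape.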

…gned KV reuse (LMCache#3582) * [blend-v3] Token-level matching + per-token slot scatter for CB reuse Match fingerprints at token stride (probe_stride=1) and scatter reused KV with the per-token slot kernel (multi_layer_kv_transfer) instead of matching/scattering at vLLM block granularity. This lets CacheBlend reuse non-block-aligned matches, the common case for real workloads where the shared body starts at an arbitrary token offset (a partial vLLM block) rather than a chunk/block boundary. - register_rope: probe_stride = 1 (find matches at any token offset) - cb_unified_lookup: accept non-prefix matches at any cur_st (drop the chunk-alignment filter) - cb_retrieve_pre_computed: per-token slot scatter of the full matched range. Partial vLLM blocks are written per slot, so matched and recomputed tokens sharing a block don't conflict. Removes the block-aligned drop checks and the now-dead whole-block scatter path. Validated on prefix-suffix-tuner (non-block-aligned by construction): ~99% suffix hit, 3.91x TTFT vs full recompute, output matches the full-recompute baseline. The slot kernel is bandwidth-bound and matches the whole-block kernel's throughput (~700 GB/s), so no scatter overhead. Signed-off-by: deng451e <838677410@qq.com> * [blend-v3] Vectorize V3 matcher probe; drop obsolete probe stride Token-level matching (probe_stride=1) had turned match_sub_sequence into an O(tokens) pure-Python probe loop — ~5.7 ms at 32K context, ~7x the old block-stride cost. Replace it with a vectorized direct-address probe (numpy gather over all positions) plus a verify loop over only the surviving hits; the table is sparse (TABLE_SIZE = 2^20 >> registered chunks) so the hit set is tiny. This restores the base class's vectorization that the V3 override had dropped, keeping full-hash collision rejection. Probe stride is now obsolete (we always scan every position), so the _probe_stride field, ctor arg, and register_rope assignment are removed. 
Matcher microbench (CPU, per lookup): 32K ctx 5.66 -> 0.83 ms (~7x), 20K 3.43 -> 0.52, 8K 1.39 -> 0.23 — back to the pre-token-scatter block-stride baseline with full token-level matching. All 20 test_optimized_lookup_v3 tests pass. Signed-off-by: deng451e <838677410@qq.com> * update Signed-off-by: deng451e <838677410@qq.com> * update stale docstring Signed-off-by: deng451e <838677410@qq.com> --------- Signed-off-by: deng451e <838677410@qq.com>

…d_dim/CLA calculation, add i18n UI (LMCache#2834)
* examples(kv_cache_calculator): add Hunyuan & DeepSeek models, UI i18n, prefer local modelconfig
  Signed-off-by: KimmoZAG <995496585@qq.com>
* fix(kv_cache_calculator): use prefix match for DeepSeek V3 variants; consolidate head_dim logic
  Signed-off-by: KimmoZAG <995496585@qq.com>
---------
Signed-off-by: KimmoZAG <995496585@qq.com>

* feat(mp): add SHM-based NonGpuContext (server-side copy) (LMCache#3346) * feat(mp): add SHM-based NonGpuContext (server-side copy) Porting upstream PR LMCache#3328 (commit 2/2) Adapted to current branch: - non_cuda_equivalents.py changes redirected to python_ops_fallback.py (renamed) - test_cache.py changes redirected to bench/test_cache.py (relocated) - skipped manual registration in cli/commands/__init__.py (now uses dynamic discovery) Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Refactor Signed-off-by: baoloongmao <baoloongmao@tencent.com> * address gemini review on shm + stage_block_ids - shm.py: hook munmap+shm_unlink via weakref.finalize so mmap and SHM segments are released when migrated tensors / to_tensor views are GC-ed - shm.py: stop using id(tensor) as registry key; clear stale entries on finalize and use a monotonic counter for SHM names so id reuse can't trigger EEXIST in shm_open(O_EXCL) - shm.py: use numel*element_size for the wrapped tensor byte count so views of larger storages are sized correctly - cache_context.py: reject empty/None block_ids and bound-check against block_ids_buffer_ in stage_block_ids Signed-off-by: baoloongmao <baoloongmao@tencent.com> * shm: guard fd/mmap with try/finally on error paths ensure shm_create_readwrite / shm_map_readwrite never leak the fd or the mmap when ftruncate or mmap fails. also rename _nbytes to nbytes so the test can read it without poking a private attribute. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * shm: validate cached entry via weakref to defeat id reuse Cache id(tensor)->(weakref, name) instead of id->name. Lookups verify ref() is the same tensor before reusing the cached SHM name; a stale entry left behind by a GC'd tensor whose id has since been recycled now reads as a miss instead of crashing the next migration with EEXIST. Adds inject_stale_cache_entry_for_test so the new regression test can simulate id recycling without poking module-private state. 
Signed-off-by: baoloongmao <baoloongmao@tencent.com> * address comments Signed-off-by: baoloongmao <baoloongmao@tencent.com> --------- Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Address comment Signed-off-by: baoloongmao <baoloongmao@tencent.com> --------- Signed-off-by: baoloongmao <baoloongmao@tencent.com>
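
The id-reuse problem addressed in this commit can be illustrated with a small sketch. It caches `id(obj) -> (weakref, name)` and treats an entry as valid only if the weak reference still resolves to the same object. `Buffer`, `shm_name_for`, and the `/demo_shm_` name prefix are hypothetical stand-ins; no real shared memory is created here.

```python
import itertools
import weakref
from typing import Dict, Tuple


class Buffer:
    """Stand-in for a tensor; plain class instances support weak references."""


_counter = itertools.count()  # monotonic, so names never repeat
_cache: Dict[int, Tuple[weakref.ref, str]] = {}


def shm_name_for(obj: Buffer) -> str:
    """Return the cached name for `obj`, or allocate a fresh one."""
    entry = _cache.get(id(obj))
    if entry is not None:
        ref, name = entry
        if ref() is obj:
            return name  # genuine hit: same live object
        # Otherwise the entry belongs to a dead object whose id was recycled;
        # fall through and treat it as a miss.
    name = f"/demo_shm_{next(_counter)}"
    _cache[id(obj)] = (weakref.ref(obj), name)
    return name


a = Buffer()
first = shm_name_for(a)
second = shm_name_for(a)  # same object, so the cached name is reused

# Simulate id recycling: plant an entry for a dead object under b's id.
b = Buffer()
dead = Buffer()
dead_ref = weakref.ref(dead)
del dead
_cache[id(b)] = (dead_ref, "/stale")
fresh = shm_name_for(b)  # the stale entry is rejected, a new name is issued
```

Keying on `id()` alone is unsafe because CPython may hand the same id to a new object once the old one is collected; the identity check on the weak reference is what turns that case into a cache miss.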

… path (LMCache#3600) * refactor: utilize multi_layer_block_kv_transfer ops for data transfer path Consolidate the data transfer path by utilizing the `multi_layer_block_kv_transfer` operation. This update allows a single op to support both handle and data paths simultaneously, streamlining the underlying transfer logic. Signed-off-by: Tony Lin <tony.lin@intel.com> * more comments for clarity Signed-off-by: Tony Lin <tony.lin@intel.com> * refactor(test): skip blocks-first fused KV tests on non-CPU devices - Apply module-level pytestmark to skip all tests in this file when torch_device_type is not 'cpu', as the blocks-first fused shape (Format 10) is currently CPU-only. - Move pytestmark to the top of the file for better clarity and correct test execution control. Signed-off-by: Tony Lin <tony.lin@intel.com> * add GPUKVFormat.NL_X_NB_NH_BS_TWO_HS in python fallback path Signed-off-by: Tony Lin <tony.lin@intel.com> * properly handle cuda kernel's limitation Signed-off-by: Tony Lin <tony.lin@intel.com> * fix bug Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com>

* Reduce ci cpu e2e test memory request
  Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* use python to compute kv cache bytes so float values work
  Signed-off-by: baoloongmao <baoloongmao@tencent.com>
---------
Signed-off-by: baoloongmao <baoloongmao@tencent.com>

+    try:
+        _quota_store(request).set(cache_salt, limit_bytes)
+    except ValueError as exc:
+        return JSONResponse(status_code=400, content={"error": str(exc)})


Signed-off-by: aeon-x <talexcao@gmail.com>

* refactor: refactor query cli
  Signed-off-by: idellzheng <idellzheng@tencent.com>
* refactor: refactor trace cli
  Signed-off-by: idellzheng <idellzheng@tencent.com>
* bugfix
  Signed-off-by: idellzheng <idellzheng@tencent.com>
---------
Signed-off-by: idellzheng <idellzheng@tencent.com>

…end (LMCache#2418) * [Core] Add multipath KV-cache offloading support in LMCache NIXL backend Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Address feedback: add validate_nixl_path helper function and update NixlFilePool path handling Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Addresses PR feedback for documentation, unit tests, and formatting Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Use metadata.worker_id for path sharding instead of torch.cuda.current_device() For CPU-buffer backends (POSIX, HF3FS), initialize_allocator does not call torch.cuda.set_device(), so torch.cuda.current_device() may return 0 for all workers, defeating multipath sharding. Replace with metadata.worker_id which reliably distinguishes workers regardless of CUDA state. Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Use local_worker_id instead of worker_id for path sharding In multi-node deployments, worker_id is the global rank which causes inconsistent path distribution across nodes. local_worker_id is the local GPU ID on the node, ensuring each node's GPUs map to paths consistently (e.g. GPU 0 -> path0, GPU 1 -> path1 on every node). Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Fix code formatting (ruff format) Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Remove redundant assert in NixlFilePool.__init__ validate_nixl_path already checks for None path with a more informative error message, making this assertion unnecessary. Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Add test_nixl_multipath.py to Buildkite unit-test ignore list test_nixl_multipath.py imports NixlStorageConfig from nixl_storage_backend.py, which has top-level nixl C extension imports. When the nixl native bindings cannot fully load in the CI environment, this causes an ImportError during pytest collection, and --maxfail=1 immediately aborts the entire test suite. This matches the existing ignore for test_nixl_storage.py. 
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Rebase NIXL multipath support to use PathSharder - Remove validate_nixl_path method from NixlStorageConfig - Update NixlFilePool to accept PathSharder instance - Update createPool to use PathSharder with buffer_device - Update tests to use PathSharder directly This aligns with PR LMCache#2982 which centralized path sharding logic. Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Fix cursor bot issues: nixl_path validation and multipath sharding 1. Add validation to ensure nixl_path is not None - Add assert in from_cache_engine_config to validate nixl_path - Add assert in createPool as additional safeguard - Prevents TypeError when PathSharder receives None value 2. Fix CPU buffer device multipath sharding issue - Pass f'cuda:{metadata.worker_id}' to PathSharder instead of buffer_device - Ensures proper path selection based on worker_id for by_gpu sharding - Agent still uses correct buffer_device for memory allocation These fixes resolve both high-severity issues identified by cursor bot. Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Add docstring and warning for nixl createPool path sharding - Document createPool arguments/returns and path sharding behavior - Warn when list paths contain commas (may affect sharding); PathSharder unchanged Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Resolve merge conflicts in nixl_storage_backend.py Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * fix: resolve buildkite pipeline merge conflict Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> * Fix: Restore missing use_hugepages extraction from merge conflict resolution During the merge conflict resolution in commit 01da799, the use_hugepages extraction line was accidentally deleted. This line is part of the huge pages feature (commit a68bd0a) and is needed for the NIXL backend to properly allocate CPU memory with hugepages support. 
Changes: - Restore use_hugepages: bool field in NixlStorageConfig dataclass - Restore use_hugepages extraction: extra_config.get("nixl_use_hugepages", False) - Remove unused import sys from test file (auto-fixed by ruff) Signed-off-by: Ugur Kaynar <Ugur.kaynar@dell.com> * [Core] Fix NIXL multipath PR: lint, tests, and dead code Make CI green for the multipath KV-cache offloading change: - createPool: remove the duplicate `elif backend in ("OBJ","AZURE_BLOB")` branch and the unreachable `return NixlFilePool(...)` left over from a merge; collapse back to the single OBJ/AZURE_BLOB/DOCA_MEMOS object-pool branch (no behavior change — OBJ/AZURE_BLOB still get b128=False). - NixlDynamicStorageBackend: reject a list `nixl_path` at init. The dynamic backend uses self.path directly as a single directory, and path sharding across multiple paths is only implemented for static pools. This narrows self.path to str and fixes the three mypy str|list[str] arg-type errors by failing loud instead of silently mishandling a list. - test_nixl_doca_memos: pass the new createPool path_sharding/dst_device args (ignored for object backends) to fix the missing-positional-arg failures. - test_nixl_posix_backend_multipath: use the valid 5-element kv_shape torch.Size([4, 2, 256, 8, 128]) like the other run()-based tests; the previous [2048, 2048] shape crashed in metadata.get_shapes(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Samuel Shen <slshen@tensormesh.ai> * [Core] NIXL: don't require nixl_path for non-file backends The multipath change added an unconditional `assert path is not None, "nixl_path cannot be None"` in NixlStorageConfig.from_cache_engine_config, which broke object/CPU backends that legitimately have no path (e.g. OBJ/DOCA_MEMOS) — this is what caused the test_nixl_shared_pool.py failures on CI. 
Remove the unconditional assert; the existing conditional check already requires a path only for the file backends that need one: if backend in ("GDS", "GDS_MT", "POSIX", "HF3FS"): assert path is not None, f"nixl_path must be provided for {backend} backend" This restores the pre-PR behavior (path optional for object backends). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Samuel Shen <slshen@tensormesh.ai> --------- Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com> Signed-off-by: Ugur Kaynar <Ugur.kaynar@dell.com> Signed-off-by: Emine Ugur Kaynar <Ugur.Kaynar@dell.com> Signed-off-by: Samuel Shen <slshen@tensormesh.ai> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
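
The move from the global `worker_id` to `local_worker_id` for path sharding can be pictured with a small sketch. The function `pick_path` and the modulo wrap-around for workers beyond the number of paths are assumptions for illustration; the commits only state that GPU 0 maps to path0 and GPU 1 to path1 on every node, and the actual logic lives in `PathSharder`.

```python
from typing import Sequence


def pick_path(paths: Sequence[str], local_worker_id: int) -> str:
    """Choose a storage path from the node-local worker index (illustrative)."""
    if not paths:
        raise ValueError("at least one path is required")
    if local_worker_id < 0:
        raise ValueError("local_worker_id must be non-negative")
    # Assumption: wrap around when there are more workers than paths.
    return paths[local_worker_id % len(paths)]


paths = ["/mnt/nvme0", "/mnt/nvme1"]
# Every node sees local ids 0 and 1, so the mapping is the same on each node,
# whereas a global rank would shift the mapping from one node to the next.
node_mapping = [pick_path(paths, local_id) for local_id in range(2)]
```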

…gh) changes (LMCache#3274)
* rust: Add io_uring sync write path for checkpoint
  Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
* enable io_uring from raw block plugin
  The io_uring changes were omitted during the MP mode integration. This commit partially adds them back. The request batching is only done for headers and payloads, since for MP mode we need to order the requests as they may be sent together. This will be fixed later.
  Fixes: LMCache#3119
  Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com> Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
* Add nvme helpers to enable io_uring command
  This adds the required NVMe helpers for getting namespace information, LBA size, etc. to enable io_uring command support. NVMe io_uring command support (io_uring_cmd) enables asynchronous, low-latency passthrough of NVMe commands directly from user space, bypassing the file system and most of the block layer overhead. Introduced in Linux kernel 5.19, it allows using IORING_OP_URING_CMD for raw NVMe commands, featuring big SQE (128 bytes) / CQE (32 bytes) support for larger command structures. Unlike the block device interface, it requires the NVMe namespace character device (/dev/ngXnY).
  Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com> Signed-off-by: Daegyu Han <daegyu94.han@samsung.com> Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
* rust: Add io_uring command read write support
  NVMe io_uring command utilizes
  - big Submission Queue Entries: 128 bytes (standard is 64 bytes), and
  - big Completion Queue Entries: 32 bytes (standard is 16 bytes).
  The NVMe command is embedded within the last 80 bytes of the submission queue entry. The io_uring worker thread has been rebased for better readability.
  The NVMe namespace character device doesn't support I/O sizes greater than /sys/block/nvmeXnY/queue/max_hw_sectors_kb. Usually the block layer handles I/O splitting for any request larger than this limit, but that is not the case for char devices. To handle this, add support for I/O splitting based on a user-specified maximum data transfer limit, "rust_raw_block.max_data_transfer_size". If not specified, the commands will be automatically split based on the queue max_hw_sectors_kb limit.
  Added comprehensive test suite for uring_command. Expanded the raw block l2_adapter tests.
  Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com> Signed-off-by: Daegyu Han <daegyu94.han@samsung.com> Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
* doc: Add missing io_uring and use_uring_cmd docs
  Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
* fix (rawblock): callback for succeeded keys before raising error
  Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
---------
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com> Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> Signed-off-by: Daegyu Han <daegyu94.han@samsung.com>
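
The I/O splitting described in this commit can be sketched as follows. The actual implementation is in Rust and would also need to respect LBA alignment; this Python sketch only shows the chunking arithmetic, and the function name `split_io` is hypothetical.

```python
from typing import List, Tuple


def split_io(offset: int, length: int, max_transfer: int) -> List[Tuple[int, int]]:
    """Split one request into (offset, size) chunks no larger than max_transfer."""
    if max_transfer <= 0:
        raise ValueError("max_transfer must be positive")
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    chunks: List[Tuple[int, int]] = []
    end = offset + length
    while offset < end:
        size = min(max_transfer, end - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


# A 10-byte request with a 4-byte limit becomes three sub-requests.
chunks = split_io(0, 10, 4)
```

The chunks are contiguous and cover the request exactly once, so submitting them in order reproduces the original transfer.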

…instead of inferring from cache_config.block_size (LMCache#3616) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

Signed-off-by: Rui Zhang <zrfishnoodles@gmail.com>

Signed-off-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com> Co-authored-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com>

* [cli] add quota management commands (set/get/list/delete)
  Signed-off-by: idellzheng <idellzheng@tencent.com>
* checkstyle error fix
  Signed-off-by: idellzheng <idellzheng@tencent.com>
---------
Signed-off-by: idellzheng <idellzheng@tencent.com>

…h) (LMCache#3621) Signed-off-by: deng451e <838677410@qq.com>

Signed-off-by: aeon-x <talexcao@gmail.com>

…he#3404) * feat: add Google Cloud Bigtable remote storage connector - Integrate thread-safe gRPC AsyncPQExecutor for Point-Reads and Batched Mutations - Implement numerical-precision partial chunk reshaping falling back to CPU recompute - Incorporate 20MB MutateRow request payload thresholds and 10s TTLCache local shielding - Include 100% portable dynamic gapic mock-isolated PyTest unit test coverage - Append comprehensive user guide and integration documentation - Add google-cloud-bigtable SDK to common requirements to resolve upstream GitHub Actions CI ImportErrors - Implement graceful FileNotFoundError fallback for credentials_path to ensure 100% Buildkite K3 CI pipeline resilience - Wrap all internal logger and warning strings to strictly comply with LMCache's 88-character ruff cap Signed-off-by: An Nguyen <annenguyen@google.com> Y * update docs & address comment Signed-off-by: An Nguyen <annenguyen@google.com> * add bigtable bench Signed-off-by: An Nguyen <annenguyen@google.com> * fix mypy type checking for bigtable connector mock namespace Signed-off-by: An Nguyen <annenguyen@google.com> * fix: make bigtable max_chunk_size_mb default value consistent and add unit tests Signed-off-by: An Nguyen <annenguyen@google.com> * test: Add Bigtable Emulator integration tests and fix remove_sync context issues Signed-off-by: An Nguyen <annenguyen@google.com> * feat(storage): cache TableAsync and optimize remove_sync to fire-and-forget in Bigtable connector - Lazily initialize and cache TableAsync in BigtableConnector to prevent memory registry leaks. - Optimize remove_sync to be fire-and-forget to avoid blocking caller thread on the critical path, yielding a ~20% throughput improvement. - Update unit tests to poll-wait for background deletion task before asserting. 
Signed-off-by: An Nguyen <annenguyen@google.com> * test: skip/reuse Bigtable emulator in CI, resolve mock pollution, and fix style lints Signed-off-by: An Nguyen <annenguyen@google.com> --------- Signed-off-by: An Nguyen <annenguyen@google.com>

Signed-off-by: ApostaC <yihua@tensormesh.ai>

…mba models (LMCache#3645) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

Signed-off-by: aeon-x <talexcao@gmail.com>

…notable speedup (LMCache#3591)
* Perf: optimize Python fallback block transfer for 3x speedup
  - Optimize fallback block-id and D2H staging overhead
  - Restructure per-layer transfer loops to iterate over objects first then layers
  Signed-off-by: Tony Lin <tony.lin@intel.com>
* apply gemini's suggestion
  Signed-off-by: Tony Lin <tony.lin@intel.com>
* optimize flash_infer block transfer paths in python fallback
  Signed-off-by: Tony Lin <tony.lin@intel.com>
---------
Signed-off-by: Tony Lin <tony.lin@intel.com>

Signed-off-by: deng451e <838677410@qq.com>

…e#3647) Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* ci: add cpu device e2e test Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci: alias vllm-cpu-nightly dist-info as vllm to fix CLI version lookup Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci: macOS job also installs vllm-cpu-nightly from PyPI Drop the in-CI git+url build, drop the manual pip cache step (now handled by setup-python's cache: pip), reuse the same dist-info alias trick as ubuntu so importlib.metadata.version('vllm') works. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci: tag aliased vllm dist-info with +cpu so platform plugin activates vllm.platforms.cpu_platform_plugin() decides whether the CPU platform is available by checking 'cpu' in importlib.metadata.version('vllm'). Our build script strips the +cpu local label before upload (PyPI rejects local versions), so the alias version was just a date string without 'cpu', making the plugin return None and 'vllm serve' fail with 'Failed to infer device type'. Re-tag the alias copy with +cpu; the original dist-info is untouched. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci(buildkite): use unsafe-best-match for uv when installing vllm-cpu-nightly uv's default first-index strategy locked setuptools to whatever the pytorch CPU index serves (<=70.2.0), so vllm-cpu-nightly's pinned setuptools==80.10.2 could not be satisfied. 
Tell uv to consider the full cross-index version pool just for this install. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci(buildkite): alias vllm-cpu-nightly dist-info to vllm with +cpu tag vllm CLI calls importlib.metadata.version('vllm'), but our wheel registers as vllm-cpu-nightly so the lookup raises PackageNotFoundError and 'vllm serve' dies. Same fix already applied in the GH Actions cpu_device.yml — port it to the buildkite script. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci(buildkite): drop stale CpuCacheContext check in handle transport verify handle (server-side copy) now goes through ShmTransferStrategy after the non_gpu_transfer refactor; CpuCacheContext is no longer instantiated on this path. Match on the actual log line 'Using shm non-GPU transfer strategy' instead. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci(buildkite): verify transport via worker 'Creating transfer context' line Step 5.5 transport-mode verification was checking server-side strategy strings, but for handle mode the worker enters HandleTransferContext which goes through gpu_transfer.py, not non_gpu_transfer.py - so the shm strategy line never shows up. Switch to grepping the worker's own 'Creating transfer context (device_type=*, mode=*)' log line, which is the single source of truth for which TransferContext got created. Also split the previously-conflated 'auto' and 'handle' branches: on CPU, auto falls back to DataTransferContext, while handle stays as HandleTransferContext. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci(buildkite): grep handle/auto transport verify in vllm log, not lmcache log Worker is a child of vllm serve, so its 'Creating transfer context' line goes to VLLM_LOG (vllm stdout), not LMCACHE_LOG (lmcache server stdout). Step 5.5 was grepping the wrong file. 
Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci(cpu): factor out shared install/download scripts Pull the duplicated vLLM-CPU install + dist-info alias, lmcache CPU install and opt-125m download out of cpu_device.yml and run-cpu-e2e-validation.sh into three small scripts under .github/scripts/. Also collapse the ubuntu/macos jobs into a single matrix job, and trim a few overly long inline comments. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * ci(cpu): generalize download script; define wait_for_metric_change - Rename download_opt125m.sh -> download_model.sh and accept the repo id as a positional arg (or via MODEL_ID), so the script is reusable for other models. - Add the missing wait_for_metric_change helper that run-cpu-e2e-validation.sh has been calling but never defined; previously bash silently swallowed it via '|| true'. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * improve Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Address comment, remove SKIP_CACHE_HIT_VALIDATION env var Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Address comment, move scripts together Signed-off-by: baoloongmao <baoloongmao@tencent.com> --------- Signed-off-by: baoloongmao <baoloongmao@tencent.com>

Signed-off-by: Jinwoo Jeong <jwjeong@csl.korea.ac.kr>

Signed-off-by: aeon-x <talexcao@gmail.com>

maobaolong and others added 25 commits June 7, 2026 08:42

Added HFbucket MP (LMCache#3263)

3a45d0f

Signed-off-by: feixiangpeng <155504520+feixiangpeng@users.noreply.github.com>

[Core][MP] refactor the LMCache layer group for better compat with hy…

20cf3cd

…brid models (LMCache#3557) Signed-off-by: ApostaC <yihua@tensormesh.ai>

[GPUKVFormat]: support vLLM CPU 2-fused KV layout (LMCache#3567)

874f81b

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

[Refactor] Rename LMCacheGroupView to EngineGroupInfo (LMCache#3598)

936bb94

Signed-off-by: ApostaC <yihua@tensormesh.ai>

[Refactor] Change the report_status to be per-kernel-group in LMCache (…

bf1a215

…LMCache#3599) Signed-off-by: ApostaC <yihua@tensormesh.ai>

fix(zh_CN): correct machine translation errors in documentation (LMCa…

cb193c7

…che#3592) Signed-off-by: sonimwang <17816198144@163.com>

[CI] Improve CI stability: gemma-4 test & serde test (LMCache#3556)

ae328a6

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

Update Chinese documentation translations (LMCache#3588)

fe8fb9d

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

[Refactor] Consolidate ParallelStrategy construction in vllm_multi_pr…

068578f

…ocess_adapter (LMCache#3478) Signed-off-by: Yujie Liu <milan021007@163.com>

fix: handle NL_X_NB_NH_BS_TWO_HS in get_group_data_ptrs (LMCache#3602)

a64ef9a

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

[Core][MP] Introduce object_group_id into the ObjectKey (LMCache#3608)

36baf62

Signed-off-by: ApostaC <yihua@tensormesh.ai>

[Feat] Print LMCache startup banner in CLI and vLLM connectors (LMCac…

2cb0bc1

…he#3611) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[Doc] Auto-select model in CPU-offloading example to fit GPU (LMCache…

d4c16f8

…#3433) Signed-off-by: zhengfeihe <hezhengfei1999@gmail.com>

chore(deps): bump sphinxcontrib-mermaid from 1.2.2 to 2.0.2 (LMCache#…

566698e

…3596) Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

bench: support aligned L1 buffers for L2 adapters (LMCache#3603)

a5b7047

Signed-off-by: Javen Ke <javen@arcfra.com>

[core] Add GDS L1 tier (cuFile DMA) for MP mode (LMCache#3589)

4bbfd11

Signed-off-by: Shaoting-Feng <shaotingf@tensormesh.ai>

[Core][MP] Support Mamba/GDN hybrid models (Qwen3.5) (LMCache#3613)

65c2ae8

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

(fix) Add missing enum to GPUVKFormat (LMCache#3606)

fca2e49

github-advanced-security AI found potential problems Jun 10, 2026

Comment thread lmcache/v1/mp_coordinator/http_apis/l2_api.py Outdated

    try:
        _quota_store(request).set(cache_salt, limit_bytes)
    except ValueError as exc:
        return JSONResponse(status_code=400, content={"error": str(exc)})
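The flagged lines return `str(exc)` to the client. The thread does not show the text of the finding, so the following is only a minimal sketch of one common remedy, assuming the concern is exception text reaching the response body: log the detail on the server and return a fixed message. `InMemoryQuotaStore` and `set_quota` are hypothetical stand-ins, not LMCache's API, and the handler returns a plain `(status_code, body)` tuple rather than a `JSONResponse` so the sketch needs only the standard library.

```python
import logging

logger = logging.getLogger("l2_api_sketch")


class InMemoryQuotaStore:
    """Hypothetical stand-in for the coordinator's quota store."""

    def __init__(self) -> None:
        self._limits: dict[str, int] = {}

    def set(self, cache_salt: str, limit_bytes: int) -> None:
        if limit_bytes < 0:
            # This message may carry internal detail; it should stay server-side.
            raise ValueError(f"negative limit {limit_bytes} for salt {cache_salt!r}")
        self._limits[cache_salt] = limit_bytes


def set_quota(store, cache_salt, limit_bytes):
    """Return (status_code, body) without echoing exception text to the caller."""
    try:
        store.set(cache_salt, limit_bytes)
    except ValueError:
        # Log the full detail on the server; send the client a fixed message.
        logger.warning("rejected quota update", exc_info=True)
        return 400, {"error": "invalid quota request"}
    return 200, {"cache_salt": cache_salt, "limit_bytes": limit_bytes}
```

With this shape, a rejected request still yields a 400, but the body no longer depends on what the store put into the exception message.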

init

efa6900

Signed-off-by: aeon-x <talexcao@gmail.com>

aeon-x force-pushed the feat/l2-usage-tracking-eviction branch from 480228a to efa6900 Compare June 10, 2026 23:46

chunxiaozheng and others added 2 commits June 10, 2026 23:52

abinggo and others added 21 commits June 11, 2026 02:26

fix(v1): graceful skip on slot_mapping/token_ids desync in wait_for_s…

04ea5fb

…ave (fixes LMCache#3318) (LMCache#3325) Signed-off-by: abinggo <107740309+abinggo@users.noreply.github.com>

renaming

fc9df67

Signed-off-by: aeon-x <talexcao@gmail.com>

change naming again

f1897fe

Signed-off-by: aeon-x <talexcao@gmail.com>

[observability] blend server trace sub-spans + V3 hit-rate breakdown (L…

113043d

…MCache#3607) Signed-off-by: deng451e <838677410@qq.com>

add comments

27f4eee

Signed-off-by: aeon-x <talexcao@gmail.com>

fix info leak

3f0313f

Signed-off-by: aeon-x <talexcao@gmail.com>

Merge branch 'dev' into feat/l2-usage-tracking-eviction

05768aa

fix data race

d424ba8

Signed-off-by: aeon-x <talexcao@gmail.com>

[Core] implement per-group tokens_per_chunk and slots_per_chunk, …

8efc6a6

…instead of inferring from cache_config.block_size (LMCache#3616) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

[Misc] align MP server id with OTel service.instance.id (LMCache#3558)

1d2b3de

Signed-off-by: Rui Zhang <zrfishnoodles@gmail.com>

docs: add filesystem connector backend guide (LMCache#3534)

45c02cb

Signed-off-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com> Co-authored-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com>

Merge branch 'dev' into feat/l2-usage-tracking-eviction

0070790

[CI] cu129 images: pin vllm to the cu129 index (drop unsafe-best-matc…

045a0a9

…h) (LMCache#3621) Signed-off-by: deng451e <838677410@qq.com>

fix UTs

e115420

Signed-off-by: aeon-x <talexcao@gmail.com>

fix UT

7bdb8fd

Signed-off-by: aeon-x <talexcao@gmail.com>

[Core][MP] Optimize DSV4 store/load size (LMCache#3635)

83164de

Signed-off-by: ApostaC <yihua@tensormesh.ai>

[Recipe] Recipe update for Qwen 3.6 27B, and general guideline for ma…

08c93df

…mba models (LMCache#3645) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

fix comments

a8a21ff

Signed-off-by: aeon-x <talexcao@gmail.com>

aeon-x force-pushed the feat/l2-usage-tracking-eviction branch from 819ad30 to a8a21ff Compare June 11, 2026 23:26

aeon-x and others added 8 commits June 11, 2026 16:51

fix UT

147b235

Signed-off-by: aeon-x <talexcao@gmail.com>

[fix ]correct retrieve log label prefix -> non_shifted (LMCache#3648)

8622fa2

Signed-off-by: deng451e <838677410@qq.com>

fix(operator): emit --engine-type blend for CacheBlend engine (LMCach…

549b007

…e#3647) Signed-off-by: royyhuang <roy.y.huang@gmail.com>

fix(nixl): create storage directory if it doesn't exist (LMCache#3568)

bc245d9

Signed-off-by: Jinwoo Jeong <jwjeong@csl.korea.ac.kr>

Merge branch 'dev' into feat/l2-usage-tracking-eviction

a7d9b5a

fix comments

a2bb041

Signed-off-by: aeon-x <talexcao@gmail.com>

aeon-x wants to merge 60 commits into dev from feat/l2-usage-tracking-eviction

20 participants
