feat: fleet-wide L2 usage tracking and quota-based eviction#2
Open
aeon-x wants to merge 60 commits into
* feat: add POSIX SHM infra for CPU KV-cache IPC
- lmcache/v1/multiprocess/posix_shm.py: thin POSIX-SHM facade (shm_create_readwrite / shm_map_readwrite / shm_munmap / shm_unlink / shm_open_pool_as_mmap) routing through CPython's _posixshmem to avoid macOS EACCES and shutdown BufferError issues
- lmcache/v1/platform/cpu/shm.py: CpuShmTensorWrapper + migrate_to_shm_and_wrap for zero-copy CPU KV-cache IPC mirroring CUDA-IPC semantics
- lmcache/v1/platform/cpu/__init__.py: self-register cpu factory with platform registry
- tests/v1/multiprocess/test_posix_shm.py: unit tests for posix_shm
- tests/v1/platform/test_cpu_shm.py: unit tests for CpuShmTensorWrapper
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* address comment
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* address comment
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* assert zero storage_offset before SHM migration
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* add warning logs to swallowed exceptions in posix_shm
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
---------
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…nector __str__ (LMCache#3577)
Normalize flat/nested block_ids in flat_block_ids and connector __str__
Older vLLM connectors emit a flat list[int] for the single non-hybrid group, while newer ones use nested list[list[int]]. Make flat_block_ids and the three LMCacheMPConnectorMetadata.__str__ paths tolerate both, matching the normalization already done in expand_block_ids_to_views().
Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: feixiangpeng <155504520+feixiangpeng@users.noreply.github.com>
…gned KV reuse (LMCache#3582)
* [blend-v3] Token-level matching + per-token slot scatter for CB reuse
Match fingerprints at token stride (probe_stride=1) and scatter reused KV with the per-token slot kernel (multi_layer_kv_transfer) instead of matching/scattering at vLLM block granularity. This lets CacheBlend reuse non-block-aligned matches, the common case for real workloads where the shared body starts at an arbitrary token offset (a partial vLLM block) rather than a chunk/block boundary.
- register_rope: probe_stride = 1 (find matches at any token offset)
- cb_unified_lookup: accept non-prefix matches at any cur_st (drop the chunk-alignment filter)
- cb_retrieve_pre_computed: per-token slot scatter of the full matched range. Partial vLLM blocks are written per slot, so matched and recomputed tokens sharing a block don't conflict. Removes the block-aligned drop checks and the now-dead whole-block scatter path.
Validated on prefix-suffix-tuner (non-block-aligned by construction): ~99% suffix hit, 3.91x TTFT vs full recompute, output matches the full-recompute baseline. The slot kernel is bandwidth-bound and matches the whole-block kernel's throughput (~700 GB/s), so no scatter overhead.
Signed-off-by: deng451e <838677410@qq.com>
* [blend-v3] Vectorize V3 matcher probe; drop obsolete probe stride
Token-level matching (probe_stride=1) had turned match_sub_sequence into an O(tokens) pure-Python probe loop: ~5.7 ms at 32K context, ~7x the old block-stride cost. Replace it with a vectorized direct-address probe (numpy gather over all positions) plus a verify loop over only the surviving hits; the table is sparse (TABLE_SIZE = 2^20 >> registered chunks) so the hit set is tiny. This restores the base class's vectorization that the V3 override had dropped, keeping full-hash collision rejection.
Probe stride is now obsolete (we always scan every position), so the _probe_stride field, ctor arg, and register_rope assignment are removed.
Matcher microbench (CPU, per lookup): 32K ctx 5.66 -> 0.83 ms (~7x), 20K 3.43 -> 0.52, 8K 1.39 -> 0.23, which is back to the pre-token-scatter block-stride baseline with full token-level matching. All 20 test_optimized_lookup_v3 tests pass.
Signed-off-by: deng451e <838677410@qq.com>
* update
Signed-off-by: deng451e <838677410@qq.com>
* update stale docstring
Signed-off-by: deng451e <838677410@qq.com>
---------
Signed-off-by: deng451e <838677410@qq.com>
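The probe-then-verify pattern described in the second commit above can be sketched roughly as follows. This is a minimal illustration and not the LMCache matcher: the weighted-sum fingerprint, the table layout, and the names `fingerprints`, `build_table`, and `match` are all assumptions made for the example; only `TABLE_SIZE = 2^20` and the "gather over all positions, then verify the few survivors" idea come from the commit message.

```python
import numpy as np

TABLE_SIZE = 1 << 20  # sparse direct-address table size, as quoted in the commit


def fingerprints(tokens: np.ndarray, window: int) -> np.ndarray:
    """Hypothetical fingerprint: weighted sum of every length-`window` slice."""
    views = np.lib.stride_tricks.sliding_window_view(tokens, window)
    weights = np.arange(1, window + 1, dtype=np.int64)
    return (views.astype(np.int64) * weights).sum(axis=1)


def build_table(chunks, window: int) -> np.ndarray:
    """Map slot -> chunk id (-1 means empty), keyed by each chunk's first window."""
    table = np.full(TABLE_SIZE, -1, dtype=np.int64)
    for cid, chunk in enumerate(chunks):
        fp = int(fingerprints(chunk[:window], window)[0])
        table[fp % TABLE_SIZE] = cid
    return table


def match(tokens: np.ndarray, chunks, table: np.ndarray, window: int):
    """Return (position, chunk_id) for every verified match, at token stride."""
    fps = fingerprints(tokens, window)
    slots = table[fps % TABLE_SIZE]      # one vectorized gather over all positions
    hits = np.nonzero(slots >= 0)[0]     # the table is sparse, so few survive
    matches = []
    for pos in hits:                     # Python loop only over surviving hits
        cid = int(slots[pos])
        chunk = chunks[cid]
        # Full comparison rejects fingerprint collisions.
        if np.array_equal(tokens[pos:pos + len(chunk)], chunk):
            matches.append((int(pos), cid))
    return matches
```

The point of the split is that the per-position work is a single NumPy gather, while the Python-level loop only runs over the handful of positions whose slot is occupied.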
…brid models (LMCache#3557) Signed-off-by: ApostaC <yihua@tensormesh.ai>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Signed-off-by: ApostaC <yihua@tensormesh.ai>
…LMCache#3599) Signed-off-by: ApostaC <yihua@tensormesh.ai>
…che#3592) Signed-off-by: sonimwang <17816198144@163.com>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…d_dim/CLA calculation, add i18n UI (LMCache#2834)
* examples(kv_cache_calculator): add Hunyuan & DeepSeek models, UI i18n, prefer local modelconfig
Signed-off-by: KimmoZAG <995496585@qq.com>
* fix(kv_cache_calculator): use prefix match for DeepSeek V3 variants; consolidate head_dim logic
Signed-off-by: KimmoZAG <995496585@qq.com>
---------
Signed-off-by: KimmoZAG <995496585@qq.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
…ocess_adapter (LMCache#3478) Signed-off-by: Yujie Liu <milan021007@163.com>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: ApostaC <yihua@tensormesh.ai>
* feat(mp): add SHM-based NonGpuContext (server-side copy) (LMCache#3346)
* feat(mp): add SHM-based NonGpuContext (server-side copy)
Porting upstream PR LMCache#3328 (commit 2/2)
Adapted to current branch:
- non_cuda_equivalents.py changes redirected to python_ops_fallback.py (renamed)
- test_cache.py changes redirected to bench/test_cache.py (relocated)
- skipped manual registration in cli/commands/__init__.py (now uses dynamic discovery)
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* Refactor
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* address gemini review on shm + stage_block_ids
- shm.py: hook munmap+shm_unlink via weakref.finalize so mmap and SHM segments are released when migrated tensors / to_tensor views are GC-ed
- shm.py: stop using id(tensor) as registry key; clear stale entries on finalize and use a monotonic counter for SHM names so id reuse can't trigger EEXIST in shm_open(O_EXCL)
- shm.py: use numel*element_size for the wrapped tensor byte count so views of larger storages are sized correctly
- cache_context.py: reject empty/None block_ids and bound-check against block_ids_buffer_ in stage_block_ids
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* shm: guard fd/mmap with try/finally on error paths
ensure shm_create_readwrite / shm_map_readwrite never leak the fd or the mmap when ftruncate or mmap fails. also rename _nbytes to nbytes so the test can read it without poking a private attribute.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* shm: validate cached entry via weakref to defeat id reuse
Cache id(tensor)->(weakref, name) instead of id->name. Lookups verify ref() is the same tensor before reusing the cached SHM name; a stale entry left behind by a GC'd tensor whose id has since been recycled now reads as a miss instead of crashing the next migration with EEXIST. Adds inject_stale_cache_entry_for_test so the new regression test can simulate id recycling without poking module-private state.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* address comments
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
---------
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* Address comment
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
---------
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…he#3611) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
…#3433) Signed-off-by: zhengfeihe <hezhengfei1999@gmail.com>
… path (LMCache#3600)
* refactor: utilize multi_layer_block_kv_transfer ops for data transfer path
Consolidate the data transfer path by utilizing the `multi_layer_block_kv_transfer` operation. This update allows a single op to support both handle and data paths simultaneously, streamlining the underlying transfer logic.
Signed-off-by: Tony Lin <tony.lin@intel.com>
* more comments for clarity
Signed-off-by: Tony Lin <tony.lin@intel.com>
* refactor(test): skip blocks-first fused KV tests on non-CPU devices
- Apply module-level pytestmark to skip all tests in this file when torch_device_type is not 'cpu', as the blocks-first fused shape (Format 10) is currently CPU-only.
- Move pytestmark to the top of the file for better clarity and correct test execution control.
Signed-off-by: Tony Lin <tony.lin@intel.com>
* add GPUKVFormat.NL_X_NB_NH_BS_TWO_HS in python fallback path
Signed-off-by: Tony Lin <tony.lin@intel.com>
* properly handle cuda kernel's limitation
Signed-off-by: Tony Lin <tony.lin@intel.com>
* fix bug
Signed-off-by: Tony Lin <tony.lin@intel.com>
---------
Signed-off-by: Tony Lin <tony.lin@intel.com>
…3596) Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Reduce ci cpu e2e test memory request
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* use python to compute kv cache bytes so float values work
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
---------
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Javen Ke <javen@arcfra.com>
Signed-off-by: Shaoting-Feng <shaotingf@tensormesh.ai>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
    try:
        _quota_store(request).set(cache_salt, limit_bytes)
    except ValueError as exc:
        return JSONResponse(status_code=400, content={"error": str(exc)})
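For readers without the full diff, the error-mapping pattern in the reviewed snippet can be illustrated with a framework-free sketch. Everything here is hypothetical: `QuotaStore`, its non-negative check, `salt_from_path`, and `put_quota` are invented for the example and are not the PR's code. The only details taken from this page are that a `ValueError` from the store becomes an HTTP 400 with an `error` field, and (from the PR description) that the `_default` path sentinel maps to the empty-string salt.

```python
DEFAULT_SALT_SENTINEL = "_default"  # path segment standing in for the empty salt


class QuotaStore:
    """Hypothetical in-memory store; the real one is reached via _quota_store(request)."""

    def __init__(self):
        self._limits = {}

    def set(self, cache_salt, limit_bytes):
        # Assumed validation rule, chosen only to have something that raises.
        if limit_bytes < 0:
            raise ValueError("limit_bytes must be non-negative")
        self._limits[cache_salt] = limit_bytes

    def get(self, cache_salt):
        return self._limits.get(cache_salt, 0)


def salt_from_path(segment):
    """Translate the URL path segment into the salt used internally."""
    return "" if segment == DEFAULT_SALT_SENTINEL else segment


def put_quota(store, path_salt, limit_bytes):
    """Return (status_code, body), mirroring the try/except in the snippet."""
    cache_salt = salt_from_path(path_salt)
    try:
        store.set(cache_salt, limit_bytes)
    except ValueError as exc:
        return 400, {"error": str(exc)}
    return 200, {"cache_salt": cache_salt, "limit_bytes": limit_bytes}
```

Keeping validation inside the store and translating its `ValueError` at the HTTP boundary means the same checks apply to any other caller of the store.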
480228a to
efa6900
Compare
* refactor: refactor query cli
Signed-off-by: idellzheng <idellzheng@tencent.com>
* refactor: refactor trace cli
Signed-off-by: idellzheng <idellzheng@tencent.com>
* bugfix
Signed-off-by: idellzheng <idellzheng@tencent.com>
---------
Signed-off-by: idellzheng <idellzheng@tencent.com>
…end (LMCache#2418)
* [Core] Add multipath KV-cache offloading support in LMCache NIXL backend
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Address feedback: add validate_nixl_path helper function and update NixlFilePool path handling
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Addresses PR feedback for documentation, unit tests, and formatting
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Use metadata.worker_id for path sharding instead of torch.cuda.current_device()
For CPU-buffer backends (POSIX, HF3FS), initialize_allocator does not call torch.cuda.set_device(), so torch.cuda.current_device() may return 0 for all workers, defeating multipath sharding. Replace with metadata.worker_id which reliably distinguishes workers regardless of CUDA state.
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Use local_worker_id instead of worker_id for path sharding
In multi-node deployments, worker_id is the global rank which causes inconsistent path distribution across nodes. local_worker_id is the local GPU ID on the node, ensuring each node's GPUs map to paths consistently (e.g. GPU 0 -> path0, GPU 1 -> path1 on every node).
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Fix code formatting (ruff format)
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Remove redundant assert in NixlFilePool.__init__
validate_nixl_path already checks for None path with a more informative error message, making this assertion unnecessary.
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Add test_nixl_multipath.py to Buildkite unit-test ignore list
test_nixl_multipath.py imports NixlStorageConfig from nixl_storage_backend.py, which has top-level nixl C extension imports. When the nixl native bindings cannot fully load in the CI environment, this causes an ImportError during pytest collection, and --maxfail=1 immediately aborts the entire test suite. This matches the existing ignore for test_nixl_storage.py.
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Rebase NIXL multipath support to use PathSharder
- Remove validate_nixl_path method from NixlStorageConfig
- Update NixlFilePool to accept PathSharder instance
- Update createPool to use PathSharder with buffer_device
- Update tests to use PathSharder directly
This aligns with PR LMCache#2982 which centralized path sharding logic.
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Fix cursor bot issues: nixl_path validation and multipath sharding
1. Add validation to ensure nixl_path is not None
- Add assert in from_cache_engine_config to validate nixl_path
- Add assert in createPool as additional safeguard
- Prevents TypeError when PathSharder receives None value
2. Fix CPU buffer device multipath sharding issue
- Pass f'cuda:{metadata.worker_id}' to PathSharder instead of buffer_device
- Ensures proper path selection based on worker_id for by_gpu sharding
- Agent still uses correct buffer_device for memory allocation
These fixes resolve both high-severity issues identified by cursor bot.
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Add docstring and warning for nixl createPool path sharding
- Document createPool arguments/returns and path sharding behavior
- Warn when list paths contain commas (may affect sharding); PathSharder unchanged
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Resolve merge conflicts in nixl_storage_backend.py
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* fix: resolve buildkite pipeline merge conflict
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
* Fix: Restore missing use_hugepages extraction from merge conflict resolution
During the merge conflict resolution in commit 01da799, the use_hugepages extraction line was accidentally deleted. This line is part of the huge pages feature (commit a68bd0a) and is needed for the NIXL backend to properly allocate CPU memory with hugepages support.
Changes:
- Restore use_hugepages: bool field in NixlStorageConfig dataclass
- Restore use_hugepages extraction: extra_config.get("nixl_use_hugepages", False)
- Remove unused import sys from test file (auto-fixed by ruff)
Signed-off-by: Ugur Kaynar <Ugur.kaynar@dell.com>
* [Core] Fix NIXL multipath PR: lint, tests, and dead code
Make CI green for the multipath KV-cache offloading change:
- createPool: remove the duplicate `elif backend in ("OBJ","AZURE_BLOB")` branch and the unreachable `return NixlFilePool(...)` left over from a merge; collapse back to the single OBJ/AZURE_BLOB/DOCA_MEMOS object-pool branch (no behavior change: OBJ/AZURE_BLOB still get b128=False).
- NixlDynamicStorageBackend: reject a list `nixl_path` at init. The dynamic backend uses self.path directly as a single directory, and path sharding across multiple paths is only implemented for static pools. This narrows self.path to str and fixes the three mypy str|list[str] arg-type errors by failing loud instead of silently mishandling a list.
- test_nixl_doca_memos: pass the new createPool path_sharding/dst_device args (ignored for object backends) to fix the missing-positional-arg failures.
- test_nixl_posix_backend_multipath: use the valid 5-element kv_shape torch.Size([4, 2, 256, 8, 128]) like the other run()-based tests; the previous [2048, 2048] shape crashed in metadata.get_shapes().
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
* [Core] NIXL: don't require nixl_path for non-file backends
The multipath change added an unconditional `assert path is not None, "nixl_path cannot be None"` in NixlStorageConfig.from_cache_engine_config, which broke object/CPU backends that legitimately have no path (e.g. OBJ/DOCA_MEMOS); this is what caused the test_nixl_shared_pool.py failures on CI.
Remove the unconditional assert; the existing conditional check already requires a path only for the file backends that need one:
    if backend in ("GDS", "GDS_MT", "POSIX", "HF3FS"):
        assert path is not None, f"nixl_path must be provided for {backend} backend"
This restores the pre-PR behavior (path optional for object backends).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
---------
Signed-off-by: Ugur Kaynar <Ugur.Kaynar@dell.com>
Signed-off-by: Ugur Kaynar <Ugur.kaynar@dell.com>
Signed-off-by: Emine Ugur Kaynar <Ugur.Kaynar@dell.com>
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ave (fixes LMCache#3318) (LMCache#3325) Signed-off-by: abinggo <107740309+abinggo@users.noreply.github.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
…MCache#3607) Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: aeon-x <talexcao@gmail.com>
…gh) changes (LMCache#3274)
* rust: Add io_uring sync write path for checkpoint
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
* enable io_uring from raw block plugin
The io_uring changes were omitted during the MP mode integration. This commit partially adds them back. The request batching is only done for headers and payloads, since for MP mode we need to order the requests as they may send together. This will be fixed later.
Fixes: LMCache#3119
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
* Add nvme helpers to enable io_uring command
This adds the required nvme helpers for getting namespace information, lba size etc. to enable io_uring command support.
NVMe io_uring command support (io_uring_cmd) enables asynchronous, low-latency passthrough of NVMe commands directly from user space, bypassing file system and most of block layer overhead. Introduced in Linux kernel 5.19, it allows using IORING_OP_URING_CMD for raw NVMe commands, featuring big SQE (128 bytes) / CQE (32 bytes) support for larger command structures. Unlike the block device interface, it requires the nvme namespace character device (/dev/ngXnY).
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Daegyu Han <daegyu94.han@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
* rust Add io_uring command read write support
NVMe io_uring command utilizes
- big Submission Queue Entries 128 bytes, standard is 64 bytes
- and, big Completion Queue Entries 32 bytes, standard is 16 bytes.
The NVMe command is embedded within the last 80 bytes of the submission queue entry.
The io_uring worker thread has been rebased for better readability.
The nvme namespace character device doesn't support I/O sizes greater than /sys/block/nvmeXnY/queue/max_hw_sectors_kb. Usually the block layer handles I/O split for any requests larger than this limitation, which is not there for char devices. To handle this, add support for I/O splitting based on a user-specified maximum data transfer limit, "rust_raw_block.max_data_transfer_size". If not specified, the commands will be auto split based on the queue max_hw_sectors_kb limit.
Added comprehensive test suite for uring_command. Expanded the raw block l2_adpater tests.
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Daegyu Han <daegyu94.han@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
* doc: Add missing io_uring and use_uring_cmd docs
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
* fix (rawblock): callback for succeeded keys before raising error
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
---------
Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: Daegyu Han <daegyu94.han@samsung.com>
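The I/O splitting that the io_uring command commit describes (the character device rejects requests above the device limit, so the caller must split them) comes down to simple range arithmetic. The sketch below is written in Python for illustration only; the actual change is in Rust, and the function name `split_io`, its parameters, and the alignment checks are assumptions rather than the plugin's API.

```python
def split_io(offset, length, max_transfer, lba_size=512):
    """Split the byte range [offset, offset + length) into (offset, length)
    requests of at most `max_transfer` bytes each.

    Assumes offset, length, and max_transfer are multiples of the LBA size,
    since raw NVMe commands address whole logical blocks.
    """
    if max_transfer <= 0 or max_transfer % lba_size:
        raise ValueError("max_transfer must be a positive multiple of lba_size")
    if offset % lba_size or length % lba_size:
        raise ValueError("offset and length must be LBA-aligned")

    requests = []
    pos, end = offset, offset + length
    while pos < end:
        n = min(max_transfer, end - pos)  # the last request may be shorter
        requests.append((pos, n))
        pos += n
    return requests
```

With a 256 KiB limit, for example, a 1 MiB write becomes four 256 KiB commands; a transfer that is not a multiple of the limit ends with one shorter command.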
…instead of inferring from cache_config.block_size (LMCache#3616) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Rui Zhang <zrfishnoodles@gmail.com>
Signed-off-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com> Co-authored-by: Kushagra963-lab <147275307+Kushagra963-lab@users.noreply.github.com>
* [cli] add quota management commands (set/get/list/delete)
Signed-off-by: idellzheng <idellzheng@tencent.com>
* checkstyle error fix
Signed-off-by: idellzheng <idellzheng@tencent.com>
---------
Signed-off-by: idellzheng <idellzheng@tencent.com>
…h) (LMCache#3621) Signed-off-by: deng451e <838677410@qq.com>
…he#3404)
* feat: add Google Cloud Bigtable remote storage connector
- Integrate thread-safe gRPC AsyncPQExecutor for Point-Reads and Batched Mutations
- Implement numerical-precision partial chunk reshaping falling back to CPU recompute
- Incorporate 20MB MutateRow request payload thresholds and 10s TTLCache local shielding
- Include 100% portable dynamic gapic mock-isolated PyTest unit test coverage
- Append comprehensive user guide and integration documentation
- Add google-cloud-bigtable SDK to common requirements to resolve upstream GitHub Actions CI ImportErrors
- Implement graceful FileNotFoundError fallback for credentials_path to ensure 100% Buildkite K3 CI pipeline resilience
- Wrap all internal logger and warning strings to strictly comply with LMCache's 88-character ruff cap
Signed-off-by: An Nguyen <annenguyen@google.com>
* update docs & address comment
Signed-off-by: An Nguyen <annenguyen@google.com>
* add bigtable bench
Signed-off-by: An Nguyen <annenguyen@google.com>
* fix mypy type checking for bigtable connector mock namespace
Signed-off-by: An Nguyen <annenguyen@google.com>
* fix: make bigtable max_chunk_size_mb default value consistent and add unit tests
Signed-off-by: An Nguyen <annenguyen@google.com>
* test: Add Bigtable Emulator integration tests and fix remove_sync context issues
Signed-off-by: An Nguyen <annenguyen@google.com>
* feat(storage): cache TableAsync and optimize remove_sync to fire-and-forget in Bigtable connector
- Lazily initialize and cache TableAsync in BigtableConnector to prevent memory registry leaks.
- Optimize remove_sync to be fire-and-forget to avoid blocking caller thread on the critical path, yielding a ~20% throughput improvement.
- Update unit tests to poll-wait for background deletion task before asserting.
Signed-off-by: An Nguyen <annenguyen@google.com>
* test: skip/reuse Bigtable emulator in CI, resolve mock pollution, and fix style lints
Signed-off-by: An Nguyen <annenguyen@google.com>
---------
Signed-off-by: An Nguyen <annenguyen@google.com>
Signed-off-by: ApostaC <yihua@tensormesh.ai>
…mba models (LMCache#3645) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: aeon-x <talexcao@gmail.com>
819ad30 to
a8a21ff
Compare
…notable speedup (LMCache#3591)
* Perf: optimize Python fallback block transfer for 3x speedup
- Optimize fallback block-id and D2H staging overhead
- Restructure per-layer transfer loops to iterate over objects first then layers
Signed-off-by: Tony Lin <tony.lin@intel.com>
* apply gemini's suggestion
Signed-off-by: Tony Lin <tony.lin@intel.com>
* optimize flash_infer block transfer paths in python fallback
Signed-off-by: Tony Lin <tony.lin@intel.com>
---------
Signed-off-by: Tony Lin <tony.lin@intel.com>
Signed-off-by: deng451e <838677410@qq.com>
…e#3647) Signed-off-by: royyhuang <roy.y.huang@gmail.com>
* ci: add cpu device e2e test
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci: alias vllm-cpu-nightly dist-info as vllm to fix CLI version lookup
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci: macOS job also installs vllm-cpu-nightly from PyPI
Drop the in-CI git+url build, drop the manual pip cache step (now
handled by setup-python's cache: pip), reuse the same dist-info alias
trick as ubuntu so importlib.metadata.version('vllm') works.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci: tag aliased vllm dist-info with +cpu so platform plugin activates
vllm.platforms.cpu_platform_plugin() decides whether the CPU platform
is available by checking 'cpu' in importlib.metadata.version('vllm').
Our build script strips the +cpu local label before upload (PyPI
rejects local versions), so the alias version was just a date string
without 'cpu', making the plugin return None and 'vllm serve' fail
with 'Failed to infer device type'. Re-tag the alias copy with +cpu;
the original dist-info is untouched.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci(buildkite): use unsafe-best-match for uv when installing vllm-cpu-nightly
uv's default first-index strategy locked setuptools to whatever the
pytorch CPU index serves (<=70.2.0), so vllm-cpu-nightly's pinned
setuptools==80.10.2 could not be satisfied. Tell uv to consider the
full cross-index version pool just for this install.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci(buildkite): alias vllm-cpu-nightly dist-info to vllm with +cpu tag
vllm CLI calls importlib.metadata.version('vllm'), but our wheel
registers as vllm-cpu-nightly so the lookup raises PackageNotFoundError
and 'vllm serve' dies. Same fix already applied in the GH Actions
cpu_device.yml — port it to the buildkite script.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci(buildkite): drop stale CpuCacheContext check in handle transport verify
handle (server-side copy) now goes through ShmTransferStrategy after
the non_gpu_transfer refactor; CpuCacheContext is no longer
instantiated on this path. Match on the actual log line
'Using shm non-GPU transfer strategy' instead.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci(buildkite): verify transport via worker 'Creating transfer context' line
Step 5.5 transport-mode verification was checking server-side strategy
strings, but for handle mode the worker enters HandleTransferContext
which goes through gpu_transfer.py, not non_gpu_transfer.py - so the
shm strategy line never shows up. Switch to grepping the worker's own
'Creating transfer context (device_type=*, mode=*)' log line, which
is the single source of truth for which TransferContext got created.
Also split the previously-conflated 'auto' and 'handle' branches: on
CPU, auto falls back to DataTransferContext, while handle stays as
HandleTransferContext.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci(buildkite): grep handle/auto transport verify in vllm log, not lmcache log
Worker is a child of vllm serve, so its 'Creating transfer context'
line goes to VLLM_LOG (vllm stdout), not LMCACHE_LOG (lmcache server
stdout). Step 5.5 was grepping the wrong file.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci(cpu): factor out shared install/download scripts
Pull the duplicated vLLM-CPU install + dist-info alias, lmcache CPU install and opt-125m download out of cpu_device.yml and run-cpu-e2e-validation.sh into three small scripts under .github/scripts/. Also collapse the ubuntu/macos jobs into a single matrix job, and trim a few overly long inline comments.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* ci(cpu): generalize download script; define wait_for_metric_change
- Rename download_opt125m.sh -> download_model.sh and accept the repo id as a positional arg (or via MODEL_ID), so the script is reusable for other models. - Add the missing wait_for_metric_change helper that run-cpu-e2e-validation.sh has been calling but never defined; previously bash silently swallowed it via '|| true'.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* improve
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* Address comment, remove SKIP_CACHE_HIT_VALIDATION env var
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* Address comment, move scripts together
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
---------
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Jinwoo Jeong <jwjeong@csl.korea.ac.kr>
Signed-off-by: aeon-x <talexcao@gmail.com>
Summary
- Adds per-`cache_salt` L2 usage accounting, quota management, and LRU eviction to the MP coordinator
- Usage events are reported through `L2EventListener`; the coordinator aggregates usage, enforces quotas, and selects LRU keys to evict
- Reuses `QuotaManager` and `IsolatedLRUEvictionPolicy` from the distributed layer instead of reimplementing them
- Adds HTTP endpoints (`/l2/quota`, `/l2/events`, `/l2/status`), config flags (`--coordinator-l2-event-reporting`, `--coordinator-l2-event-flush-interval`), and a design doc
Key changes
- `lmcache/v1/mp_coordinator/l2/`: `L2UsageManager`, `L2EvictionManager` (wraps `IsolatedLRUEvictionPolicy` for per-salt LRU), `L2EventListener` (batching reporter)
- `QuotaManager` from `lmcache.v1.distributed.quota_manager` (allowlist semantics: unregistered salts default to 0 limit)
- `L2EvictionController`: watermark trigger (`usage >= watermark * quota`) and eviction by key count ratio (default 0.2)
- `lmcache/v1/mp_coordinator/http_apis/l2_api.py`: quota CRUD, event ingestion, combined status queries; `_default` path sentinel maps to empty-string salt
- `ObjectKey` used throughout the coordinator; `CacheKey` only at API boundary for JSON serialization
- `schemas.py`: `CacheKey`, `EventType` enum, `UsageEvent`, quota/status models
- `http_server.py`: creates event listener and registers it on storage manager when `l2_event_reporting` is enabled
- `L2AdapterListener.on_l2_keys_stored`: now passes `sizes` alongside keys
- `MPCoordinatorConfig`: added `trigger_watermark` (default 1.0), changed `eviction_ratio` default from 0.5 to 0.2
- Design doc: `docs/design/v1/mp_coordinator/l2_usage_and_eviction.md`
Test plan
- `L2UsageManager`, `L2EvictionManager` (LRU ordering, eviction ratio, watermark trigger, multi-salt independence, no-quota/zero-quota eviction)
- `_default` salt sentinel
- Behavior when `--coordinator-l2-event-reporting` is set
🤖 Generated with Claude Code
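
The watermark trigger and ratio-based eviction listed under Key changes can be illustrated with a small self-contained sketch. It is not the PR's `L2EvictionController`: the class `QuotaEvictionSketch`, its method names, and the choice to round the eviction count up with a minimum of one key are assumptions made for the example. What it takes from the description is the trigger condition `usage >= watermark * quota`, eviction of a fraction of keys by count (default 0.2), per-salt LRU ordering, and allowlist semantics in which an unregistered salt has a limit of 0.

```python
import math
from collections import OrderedDict
from itertools import islice


class QuotaEvictionSketch:
    """Per-salt usage tracking with watermark-triggered LRU eviction (illustrative)."""

    def __init__(self, trigger_watermark=1.0, eviction_ratio=0.2):
        self.trigger_watermark = trigger_watermark
        self.eviction_ratio = eviction_ratio
        self.quotas = {}  # salt -> limit in bytes; a missing salt means limit 0
        self.keys = {}    # salt -> OrderedDict(key -> size), least recent first
        self.usage = {}   # salt -> total bytes stored

    def set_quota(self, salt, limit_bytes):
        self.quotas[salt] = limit_bytes

    def on_store(self, salt, key, size):
        lru = self.keys.setdefault(salt, OrderedDict())
        old = lru.pop(key, 0)  # re-storing a key refreshes its recency
        lru[key] = size
        self.usage[salt] = self.usage.get(salt, 0) + size - old

    def on_access(self, salt, key):
        lru = self.keys.get(salt)
        if lru is not None and key in lru:
            lru.move_to_end(key)

    def select_evictions(self, salt):
        """Return the least recently used keys to evict, or [] below the watermark."""
        lru = self.keys.get(salt)
        if not lru:
            return []
        limit = self.quotas.get(salt, 0)  # allowlist: unregistered salt -> 0
        if self.usage.get(salt, 0) < self.trigger_watermark * limit:
            return []
        # Assumed rounding: evict at least one key, rounding the ratio up.
        count = max(1, math.ceil(len(lru) * self.eviction_ratio))
        return list(islice(lru, count))
```

Because each salt has its own ordered map, eviction for one salt never touches another salt's keys, which is the multi-salt independence the test plan mentions.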