feat(storage): add write-back to local CPU for non-blocking get paths #3
Draft
jooho-XCENA wants to merge 182 commits into
00642cc to 989e036
- get_non_blocking: add a done callback that writes fetched data back to LocalCPUBackend, matching existing get() behavior
- prefetch_single_done_callback: write prefetched data back to LocalCPUBackend after the async prefetch completes

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
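The done-callback pattern described in this commit can be sketched roughly as follows. This is a minimal illustration built on `concurrent.futures`, not the code from the PR: `LocalCPUStub`, its `submit_put` method, and the `fetch` callable are hypothetical stand-ins for LocalCPUBackend and the remote fetch, and the signature of `get_non_blocking` here is an assumption.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class LocalCPUStub:
    """Hypothetical stand-in for LocalCPUBackend; only records writes."""

    def __init__(self) -> None:
        self.store: Dict[str, bytes] = {}

    def submit_put(self, key: str, value: bytes) -> None:
        self.store[key] = value


def get_non_blocking(
    executor: Executor,
    fetch: Callable[[str], Optional[bytes]],
    key: str,
    local_cpu: LocalCPUStub,
) -> Future:
    """Start an async fetch and write the result back to the local CPU tier."""
    future = executor.submit(fetch, key)

    def _write_back(fut: Future) -> None:
        # An exception raised inside a Future callback is only logged by the
        # executor, so failures are handled explicitly here.
        try:
            value = fut.result()
        except Exception:
            return  # fetch failed: nothing to cache
        if value is not None:
            local_cpu.submit_put(key, value)

    future.add_done_callback(_write_back)
    return future
```

Note that `Future.result()` may return in the caller before the callback has finished running in the worker thread, so a caller that needs the write-back to be visible has to synchronize separately (for example by shutting the executor down).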
Align error handling with prefetch_single_done_callback for consistency. Prevents unhandled exceptions in Future callbacks. Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
989e036 to e419eaf
Align with existing get() and batched_get() which exclude MaruBackend from write-back to LocalCPUBackend. Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
* add new workload to cli bench Signed-off-by: deng451e <838677410@qq.com>
…che#2922) Add a top-level `gds_path_sharding` config field (default: "by_gpu") that controls how GPUs are assigned to storage paths when multiple comma-separated paths are provided in `gds_path`. This replaces the previously hardcoded by_gpu logic with an explicit, extensible setting. Currently only "by_gpu" is supported (selects path via `device_id % num_paths`); unsupported values raise AssertionError. Generated with [Devin](https://cli.devin.ai/docs) Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com> Co-authored-by: Devin <noreply@cognition.ai>
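As a rough sketch of the "by_gpu" rule described above (`device_id % num_paths`), the selection could look like the following. The function name `select_gds_path` and its signature are hypothetical; only the config field names, the default value, the modulo rule, and the AssertionError on unsupported values come from the commit message.

```python
def select_gds_path(gds_path: str, device_id: int, sharding: str = "by_gpu") -> str:
    """Pick one storage path for a GPU from a comma-separated list."""
    paths = [p.strip() for p in gds_path.split(",") if p.strip()]
    assert paths, "gds_path must contain at least one path"
    # Only "by_gpu" is supported; anything else fails fast.
    assert sharding == "by_gpu", f"unsupported gds_path_sharding: {sharding}"
    return paths[device_id % len(paths)]
```

With two paths, GPUs 0 and 2 map to the first path and GPUs 1 and 3 to the second.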
…he#2949) * [Feat]: Add environment variable support for RESP adapter auth Support LMCACHE_RESP_USERNAME, LMCACHE_RESP_PASSWORD, LMCACHE_RESP_HOST, and LMCACHE_RESP_PORT environment variables in both MP and non-MP modes. Env vars are read inside the adapter at creation time so credentials are never stored in the config object or printed in startup logs. Signed-off-by: Samuel Shen <slshen@tensormesh.ai> * [Feat]: Fix env var precedence and add unit tests for RESP env vars Change precedence so config/CLI args override env vars (env vars serve as defaults). Add unit tests for the precedence logic in both MP and non-MP modes. Signed-off-by: Samuel Shen <slshen@tensormesh.ai> --------- Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
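The precedence rule from the second commit (config/CLI values win, environment variables act as defaults) can be illustrated with a small helper. This is only a sketch: `resolve_resp_setting` and its parameters are hypothetical, and only the `LMCACHE_RESP_*` variable names come from the commit message.

```python
import os
from typing import Mapping, Optional


def resolve_resp_setting(
    name: str,
    config_value: Optional[str],
    env: Optional[Mapping[str, str]] = None,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the config value if set, else the LMCACHE_RESP_<NAME> env var."""
    if config_value is not None:
        return config_value  # explicit config/CLI argument overrides env
    env = os.environ if env is None else env
    return env.get(f"LMCACHE_RESP_{name.upper()}", default)
```

Reading the environment at call time rather than copying it into a config object is what keeps credentials out of the config and the startup logs.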
) * Refactor: Auto-align pd_buffer_size down to nearest chunk size multiple - Add buffer size alignment logic to prevent assertion error - Calculate aligned_buffer_size as (origin_size // chunk_size) * chunk_size - Add informative logging when buffer size is adjusted - Release excess buffer memory that can't be aligned - Follows the same pattern as local_cpu_backend.py Signed-off-by: Tony Lin <tony.lin@intel.com> * refine the code per Gemini's suggestions Signed-off-by: Tony Lin <tony.lin@intel.com> * refine log msg Signed-off-by: Tony Lin <tony.lin@intel.com> * streamline pd backend buffer alignment Signed-off-by: Tony Lin <tony.lin@intel.com> * Fix test hang by adding backend cleanup Signed-off-by: Tony Lin <tony.lin@intel.com> * remove UT Signed-off-by: Tony Lin <tony.lin@intel.com> * add UT Signed-off-by: Tony Lin <tony.lin@intel.com> * doc update Signed-off-by: Tony Lin <tony.lin@intel.com> * rename function name Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com>
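The alignment formula quoted in this commit, `(origin_size // chunk_size) * chunk_size`, rounds the buffer size down to a whole number of chunks. A minimal sketch, with the hypothetical helper name `align_down`:

```python
def align_down(origin_size: int, chunk_size: int) -> int:
    """Round origin_size down to the nearest multiple of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (origin_size // chunk_size) * chunk_size
```

For example, a 1000-byte buffer with 256-byte chunks aligns down to 768 bytes, leaving 232 bytes of excess to release.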
) * [Chore] Add CODEOWNERS for automated PR review assignments Signed-off-by: Samuel Shen <slshen@uchciago.edu> * [Chore] Add sammshen to resp L2 adapter ownership Signed-off-by: Samuel Shen <slshen@uchciago.edu> * [Chore] Add sammshen to csrc/storage_backends and native connector L2 adapters Signed-off-by: Samuel Shen <slshen@uchciago.edu> * [Chore] Add YaoJiayi to L2 eviction ownership Signed-off-by: Samuel Shen <slshen@uchciago.edu> * [Chore] Add OasisGit to multiprocess and http_server ownership Signed-off-by: Samuel Shen <slshen@uchciago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
- Add type hints to _write_back closure (Future, CacheEngineKey) - Update prefetch_single_done_callback docstring to reflect write-back behavior Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
…#2958) chore(ci): push nightly baselines to LMCache-CI repo The GitHub PAT for the main repo expired, causing nightly baseline uploads to fail. Switch the upload target to the dedicated LMCache/LMCache-CI repository instead of pushing to benchmarks-main on the main LMCache repo. Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Move write-back logic from prefetch_single_done_callback to prefetch_all_done_callback to avoid caching non-contiguous chunks. When a middle tier partially fails, subsequent tiers' chunks break prefix continuity and are discarded by prefetch_all_done_callback. Previously, prefetch_single_done_callback would have already cached those invalid chunks. Now write-back only happens after prefix continuity is validated, ensuring only valid chunks are cached. Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
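The prefix-continuity rule can be sketched as below. This is an illustrative model, not the PR's implementation: the per-tier `(expected_count, retrieved_keys)` representation and the name `valid_prefix` are assumptions, and it further assumes each tier's retrieved keys form a prefix of what was requested from that tier.

```python
from typing import List, Sequence, Tuple


def valid_prefix(tier_results: Sequence[Tuple[int, List[str]]]) -> List[str]:
    """Keep chunks up to and including the first tier that came back short.

    tier_results holds one (expected_count, retrieved_keys) pair per tier,
    ordered by position in the token prefix.
    """
    kept: List[str] = []
    for expected, retrieved in tier_results:
        kept.extend(retrieved)
        if len(retrieved) < expected:
            break  # a gap follows: later tiers' chunks are not contiguous
    return kept
```

Only the keys returned by such a check would be eligible for write-back; chunks from tiers after the gap are never cached.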
Add source-backend filtering to prefetch_all_done_callback write-back, matching the sync paths (get, batched_get). Chunks from LocalCPUBackend, PDBackend, and MaruBackend are now skipped during write-back, avoiding redundant re-submission. Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
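A hedged sketch of the source-backend filter follows. The three skipped backend names come from the commit message; the function name, the flat per-chunk `tier_backend_names` list, and the key type are assumptions made for illustration.

```python
from typing import List, Sequence

# Backends whose chunks are skipped during write-back, per the commit message.
SKIP_WRITE_BACK = frozenset({"LocalCPUBackend", "PDBackend", "MaruBackend"})


def chunks_to_write_back(
    keys: Sequence[str], tier_backend_names: Sequence[str]
) -> List[str]:
    """Return the keys whose source backend is eligible for write-back."""
    return [
        key
        for key, backend in zip(keys, tier_backend_names)
        if backend not in SKIP_WRITE_BACK
    ]
```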
Add parameter descriptions and write-back behavior documentation to reflect the new tier_backend_names parameter and write-back logic. Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
* Refactor remote plugin to accept multiple connectors. Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Skip module_path/class_name check for built-in adapters Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Add related documentation Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Fix to use DynamicConnectorAdapter to load external connector plugin Signed-off-by: baoloongmao <baoloongmao@tencent.com> --------- Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…MCache#2926) * multiprocess: support per-group KV cache transfer with group_idx - gpu_ops: add group_idx param to lmcache_memcpy_async_h2d/d2h, use memory_obj.get_tensor(group_idx) instead of memory_obj.tensor - kv_layer_groups: add build_kv_layer_groups_from_list() to group layers by (shape, dtype) from a plain tensor list - gpu_context: introduce per-group shape_descs_, hidden_dim_sizes_, group_kv_pointers_, and tmp_gpu_buffers_; update get_kv_buffer_shape, get_tmp_gpu_buffer, get_tmp_gpu_buffer_batched to accept group_idx; add get_shape_desc(group_idx) and get_group_kv_pointers(group_idx) - server: update get_layout_desc, _store_loop, _retrieve_loop to iterate over all groups; fix skip_tokens_in_chunk upper bound to use batch_len instead of _BATCH_SIZE Signed-off-by: liuyumoye <adeline_ly2023@outlook.com> * fix: support vectorized KV transfer for non-16B-aligned head sizes Add scalar type fallback hierarchy for block KV transfer kernel: head_bytes % 16 == 0 -> uint4 (16B, fastest) head_bytes % 4 == 0 -> uint32_t (4B) head_bytes % 2 == 0 -> uint16_t (2B) This fixes the runtime error for MLA models where head_size=132 (uint8), giving head_bytes=132 which is not divisible by 16 but is divisible by 4. Signed-off-by: liuyumoye <adeline_ly2023@outlook.com> --------- Signed-off-by: liuyumoye <adeline_ly2023@outlook.com> Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
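The scalar-type fallback hierarchy in the second commit amounts to picking the widest load width that divides `head_bytes`. The host-side selection can be sketched in Python as below; the function name is hypothetical, and the final 1-byte fallback for odd sizes is an assumption, since the commit message lists only the 16, 4, and 2 byte cases.

```python
def pick_vector_bytes(head_bytes: int) -> int:
    """Return the widest supported vector width (in bytes) dividing head_bytes."""
    # 16 B corresponds to uint4, 4 B to uint32_t, 2 B to uint16_t.
    for width in (16, 4, 2):
        if head_bytes % width == 0:
            return width
    return 1  # assumed byte-wise fallback; not stated in the commit
```

For the MLA case mentioned above, `head_bytes=132` is not divisible by 16 but is divisible by 4, so the 4-byte path is chosen.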
12.9 Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
…Cache#2847) Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
LMCache#2801) feat(disk): support multi-path local disk backend with path sharding Allow `local_disk` to accept comma-separated paths (e.g. "/mnt/nvme0/,/mnt/nvme1/") to use multiple NVMe devices. Each GPU worker selects one path at init time via the `local_disk_path_sharding` strategy (currently only "by_gpu": device_id % num_paths), matching the GDS backend approach LMCache#2817 and NIXL approach LMCache#2418. - Path selected once in __init__; _key_to_path, write_file, read_file unchanged from upstream - _parse_local_disk now uses startswith("file://") instead of regex, fixing file:// URIs without a trailing slash - All directories created at startup - Added local_disk_path_sharding config field (default: "by_gpu") - Added tests and updated docs Before this change the only way to increase performance was to use any of the linux multi-pathing technologies to aggregate IOs. Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
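A rough sketch of prefix-based parsing for `local_disk`, including a `file://` URI without a trailing slash. The function name `parse_local_disk` here is a hypothetical stand-in (the real `_parse_local_disk` may differ in signature and return type); only the `startswith("file://")` check and the comma-separated format come from the commit message.

```python
from typing import List


def parse_local_disk(local_disk: str) -> List[str]:
    """Split a comma-separated local_disk value into filesystem paths."""
    paths: List[str] = []
    for raw in local_disk.split(","):
        raw = raw.strip()
        if not raw:
            continue
        # A plain prefix check handles URIs with or without a trailing slash.
        if raw.startswith("file://"):
            raw = raw[len("file://"):]
        paths.append(raw)
    return paths
```

A worker would then pick `paths[device_id % len(paths)]` once at init time, following the "by_gpu" strategy.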
vLLM nightly now requires PyTorch 2.11.0 which is built against CUDA 13.0. Update the CI base image to match. Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
document long-doc-permutator workload Signed-off-by: deng451e <838677410@qq.com>
* update csrc to support native launch host func * add deadlock ci test Signed-off-by: ApostaC <yihua98@uchicago.edu>
* fix typo bug Signed-off-by: princepride <wangzhipeng628@gmail.com> * fix: rename hidden_dim_size to hidden_dim_sizes in describe and server Align with the rename introduced in LMCache#2926 where hidden_dim_size was changed to hidden_dim_sizes (List[int]) to support kv_groups. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: princepride <wangzhipeng628@gmail.com> * fix: update test fixture to use hidden_dim_sizes key Update test fixture and assertion in test_describe.py to match the hidden_dim_size -> hidden_dim_sizes rename from LMCache#2926. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: princepride <wangzhipeng628@gmail.com> --------- Signed-off-by: princepride <wangzhipeng628@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* pin version * pin cu128 wheel Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: idellzheng <idellzheng@tencent.com>
* update prometheus version to fix ut Signed-off-by: ApostaC <yihua98@uchicago.edu> * fix otel sdk version Signed-off-by: ApostaC <yihua98@uchicago.edu> --------- Signed-off-by: ApostaC <yihua98@uchicago.edu>
Changing office hours from Thursdays to Wednesdays Signed-off-by: Nicolas (Nick) Barcet <nijaba@tensormesh.ai>
…V-cache (LMCache#3195) Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
…3194) Reuses the existing L2 long_doc_qa step instead of paying for a second model load. Two changes to that script: 1. Bump --metrics-sample-rate to 1.0 on the L2 relaunch so the histograms record on every event. The default 0.01 would leave them empty in this short workload and flake the assertions. 2. After the existing L2 data-flow checks, add a "Step 5" block that asserts every metric we publish from MP mode actually advances: - newer counters with label dimensions advance > 0 and carry the expected label (l2_store_completed/l2_load_completed by l2_name, lookup_requested_tokens/lookup_hit_tokens by model_name, num_chunks_loaded by worker_id) - the four throughput histograms record at least one observation (lmcache_mp_l0_l1_*_throughput_gbs and lmcache_mp_l2_*_throughput_gbs) The label-presence check catches the case where a counter fires but the attribute plumbing broke — e.g. a future refactor that drops the attribute at emit time but still ticks the counter. Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix: missing lock in HFBucketConnector.close() when clearing metadata cache Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Sangyoon Kwon <syk0905.kwon@samsung.com>
* [Feat]: Implement batch operations in MooncakeConnector for improved efficiency and error handling Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn> * [Test]: isolate Mooncake RDMA adapter integration test Close the default TCP adapter before creating the RDMA adapter in the buffer-backed Mooncake integration test so Mooncake master does not allocate test replicas on a TCP segment. Also use the native `rdma_devices` config key expected by Mooncake. Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn> * [Fix]: Treat Mooncake exists errors as misses Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn> * [Feat]: Implement delete operations in MooncakeConnector Add do_single_delete and do_batch_delete to the Mooncake storage backend, with integration tests covering key deletion, mixed existing/missing batch deletes, and usage tracking updates. Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn> * [Fix]: resolve ruff F841 and minor formatting cleanups Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn> --------- Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn> Co-authored-by: maobaolong <baoloongmao@tencent.com>
* Add DAX L2 adapter for MP mode Signed-off-by: DongDongJu <commisori28@gmail.com> Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> * Document DAX MP adapter APIs Signed-off-by: DongDongJu <commisori28@gmail.com> Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> * Use global eviction for DAX storage Remove DAX core's internal victim selection so slot pressure is handled by LMCache's global MP L2 eviction controller. Update DAX tests to cover full-arena behavior, slot-based cache_salt accounting, and StorageManager-driven L2 eviction. Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> Signed-off-by: DongDongJu <commisori28@gmail.com> --------- Signed-off-by: DongDongJu <commisori28@gmail.com> Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> Signed-off-by: Dongjoo Seo <commisori28@gmail.com>
…Cache#3159) Signed-off-by: ApostaC <yihua98@uchicago.edu>
…che#3211) Signed-off-by: elliotz <elliot@character.ai>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
…ndpoints in TP=1 non-MP mode (LMCache#3146) * fix(LMCache#3104): use per-instance FastAPI app to fix 503 on cache endpoints in TP=1 non-MP mode Signed-off-by: baoloongmao <baoloongmao@tencent.com> * fix Signed-off-by: baoloongmao <baoloongmao@tencent.com> --------- Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* Add support for AZURE_BLOB NIXL backend The NIXL plugin uses Azure Blob Storage as an object store backend instead of S3. It is designed as a drop-in replacement for the OBJ backend, behaving functionally the same and differing only in its required configuration. Currently, it only supports CPU to object store offloading; there is no GPU direct support. Most of the work was allow-listing the AZURE_BLOB plugin in code paths where the OBJ plugin was configured. Specifically, updated the following LMCache interfaces to support AZURE_BLOB: * KV cache offloading with the NIXL storage backend for both static and dynamic pools * L2 storage for nixl_store. Note it was not added to nixl_store_dynamic because the OBJ plugin was not supported there either Signed-off-by: Kyle Knapp <kyleknapp@microsoft.com> * Fix indent in azure config sample Signed-off-by: Kyle Knapp <kyleknapp@microsoft.com> --------- Signed-off-by: Kyle Knapp <kyleknapp@microsoft.com>
LMCache#3174) Signed-off-by: ApostaC <yihuac@vllm.ai>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: idellzheng <idellzheng@tencent.com>
…che#3092) * [ROCm] Add Triton block-sparse attention backend for CacheBlend Adds LMCTritonSparseBackend as a drop-in replacement for LMCFlashInferSparseBackend that works on both CUDA and ROCm via Triton kernels (no flashinfer dependency). Signed-off-by: Andy Luo <andyluo7@users.noreply.github.com>
…he#3185) Signed-off-by: baoloongmao <baoloongmao@tencent.com>
ci(k3-unit-tests): route unit job to k8s queue The k3-unit-tests pipeline-level config targets the k8s queue for the upload step, but the inner job spec still pinned agents.queue to k3-h200-local. With no agents on k3-h200-local, every spawned unit-test job sat indefinitely as 'waiting'. Align with the other k3 pipelines (blend, multiprocess, integration, comprehensive, correctness) which all run on k8s. Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
ci(comprehensive/pd): write prefiller/decoder/proxy logs to repo root Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
* feat(pd_backend): fully async PD KV transfer backend Replace sync PDBackend with async implementation: - Non-blocking transfer: batched_submit_put_task returns immediately (fire-and-forget enqueue) instead of blocking vLLM worker thread until remote alloc + RDMA write complete - Event-driven flow control: replace time.sleep busy-wait polling with Condition-based notification, waking immediately when resources are freed - Self-contained resource release: remove() internally calls ref_count_down() and decrements inflight counter, eliminating caller responsibility for manual cleanup. cache_engine.py updated with _is_sync_pd_backend() guard to prevent double-free - Startup capacity validation: new pd_max_prefill_len config raises ValueError at init if buffer cannot hold the max prefill length, catching misconfiguration before runtime - Configurable timeouts: pd_allocation_timeout_sec, pd_shutdown_timeout_sec, pd_condition_poll_interval_sec replace scattered hardcoded constants - Backward compatible: split into pd_backend.py (sync) and pd_backend_async.py (async), selectable per-instance via pd_backend_mode config (default: "async"). Sync and async instances can coexist in the same cluster — e.g. sender on async while receiver on sync, or vice versa — with no wire protocol incompatibility Signed-off-by: Tony Lin <tony.lin@intel.com>
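The event-driven flow control described above (Condition-based notification instead of `time.sleep` polling) can be illustrated with a small in-flight limiter. This is a generic sketch using `threading.Condition`, not the PDBackend code: the class `InflightLimiter` and its methods are hypothetical.

```python
import threading
from typing import Optional


class InflightLimiter:
    """Bound the number of in-flight transfers without busy-waiting."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._inflight = 0
        self._cond = threading.Condition()

    def acquire(self, timeout: Optional[float] = None) -> bool:
        """Block until a slot is free; return False if the timeout expires."""
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._inflight < self._capacity, timeout=timeout
            )
            if ok:
                self._inflight += 1
            return ok

    def release(self) -> None:
        """Free a slot and wake one waiter immediately."""
        with self._cond:
            self._inflight -= 1
            self._cond.notify()
```

A waiter wakes as soon as `release()` is called rather than at the next polling tick, which is the latency benefit the commit describes.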
What this PR does / why we need it:
Special notes for your reviewers:
If applicable: