feat(storage): add write-back to local CPU for non-blocking get paths#3

Draft
jooho-XCENA wants to merge 182 commits into dev from feat/write-back-non-blocking

@jooho-XCENA

Owner
  • get_non_blocking: add done callback to write-back fetched data to LocalCPUBackend, matching existing get() behavior
  • prefetch_single_done_callback: write-back prefetched data to LocalCPUBackend after async prefetch completes

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

@jooho-XCENA force-pushed the feat/write-back-non-blocking branch from 00642cc to 989e036 on April 3, 2026 07:39
- get_non_blocking: add done callback to write-back fetched data
  to LocalCPUBackend, matching existing get() behavior
- prefetch_single_done_callback: write-back prefetched data to
  LocalCPUBackend after async prefetch completes

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
Align error handling with prefetch_single_done_callback for
consistency. Prevents unhandled exceptions in Future callbacks.

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
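
A minimal sketch of the done-callback pattern these two commits describe, using only the standard library. The `local_store` dict is a hypothetical stand-in for LocalCPUBackend, and `make_write_back_callback` is an illustrative name, not a function from this PR.

```python
from concurrent.futures import Future
from typing import Callable, Dict, Optional


def make_write_back_callback(
    key: str, local_store: Dict[str, bytes]
) -> Callable[[Future], None]:
    """Build a done callback that copies a fetched value into a local store.

    `local_store` is a hypothetical stand-in for LocalCPUBackend.
    """

    def _write_back(fut: Future) -> None:
        # A cancelled fetch has no result to write back.
        if fut.cancelled():
            return
        try:
            value: Optional[bytes] = fut.result()
        except Exception:
            # A failed fetch must not raise out of the callback.
            return
        if value is not None:
            local_store[key] = value

    return _write_back


local_store: Dict[str, bytes] = {}

# Successful fetch: the callback writes the value back.
ok: Future = Future()
ok.add_done_callback(make_write_back_callback("chunk-0", local_store))
ok.set_result(b"kv-bytes")

# Failed fetch: the callback swallows the error and writes nothing.
failed: Future = Future()
failed.add_done_callback(make_write_back_callback("chunk-1", local_store))
failed.set_exception(RuntimeError("remote fetch failed"))
```

Catching the exception inside the callback keeps a failed fetch from surfacing as an unhandled error in the Future machinery, which is the consistency concern the second commit raises.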
@jooho-XCENA force-pushed the feat/write-back-non-blocking branch from 989e036 to e419eaf on April 3, 2026 07:57
jooho-XCENA and others added 26 commits April 3, 2026 08:10
Align with existing get() and batched_get() which exclude
MaruBackend from write-back to LocalCPUBackend.

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
* add new workload to cli bench

Signed-off-by: deng451e <838677410@qq.com>
…che#2922)

Add a top-level `gds_path_sharding` config field (default: "by_gpu")
that controls how GPUs are assigned to storage paths when multiple
comma-separated paths are provided in `gds_path`. This replaces the
previously hardcoded by_gpu logic with an explicit, extensible setting.

Currently only "by_gpu" is supported (selects path via
`device_id % num_paths`); unsupported values raise AssertionError.
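
The `by_gpu` selection rule can be sketched as follows. `select_path` and its signature are hypothetical; only the `device_id % num_paths` rule and the AssertionError on unsupported values come from the commit message.

```python
from typing import List


def select_path(gds_path: str, device_id: int, sharding: str = "by_gpu") -> str:
    """Pick one storage path for a GPU from a comma-separated path list."""
    # Only "by_gpu" is described in the commit message; anything else fails.
    assert sharding == "by_gpu", f"unsupported sharding strategy: {sharding}"
    paths: List[str] = [p.strip() for p in gds_path.split(",") if p.strip()]
    assert paths, "at least one path is required"
    return paths[device_id % len(paths)]


# Four GPUs spread over two paths: even device IDs use the first path.
assignments = [select_path("/mnt/a,/mnt/b", gpu) for gpu in range(4)]
```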

Generated with [Devin](https://cli.devin.ai/docs)

Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
Co-authored-by: Devin <noreply@cognition.ai>
…he#2949)

* [Feat]: Add environment variable support for RESP adapter auth

Support LMCACHE_RESP_USERNAME, LMCACHE_RESP_PASSWORD, LMCACHE_RESP_HOST,
and LMCACHE_RESP_PORT environment variables in both MP and non-MP modes.
Env vars are read inside the adapter at creation time so credentials are
never stored in the config object or printed in startup logs.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>

* [Feat]: Fix env var precedence and add unit tests for RESP env vars

Change precedence so config/CLI args override env vars (env vars serve
as defaults). Add unit tests for the precedence logic in both MP and
non-MP modes.

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
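
The precedence rule from this commit (explicit config or CLI values win, environment variables act as defaults) can be sketched like this. `resolve_setting` is a hypothetical helper, not the adapter's actual code; the environment variable name is taken from the commit message.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    config_value: Optional[str],
    env_name: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the config value if set, else fall back to the environment."""
    if config_value is not None:
        return config_value
    return environ.get(env_name)


fake_env = {"LMCACHE_RESP_HOST": "env-host"}

# Config value present: the environment variable is ignored.
from_config = resolve_setting("cli-host", "LMCACHE_RESP_HOST", fake_env)
# Config value absent: the environment variable supplies the default.
from_env = resolve_setting(None, "LMCACHE_RESP_HOST", fake_env)
```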

---------

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
)

* Refactor: Auto-align pd_buffer_size down to nearest chunk size multiple

- Add buffer size alignment logic to prevent assertion error
- Calculate aligned_buffer_size as (origin_size // chunk_size) * chunk_size
- Add informative logging when buffer size is adjusted
- Release excess buffer memory that can't be aligned
- Follows the same pattern as local_cpu_backend.py

Signed-off-by: Tony Lin <tony.lin@intel.com>
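
The alignment arithmetic in the bullets above, as a small sketch. `align_down` is an illustrative name, and the sizes in the example are made up.

```python
def align_down(origin_size: int, chunk_size: int) -> int:
    """Round origin_size down to the nearest multiple of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (origin_size // chunk_size) * chunk_size


origin_size = 1000
chunk_size = 256
aligned_buffer_size = align_down(origin_size, chunk_size)
# Bytes that cannot hold a whole chunk and could be released.
excess = origin_size - aligned_buffer_size
```

With these example numbers, three whole chunks fit (768 bytes) and 232 bytes are left over.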

* refine the code per gemini's suggestions

Signed-off-by: Tony Lin <tony.lin@intel.com>

* refine log msg

Signed-off-by: Tony Lin <tony.lin@intel.com>

* streamline pd backend buffer alignment

Signed-off-by: Tony Lin <tony.lin@intel.com>

* Fix test hang by adding backend cleanup

Signed-off-by: Tony Lin <tony.lin@intel.com>

* remove UT

Signed-off-by: Tony Lin <tony.lin@intel.com>

* add UT

Signed-off-by: Tony Lin <tony.lin@intel.com>

* doc update

Signed-off-by: Tony Lin <tony.lin@intel.com>

* rename function name

Signed-off-by: Tony Lin <tony.lin@intel.com>

---------

Signed-off-by: Tony Lin <tony.lin@intel.com>
)

* [Chore] Add CODEOWNERS for automated PR review assignments

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* [Chore] Add sammshen to resp L2 adapter ownership

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* [Chore] Add sammshen to csrc/storage_backends and native connector L2 adapters

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* [Chore] Add YaoJiayi to L2 eviction ownership

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* [Chore] Add OasisGit to multiprocess and http_server ownership

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
- Add type hints to _write_back closure (Future, CacheEngineKey)
- Update prefetch_single_done_callback docstring to reflect
  write-back behavior

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
…#2958)

chore(ci): push nightly baselines to LMCache-CI repo

The GitHub PAT for the main repo expired, causing nightly baseline
uploads to fail. Switch the upload target to the dedicated
LMCache/LMCache-CI repository instead of pushing to benchmarks-main
on the main LMCache repo.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Move write-back logic from prefetch_single_done_callback to
prefetch_all_done_callback to avoid caching non-contiguous chunks.

When a middle tier partially fails, subsequent tiers' chunks break
prefix continuity and are discarded by prefetch_all_done_callback.
Previously, prefetch_single_done_callback would have already cached
those invalid chunks. Now write-back only happens after prefix
continuity is validated, ensuring only valid chunks are cached.

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
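
A sketch of the prefix-continuity idea, assuming each chunk slot holds either a fetched value or `None` for a chunk that failed. `contiguous_prefix` is a hypothetical helper, not the logic in `prefetch_all_done_callback` itself.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def contiguous_prefix(chunks: Sequence[Optional[T]]) -> List[T]:
    """Keep chunks up to, but not including, the first missing one."""
    prefix: List[T] = []
    for chunk in chunks:
        if chunk is None:
            # Everything after the first gap breaks prefix continuity.
            break
        prefix.append(chunk)
    return prefix


# The third chunk failed, so the fourth is discarded even though it arrived.
fetched = ["c0", "c1", None, "c3"]
valid = contiguous_prefix(fetched)
```

Writing back only `valid` is what prevents a chunk like `"c3"` from being cached after the gap.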
Add source-backend filtering to prefetch_all_done_callback write-back,
matching the sync paths (get, batched_get). Chunks from LocalCPUBackend,
PDBackend, and MaruBackend are now skipped during write-back, avoiding
redundant re-submission.

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
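
A sketch of the source-backend filter, assuming each chunk is paired with the name of the backend it came from. The per-chunk pairing and the helper name are assumptions for illustration; only the three skipped backend names come from the commit message.

```python
from typing import FrozenSet, List, Sequence

# Backends whose chunks are skipped during write-back, per the commit message.
SKIPPED_SOURCES: FrozenSet[str] = frozenset(
    {"LocalCPUBackend", "PDBackend", "MaruBackend"}
)


def keys_to_write_back(
    chunk_keys: Sequence[str], source_backends: Sequence[str]
) -> List[str]:
    """Return keys of chunks whose source backend is eligible for write-back."""
    return [
        key
        for key, source in zip(chunk_keys, source_backends)
        if source not in SKIPPED_SOURCES
    ]


# "RemoteBackend" is a made-up name for a backend that is not skipped.
selected = keys_to_write_back(
    ["k0", "k1", "k2"],
    ["LocalCPUBackend", "RemoteBackend", "MaruBackend"],
)
```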
Add parameter descriptions and write-back behavior documentation
to reflect the new tier_backend_names parameter and write-back logic.

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
* Refactor remote plugin to accept multiple connectors.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Skip module_path/class_name check for built-in adapters

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Add related documentation

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Fix to use DynamicConnectorAdapter to load external connector plugin

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…MCache#2926)

* multiprocess: support per-group KV cache transfer with group_idx

- gpu_ops: add group_idx param to lmcache_memcpy_async_h2d/d2h,
  use memory_obj.get_tensor(group_idx) instead of memory_obj.tensor
- kv_layer_groups: add build_kv_layer_groups_from_list() to group
  layers by (shape, dtype) from a plain tensor list
- gpu_context: introduce per-group shape_descs_, hidden_dim_sizes_,
  group_kv_pointers_, and tmp_gpu_buffers_; update get_kv_buffer_shape,
  get_tmp_gpu_buffer, get_tmp_gpu_buffer_batched to accept group_idx;
  add get_shape_desc(group_idx) and get_group_kv_pointers(group_idx)
- server: update get_layout_desc, _store_loop, _retrieve_loop to
  iterate over all groups; fix skip_tokens_in_chunk upper bound to
  use batch_len instead of _BATCH_SIZE

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>

* fix: support vectorized KV transfer for non-16B-aligned head sizes

Add scalar type fallback hierarchy for block KV transfer kernel:
  head_bytes % 16 == 0  -> uint4    (16B, fastest)
  head_bytes % 4  == 0  -> uint32_t (4B)
  head_bytes % 2  == 0  -> uint16_t (2B)

This fixes the runtime error for MLA models where head_size=132 (uint8),
giving head_bytes=132 which is not divisible by 16 but is divisible by 4.

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
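
The fallback hierarchy can be modeled in a few lines of Python. This only illustrates the selection rule; the real kernel is CUDA, and the final 1-byte fallback for odd sizes is an assumption that the commit message does not state.

```python
def pick_vector_width(head_bytes: int) -> int:
    """Return the widest access width in bytes that divides head_bytes."""
    # 16 bytes -> uint4, 4 bytes -> uint32_t, 2 bytes -> uint16_t.
    for width in (16, 4, 2):
        if head_bytes % width == 0:
            return width
    # Assumption: byte-wise copy for odd sizes (not stated in the commit).
    return 1


# The MLA case from the commit message: 132 bytes is not 16-byte aligned
# but is divisible by 4.
mla_width = pick_vector_width(132)
```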

---------

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
12.9

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
LMCache#2801)

feat(disk): support multi-path local disk backend with path sharding

Allow `local_disk` to accept comma-separated paths (e.g.
"/mnt/nvme0/,/mnt/nvme1/") to use multiple NVMe devices.  Each GPU
worker selects one path at init time via the `local_disk_path_sharding`
strategy (currently only "by_gpu": device_id % num_paths), matching
the GDS backend approach LMCache#2817 and NIXL approach LMCache#2418.

- Path selected once in __init__; _key_to_path, write_file, read_file
  unchanged from upstream
- _parse_local_disk now uses startswith("file://") instead of regex,
  fixing file:// URIs without a trailing slash
- All directories created at startup
- Added local_disk_path_sharding config field (default: "by_gpu")
- Added tests and updated docs
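
A sketch of prefix-based parsing for a comma-separated `local_disk` value, assuming the `file://` scheme is simply stripped when present. `parse_local_disk_paths` is a hypothetical helper and may differ from the actual `_parse_local_disk`.

```python
from typing import List

FILE_SCHEME = "file://"


def parse_local_disk_paths(local_disk: str) -> List[str]:
    """Split a comma-separated value and strip an optional file:// prefix."""
    paths: List[str] = []
    for raw in local_disk.split(","):
        entry = raw.strip()
        if not entry:
            continue
        if entry.startswith(FILE_SCHEME):
            # A prefix check works with or without a trailing slash.
            entry = entry[len(FILE_SCHEME):]
        paths.append(entry)
    return paths


parsed = parse_local_disk_paths("file:///mnt/nvme0,/mnt/nvme1/")
```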

Before this change, the only way to increase performance was to use
one of the Linux multipathing technologies to aggregate I/O.

Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
vLLM nightly now requires PyTorch 2.11.0 which is built against
CUDA 13.0. Update the CI base image to match.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
document long-doc-permutator workload

Signed-off-by: deng451e <838677410@qq.com>
* update csrc to support native launch host func

* add deadlock ci test

Signed-off-by: ApostaC <yihua98@uchicago.edu>
* fix typo bug

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* fix: rename hidden_dim_size to hidden_dim_sizes in describe and server

Align with the rename introduced in LMCache#2926 where hidden_dim_size was
changed to hidden_dim_sizes (List[int]) to support kv_groups.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>

* fix: update test fixture to use hidden_dim_sizes key

Update test fixture and assertion in test_describe.py to match the
hidden_dim_size -> hidden_dim_sizes rename from LMCache#2926.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>

---------

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* pin version
* pin cu128 wheel

Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: idellzheng <idellzheng@tencent.com>
* update prometheus version to fix ut

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* fix otel sdk version

Signed-off-by: ApostaC <yihua98@uchicago.edu>

---------

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Changing office hours from Thursdays to Wednesdays

Signed-off-by: Nicolas (Nick) Barcet <nijaba@tensormesh.ai>
sammshen and others added 30 commits May 5, 2026 11:55
…V-cache (LMCache#3195)

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
…3194)

Reuses the existing L2 long_doc_qa step instead of paying for a second
model load. Two changes to that script:

1. Bump --metrics-sample-rate to 1.0 on the L2 relaunch so the
   histograms record on every event. The default 0.01 would leave them
   empty in this short workload and flake the assertions.

2. After the existing L2 data-flow checks, add a "Step 5" block that
   asserts every metric we publish from MP mode actually advances:
   - newer counters with label dimensions advance > 0 and carry the
     expected label (l2_store_completed/l2_load_completed by l2_name,
     lookup_requested_tokens/lookup_hit_tokens by model_name,
     num_chunks_loaded by worker_id)
   - the four throughput histograms record at least one observation
     (lmcache_mp_l0_l1_*_throughput_gbs and lmcache_mp_l2_*_throughput_gbs)

The label-presence check catches the case where a counter fires but the
attribute plumbing broke — e.g. a future refactor that drops the
attribute at emit time but still ticks the counter.

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix: missing lock in HFBucketConnector.close() when clearing metadata cache

Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io>
Signed-off-by: aeon-x <talexcao@gmail.com>
Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: Sangyoon Kwon <syk0905.kwon@samsung.com>
* [Feat]: Implement batch operations in MooncakeConnector for improved efficiency and error handling

Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn>

* [Test]: isolate Mooncake RDMA adapter integration test

Close the default TCP adapter before creating the RDMA adapter in the
buffer-backed Mooncake integration test so Mooncake master does not allocate
test replicas on a TCP segment. Also use the native `rdma_devices` config key
expected by Mooncake.

Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn>

* [Fix]: Treat Mooncake exists errors as misses

Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn>

* [Feat]: Implement delete operations in MooncakeConnector

Add do_single_delete and do_batch_delete to the Mooncake storage
backend, with integration tests covering key deletion, mixed
existing/missing batch deletes, and usage tracking updates.

Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn>

* [Fix]: resolve ruff F841 and minor formatting cleanups

Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn>

---------

Signed-off-by: fangchizheng <fangchizheng@mail.ustc.edu.cn>
Co-authored-by: maobaolong <baoloongmao@tencent.com>
* Add DAX L2 adapter for MP mode

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* Document DAX MP adapter APIs

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

* Use global eviction for DAX storage

Remove DAX core's internal victim selection so slot pressure is handled by LMCache's global MP L2 eviction controller. Update DAX tests to cover full-arena behavior, slot-based cache_salt accounting, and StorageManager-driven L2 eviction.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>

Signed-off-by: DongDongJu <commisori28@gmail.com>

---------

Signed-off-by: DongDongJu <commisori28@gmail.com>
Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Signed-off-by: Dongjoo Seo <commisori28@gmail.com>
)

* [Obs] Expose blend token-level hit-rate counters

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
)

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
…ndpoints in TP=1 non-MP mode (LMCache#3146)

* fix(LMCache#3104): use per-instance FastAPI app to fix 503 on cache endpoints in TP=1 non-MP mode

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* fix

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* Add support for AZURE_BLOB NIXL backend

The NIXL plugin uses Azure Blob Storage as an object store backend
instead of S3. It is designed as a drop-in replacement for the OBJ
backend, behaving functionally the same and only differing by
required configurations. Currently, it only supports CPU to object
store offloading. There is currently no GPU direct support.

Most of the work was allow-listing the AZURE_BLOB plugin in code
paths where the OBJ plugin was configured. Specifically, updated
the following LMCache interfaces to support AZURE_BLOB:

* KV cache offloading with the NIXL storage backend for both static
  and dynamic pools
* L2 storage for nixl_store. Note it was not added to the
  nixl_store_dynamic because the OBJ plugin was not supported there
  either

Signed-off-by: Kyle Knapp <kyleknapp@microsoft.com>

* Fix indent in azure config sample

Signed-off-by: Kyle Knapp <kyleknapp@microsoft.com>

---------

Signed-off-by: Kyle Knapp <kyleknapp@microsoft.com>
Signed-off-by: idellzheng <idellzheng@tencent.com>
…che#3092)

* [ROCm] Add Triton block-sparse attention backend for CacheBlend

Adds LMCTritonSparseBackend as a drop-in replacement for
LMCFlashInferSparseBackend that works on both CUDA and ROCm
via Triton kernels (no flashinfer dependency).

Signed-off-by: Andy Luo <andyluo7@users.noreply.github.com>
…he#3185)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
ci(k3-unit-tests): route unit job to k8s queue

The k3-unit-tests pipeline-level config targets the k8s queue for the
upload step, but the inner job spec still pinned agents.queue to
k3-h200-local. With no agents on k3-h200-local, every spawned unit-test
job sat indefinitely as 'waiting'. Align with the other k3 pipelines
(blend, multiprocess, integration, comprehensive, correctness) which
all run on k8s.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
ci(comprehensive/pd): write prefiller/decoder/proxy logs to repo root

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
* feat(pd_backend): fully async PD KV transfer backend

Replace sync PDBackend with async implementation:
- Non-blocking transfer: batched_submit_put_task returns immediately
  (fire-and-forget enqueue) instead of blocking vLLM worker thread
  until remote alloc + RDMA write complete
- Event-driven flow control: replace time.sleep busy-wait polling
  with Condition-based notification, waking immediately when
  resources are freed
- Self-contained resource release: remove() internally calls
  ref_count_down() and decrements inflight counter, eliminating
  caller responsibility for manual cleanup. cache_engine.py
  updated with _is_sync_pd_backend() guard to prevent double-free
- Startup capacity validation: new pd_max_prefill_len config raises
  ValueError at init if buffer cannot hold the max prefill length,
  catching misconfiguration before runtime
- Configurable timeouts: pd_allocation_timeout_sec,
  pd_shutdown_timeout_sec, pd_condition_poll_interval_sec replace
  scattered hardcoded constants
- Backward compatible: split into pd_backend.py (sync) and
  pd_backend_async.py (async), selectable per-instance via
  pd_backend_mode config (default: "async"). Sync and async
  instances can coexist in the same cluster — e.g. sender on
  async while receiver on sync, or vice versa — with no wire
  protocol incompatibility

Signed-off-by: Tony Lin <tony.lin@intel.com>
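
The event-driven flow control bullet can be illustrated with `threading.Condition` from the standard library. `InflightLimiter` is a hypothetical class written for this sketch and is not taken from `pd_backend_async.py`.

```python
import threading


class InflightLimiter:
    """Bound the number of in-flight transfers without sleep-based polling."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._inflight = 0
        self._cond = threading.Condition()

    def acquire(self, timeout: float) -> bool:
        """Wait until a slot is free; return False if the timeout expires."""
        with self._cond:
            got_slot = self._cond.wait_for(
                lambda: self._inflight < self._capacity, timeout=timeout
            )
            if got_slot:
                self._inflight += 1
            return got_slot

    def release(self) -> None:
        """Free a slot and wake one waiter."""
        with self._cond:
            self._inflight -= 1
            self._cond.notify()


limiter = InflightLimiter(capacity=1)
first = limiter.acquire(timeout=0.1)
# Another thread frees the slot shortly; the waiter is notified when it does.
threading.Timer(0.05, limiter.release).start()
second = limiter.acquire(timeout=2.0)
# The slot is taken again and nobody releases it, so this times out.
third = limiter.acquire(timeout=0.05)
```

A waiter blocked in `wait_for` is woken by `notify()` as soon as a slot is released, instead of discovering the free slot on its next polling interval.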