Async data path by hlin99 · Pull Request #368 · hlin99/LMCache

hlin99 · 2026-06-15T02:28:22Z

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

Drop the dead-field cases (resumed_from_preemption / evicted_req_ids), which do not exist on vLLM main's CachedRequestData / SchedulerOutput, and keep only the real signals (resumed_req_ids, preempted_req_ids) plus the conservative unknown-schema fallback.

_scheduler_step_needs_flush previously probed two fields that do not exist on vLLM main's schema: CachedRequestData.resumed_from_preemption (replaced by resumed_req_ids) and SchedulerOutput.evicted_req_ids (never existed). Those getattr checks were dead code and the comment was inaccurate.

Verified against vLLM main (vllm/v1/core/sched/output.py):

- CachedRequestData.resumed_req_ids: set[str] -> real resume signal
- SchedulerOutput.preempted_req_ids: set[str] | None -> real preempt signal (populated unconditionally in scheduler.py)

Keep only those two real signals plus the conservative unknown-schema fallback (flush when scheduled_cached_reqs lacks resumed_req_ids). This matches the test cleanup in the previous commit; behavior on real vLLM is unchanged.
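A minimal sketch of the flush decision described in this commit message, assuming the scheduler output exposes `scheduled_cached_reqs` and `preempted_req_ids` as named above. The function body is illustrative and is not the PR's actual implementation; `SimpleNamespace` stands in for vLLM's real classes, and treating a missing `scheduled_cached_reqs` as unknown schema is an assumption.

```python
from types import SimpleNamespace


def scheduler_step_needs_flush(scheduler_output) -> bool:
    """Return True when pending async stores should be flushed first.

    Illustrative only: mirrors the two signals and the conservative
    fallback from the commit message, not the actual LMCache code.
    """
    cached = getattr(scheduler_output, "scheduled_cached_reqs", None)

    # Unknown schema (assumption: a missing container counts too):
    # without resumed_req_ids we cannot tell, so flush to be safe.
    if cached is None or not hasattr(cached, "resumed_req_ids"):
        return True

    # Real resume signal: a non-empty set of resumed request ids.
    if cached.resumed_req_ids:
        return True

    # Real preempt signal: the field may be None or an empty set.
    if getattr(scheduler_output, "preempted_req_ids", None):
        return True

    return False


# Stand-in objects for a quiet step and a step that resumes a request.
quiet_step = SimpleNamespace(
    scheduled_cached_reqs=SimpleNamespace(resumed_req_ids=set()),
    preempted_req_ids=None,
)
resumed_step = SimpleNamespace(
    scheduled_cached_reqs=SimpleNamespace(resumed_req_ids={"req-1"}),
    preempted_req_ids=None,
)
```

Only the quiet step avoids a flush here; any resume, preemption, or unrecognized schema takes the conservative path.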

…eMode error in commit thread

PyTorch's InferenceMode propagates to child threads. The commit thread inherits InferenceMode from the vLLM EngineCore main thread, causing `shm_view.copy_(staged)` to raise:

"Inplace update to inference tensor outside InferenceMode is not allowed"

Fix by explicitly exiting InferenceMode for the inplace copy operation.

…ant staging copy

When SHM out_buffers are available from prepare_store(), gather directly into them on the copy stream, matching the synchronous DataTransferContext behavior. This removes:

1. The redundant pinned staging buffer allocation for SHM path
2. The staged→shm_view copy_ in the commit thread
3. The InferenceMode error caused by that copy_

Only the pickle path (no SHM) still uses pinned staging buffers.
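The branch described above can be sketched in pure Python, with `memoryview` objects standing in for SHM out_buffers and pinned staging tensors. This is an analogy under stated assumptions, not the PR's code: `gather_blocks` and `store` are hypothetical names, and only `used_shm_direct` echoes a name that appears in this PR's commit titles.

```python
from typing import Optional, Sequence, Tuple


def gather_blocks(blocks: Sequence[bytes], out: memoryview) -> int:
    """Copy each block into `out` back to back; return bytes written."""
    offset = 0
    for block in blocks:
        out[offset:offset + len(block)] = block
        offset += len(block)
    return offset


def store(
    blocks: Sequence[bytes], shm_out: Optional[memoryview]
) -> Tuple[memoryview, bool]:
    """Return (buffer holding the gathered data, used_shm_direct)."""
    if shm_out is not None:
        # SHM path: the gather writes into its final destination,
        # so no staging allocation and no second copy are needed.
        gather_blocks(blocks, shm_out)
        return shm_out, True

    # Pickle path (no SHM): gather into a private staging buffer.
    staging = memoryview(bytearray(sum(len(b) for b in blocks)))
    gather_blocks(blocks, staging)
    return staging, False
```

With a destination buffer supplied, the returned buffer is that same object, which is the point of the change: the gathered data is never staged and copied again.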

Signed-off-by: Tony Lin <tony.lin@intel.com>

Previously, submit_store performed the gather kernel launch (including _event.wait() and gather_paged_kv_to_cpu()) directly on the forward thread. When the copy stream has a pending event-wait (for the forward pass to finish), CUDA runtime throttles the CPU as kernels queue up on a stream with unresolved dependencies, blocking the forward thread for ~38ms on every store.

This commit moves the entire gather phase into the background _commit_after_gather thread via the commit_executor. The forward thread now only does lightweight preparation (prepare_store, buffer allocation) and immediately submits the work and returns.

Background thread now:

1. Acquires copy stream context
2. Inserts event-level wait for forward completion
3. Launches gather_paged_kv_to_cpu()
4. Records gather_done event on copy stream
5. Adds gather_done to _inflight_gather_events (under lock)
6. Synchronizes gather_done (waits for GPU gather to finish)
7. Calls commit_store() and resolves the future

Also removes profiling remnants: import time, t00/t1/t2/t3/t4/t11 timing variables, Store Profiler logger.info calls, and the two torch_dev.synchronize() calls that were added for profiling only.
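The hand-off described above can be sketched without a GPU. In this pure-Python analogy, `threading.Event` stands in for CUDA events and a list copy stands in for gather_paged_kv_to_cpu(); the class name `AsyncStoreSketch` and the method signatures are hypothetical, and step 1 (acquiring the copy stream context) has no CPU equivalent and is omitted.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class AsyncStoreSketch:
    """CPU-only analogy of the forward/background split; not the PR's code."""

    def __init__(self) -> None:
        self._commit_executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self._inflight_gather_events = []
        self.committed = []

    def submit_store(self, key, payload, forward_done: threading.Event) -> Future:
        # Forward thread: hand the work off and return immediately.
        return self._commit_executor.submit(
            self._commit_after_gather, key, payload, forward_done
        )

    def _commit_after_gather(self, key, payload, forward_done):
        forward_done.wait()                  # step 2: wait for forward pass
        gathered = list(payload)             # step 3: stand-in for the gather
        gather_done = threading.Event()      # step 4: record completion marker
        gather_done.set()
        with self._lock:                     # step 5: track event under lock
            self._inflight_gather_events.append(gather_done)
        gather_done.wait()                   # step 6: synchronize on gather
        self.committed.append((key, gathered))  # step 7: commit the store
        return key                           # becomes the future's result
```

The caller gets a `Future` back before any waiting happens, so the blocking wait on forward completion is paid by the background thread rather than the forward thread.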

Signed-off-by: Tony Lin <tony.lin@intel.com>

… log

- worker_transfer.py: Add import time + timing to HandleTransferContext.submit_store() with [FWD-IPC] log covering ipc_handle, send_request, to_cuda_future, and total ms
- gpu_transfer.py: Add granular timing to GPUTransferModule.store() with [GPU-STORE] summary log and per-chunk [GPU-STORE-CHUNK] logs covering kernel launch and memcpy_d2h
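Per-stage timing of this kind can be sketched with a small context manager. The helper names `stage_timer` and `format_timing_log` and the exact log layout are assumptions for illustration; only the tag and stage names come from the commit message, and the total here is the sum of the recorded stages rather than a separate wall-clock measurement.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(timings: dict, name: str):
    """Record the wall time of the enclosed block, in milliseconds."""
    # The start time is set before the try block so the finally
    # clause can never hit an unbound variable.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0


def format_timing_log(tag: str, timings: dict) -> str:
    """Build one summary line such as '[TAG] stage=1.234ms total=1.234ms'."""
    parts = " ".join(f"{name}={ms:.3f}ms" for name, ms in timings.items())
    total = sum(timings.values())
    return f"[{tag}] {parts} total={total:.3f}ms"


# Usage: time two placeholder stages and build the summary line.
timings = {}
with stage_timer(timings, "ipc_handle"):
    pass  # placeholder for the real work
with stage_timer(timings, "send_request"):
    pass  # placeholder for the real work
summary = format_timing_log("FWD-IPC", timings)
```

Collecting durations in a dict keeps the instrumentation easy to strip later, which matters here because a later commit in this PR removes profiling remnants.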

…eError risk

…ed MP transfer primitive (LMCache#3508)

Signed-off-by: Tony Lin <tony.lin@intel.com>

…notable speedup (LMCache#3591)

* Perf: optimize Python fallback block transfer for 3x speedup
  - Optimize fallback block-id and D2H staging overhead
  - Restructure per-layer transfer loops to iterate over objects first then layers

  Signed-off-by: Tony Lin <tony.lin@intel.com>

* apply gemini's suggestion

  Signed-off-by: Tony Lin <tony.lin@intel.com>

* optimize flash_infer block transfer paths in python fallback

  Signed-off-by: Tony Lin <tony.lin@intel.com>

---------

Signed-off-by: Tony Lin <tony.lin@intel.com>

Signed-off-by: Tony Lin <tony.lin@intel.com>

Copilot AI and others added 28 commits June 6, 2026 00:57

Initial plan

da02f02

Make MP non-GPU store path fully async with preemption-aware flush

effa7e6

Gate async non-GPU store on device capability with sync fallback

486bd20

Add docstrings and clarify preemption schema comments per review

5b8a463

Refactor async non-GPU store into dedicated AsyncDataTransferContext

8a1e283

Improve AsyncDataTransferContext docstrings per review

2ba9c49

add logs

8be4642

Signed-off-by: Tony Lin <tony.lin@intel.com>

Fix SHM worker host registration

1dd3bd7

Polish SHM pinning validation logs

b0374ad

add logs

4aa4c12

Signed-off-by: Tony Lin <tony.lin@intel.com>

remove semaphore

9a3574a

Signed-off-by: Tony Lin <tony.lin@intel.com>

Add comprehensive profiling instrumentation to async_data.py

98eff82

Add comprehensive profiling instrumentation to async_data.py

0fe5e8a

fix log

20190bc

Signed-off-by: Tony Lin <tony.lin@intel.com>

Fix missing total argument and use outer-scope used_shm_direct in FWD…

fe220ce

… log

Fix timing variable scoping: initialize before try block to avoid Nam…

01c2d7e

…eError risk

Add E2E timing from submit_store_request to get_finished

357aab4

Remove redundant str() in E2E-STORE log call

11529a7

feat(ops): add multi_layer_block_kv_transfer Python fallback as unifi…

959b005

…ed MP transfer primitive (LMCache#3508)

Signed-off-by: Tony Lin <tony.lin@intel.com>

add log

49e060d

Signed-off-by: Tony Lin <tony.lin@intel.com>

add use_c_ops

3e6deea

Copilot AI mentioned this pull request Jun 17, 2026

Cherry-pick PR #368 async data path onto ww26_PR_async_data #374

Draft

Async data path #368
hlin99 wants to merge 28 commits into dev from copilot/ww24-pr-async-again
