perf(async_data): move gather phase off forward thread into background executor by Copilot · Pull Request #370 · hlin99/LMCache

Copilot · 2026-06-16T06:23:20Z

submit_store was launching gather_paged_kv_to_cpu on the forward thread while the copy stream still had a pending event-wait on the forward pass. The CUDA runtime back-pressures the CPU when kernels are queued to a stream with unresolved dependencies, which blocked the forward thread for ~38 ms per store.

Changes

Move gather entirely to background thread: The with torch_dev.stream(self._copy_stream): block — _event.wait(), gather_paged_kv_to_cpu(), and gather_done.record() — is now inside _commit_after_gather running on commit_executor. submit_store only does prepare_store + buffer allocation, then submits and returns immediately.
_inflight_gather_events tracking preserved: gather_done is added to the set in the background thread under lock, after record(), before synchronize(). flush_inflight_gathers() semantics are unchanged.
Remove profiling remnants: Dropped import time, all t00/t1/.../t11 timing variables, the two torch_dev.synchronize() calls, and all [Store Profiler] log lines that were left in from profiling sessions.

Before (forward thread blocked):

with torch_dev.stream(self._copy_stream):
    _event.wait(stream=self._copy_stream)
    gather_paged_kv_to_cpu(...)   # ~38ms CPU wall time on forward thread
    gather_done = torch_dev.Event()
    gather_done.record(self._copy_stream)
# submit commit to background
commit_executor.submit(_commit_after_gather)

After (forward thread returns immediately):

def _commit_after_gather() -> None:
    with torch_dev.stream(self._copy_stream):
        _event.wait(stream=self._copy_stream)
        gather_paged_kv_to_cpu(...)   # runs fully off forward thread
        gather_done = torch_dev.Event()
        gather_done.record(self._copy_stream)
    with self._inflight_lock:
        self._inflight_gather_events.add(gather_done)
    gather_done.synchronize()
    non_gpu_context.commit_store(...)

commit_executor.submit(_commit_after_gather)  # submit_store returns here
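The threading structure of the "After" version can be exercised without a GPU. The following is a minimal sketch, not the PR's code: `submit_store_sketch`, `forward_done`, and the log entries are hypothetical names, a `threading.Event` stands in for the forward-pass event, and a `ThreadPoolExecutor` stands in for `commit_executor`. Unlike a stream-level CUDA event wait, `threading.Event.wait()` blocks the calling CPU thread, so this only illustrates which thread does the work and when the submitting thread returns.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-in for commit_executor.
commit_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="commit")
# Hypothetical stand-in for the forward-pass event the copy stream waits on.
forward_done = threading.Event()


def submit_store_sketch(log: list) -> Future:
    """Submit the whole gather phase to the background executor and return at once."""

    def _commit_after_gather() -> None:
        forward_done.wait()  # analog of _event.wait(...)
        log.append(("gather", threading.current_thread().name))  # analog of the gather
        log.append(("commit", threading.current_thread().name))  # analog of commit_store

    return commit_executor.submit(_commit_after_gather)


log: list = []
future = submit_store_sketch(log)

# The submitting thread is back here while the worker is still waiting.
returned_before_gather = (not future.done()) and (len(log) == 0)

forward_done.set()        # the simulated forward pass finishes
future.result(timeout=5)  # wait for the background work to complete
commit_executor.shutdown(wait=True)
```

Because the worker cannot proceed until `forward_done` is set, the check on `returned_before_gather` is deterministic: the submitting thread always gets control back before any gather or commit step has run.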

Previously, submit_store performed the gather kernel launch (including _event.wait() and gather_paged_kv_to_cpu()) directly on the forward thread. When the copy stream has a pending event-wait (for the forward pass to finish), CUDA runtime throttles the CPU as kernels queue up on a stream with unresolved dependencies, blocking the forward thread for ~38ms on every store.

This commit moves the entire gather phase into the background _commit_after_gather thread via the commit_executor. The forward thread now only does lightweight preparation (prepare_store, buffer allocation) and immediately submits the work and returns.

Background thread now:

1. Acquires copy stream context
2. Inserts event-level wait for forward completion
3. Launches gather_paged_kv_to_cpu()
4. Records gather_done event on copy stream
5. Adds gather_done to _inflight_gather_events (under lock)
6. Synchronizes gather_done (waits for GPU gather to finish)
7. Calls commit_store() and resolves the future

Also removes profiling remnants: import time, t00/t1/t2/t3/t4/t11 timing variables, Store Profiler logger.info calls, and the two torch_dev.synchronize() calls that were added for profiling only.
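Steps 5 to 7 of the background sequence (register the event under the lock, block on it, then commit) can be sketched on the CPU alone. This is an illustration under stated assumptions, not the PR's implementation: `GatherTrackerSketch`, `inflight_count`, and `shutdown` are hypothetical, a `threading.Event` stands in for `gather_done`, and both the body of `flush_inflight_gathers` and the removal of finished events from the set are guesses, since the PR shows neither.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class GatherTrackerSketch:
    """Hypothetical CPU-only analog of the in-flight gather bookkeeping."""

    def __init__(self) -> None:
        self._inflight_lock = threading.Lock()
        self._inflight_gather_events: set = set()
        self._commit_executor = ThreadPoolExecutor(max_workers=1)
        self.committed: list = []

    def submit_store(self, key: str, gather_done: threading.Event) -> Future:
        def _commit_after_gather() -> None:
            # Step 5: register the event under the lock before blocking on it.
            with self._inflight_lock:
                self._inflight_gather_events.add(gather_done)
            # Step 6: analog of gather_done.synchronize().
            gather_done.wait()
            # Step 7: analog of commit_store(...).
            self.committed.append(key)
            # Assumption: finished events are dropped here; the PR does not show this.
            with self._inflight_lock:
                self._inflight_gather_events.discard(gather_done)

        return self._commit_executor.submit(_commit_after_gather)

    def inflight_count(self) -> int:
        with self._inflight_lock:
            return len(self._inflight_gather_events)

    def flush_inflight_gathers(self) -> None:
        # Assumed behavior: snapshot under the lock, then wait outside it.
        with self._inflight_lock:
            pending = list(self._inflight_gather_events)
        for event in pending:
            event.wait()

    def shutdown(self) -> None:
        self._commit_executor.shutdown(wait=True)


tracker = GatherTrackerSketch()
gpu_finished = threading.Event()  # set when the simulated gather completes
future = tracker.submit_store("chunk-0", gpu_finished)

while tracker.inflight_count() == 0:  # wait until the worker has registered the event
    time.sleep(0.001)
count_while_pending = tracker.inflight_count()

gpu_finished.set()                # the simulated GPU gather finishes
tracker.flush_inflight_gathers()  # returns once every registered event is complete
future.result(timeout=5)
tracker.shutdown()
```

The usage code waits for the event to be registered before inspecting the set, because in this sketch registration happens on the worker thread rather than in `submit_store`.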

Initial plan

0d34393

Copilot AI assigned Copilot and hlin99 Jun 16, 2026

Copilot started work on behalf of hlin99 June 16, 2026 06:23

Copilot AI changed the title ~~[WIP] Move gather phase to background thread for submit_store method~~ perf(async_data): move gather phase off forward thread into background executor Jun 16, 2026

Copilot finished work on behalf of hlin99 June 16, 2026 06:33

Copilot AI requested a review from hlin99 June 16, 2026 06:33

Copilot wants to merge 2 commits into copilot/ww24-pr-async-again from copilot/move-gather-phase-to-background-thread
