perf(async_data): move gather phase off forward thread into background executor#370

Draft
Copilot wants to merge 2 commits into copilot/ww24-pr-async-again from copilot/move-gather-phase-to-background-thread

Copilot AI commented Jun 16, 2026

submit_store was launching gather_paged_kv_to_cpu on the forward thread while the copy stream had a pending event-wait for the forward pass. The CUDA runtime back-pressures the CPU when kernels are queued to a stream with unresolved dependencies, which blocked the forward thread for ~38 ms per store.

Changes

  • Move gather entirely to background thread: The with torch_dev.stream(self._copy_stream): block (_event.wait(), gather_paged_kv_to_cpu(), and gather_done.record()) now lives inside _commit_after_gather, which runs on commit_executor. submit_store only does prepare_store and buffer allocation, then submits _commit_after_gather and returns immediately.

  • _inflight_gather_events tracking preserved: gather_done is added to the set in the background thread under lock, after record(), before synchronize(). flush_inflight_gathers() semantics are unchanged.

  • Remove profiling remnants: Dropped import time, all t00/t1/.../t11 timing variables, the two torch_dev.synchronize() calls, and all [Store Profiler] log lines that were left in from profiling sessions.

Before (forward thread blocked):

with torch_dev.stream(self._copy_stream):
    _event.wait(stream=self._copy_stream)
    gather_paged_kv_to_cpu(...)   # ~38ms CPU wall time on forward thread
    gather_done = torch_dev.Event()
    gather_done.record(self._copy_stream)
# submit commit to background
commit_executor.submit(_commit_after_gather)

After (forward thread returns immediately):

def _commit_after_gather() -> None:
    with torch_dev.stream(self._copy_stream):
        _event.wait(stream=self._copy_stream)
        gather_paged_kv_to_cpu(...)   # runs fully off forward thread
        gather_done = torch_dev.Event()
        gather_done.record(self._copy_stream)
    with self._inflight_lock:
        self._inflight_gather_events.add(gather_done)
    gather_done.synchronize()
    non_gpu_context.commit_store(...)

commit_executor.submit(_commit_after_gather)  # submit_store returns here
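
The hand-off above can be sketched without a GPU. This is a minimal illustration, not the code in this PR: ToyAsyncStore is a hypothetical class, threading.Event stands in for the CUDA events, and a list append stands in for gather_paged_kv_to_cpu and commit_store. It only shows that submit_store returns before the gather has run, and that the background task follows the same wait, gather, record, track, synchronize, commit order.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ToyAsyncStore:
    """Hypothetical GPU-free stand-in for the store path in this PR."""

    def __init__(self):
        self._commit_executor = ThreadPoolExecutor(max_workers=1)
        self._inflight_lock = threading.Lock()
        self._inflight_gather_events = set()
        self.committed = []

    def submit_store(self, key, forward_done) -> Future:
        # Forward thread: only cheap preparation (buffer allocation) here.
        buffer = []

        def _commit_after_gather() -> None:
            forward_done.wait()            # stands in for _event.wait()
            buffer.append(key)             # stands in for gather_paged_kv_to_cpu()
            gather_done = threading.Event()
            gather_done.set()              # stands in for gather_done.record()
            with self._inflight_lock:      # track the event under the lock
                self._inflight_gather_events.add(gather_done)
            gather_done.wait()             # stands in for gather_done.synchronize()
            self.committed.extend(buffer)  # stands in for commit_store()

        # The forward thread returns as soon as the task is queued.
        return self._commit_executor.submit(_commit_after_gather)


store = ToyAsyncStore()
forward_done = threading.Event()
future = store.submit_store("block-0", forward_done)
done_at_return = future.done()   # the forward pass has not finished yet
forward_done.set()               # simulate the forward pass completing
future.result(timeout=5)         # wait for the background commit
store._commit_executor.shutdown(wait=True)
print(done_at_return, store.committed)
```

Because the background task blocks on forward_done, the future is still pending when submit_store returns, which mirrors the forward thread no longer waiting on the gather.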

Previously, submit_store performed the gather kernel launch (including
_event.wait() and gather_paged_kv_to_cpu()) directly on the forward
thread. When the copy stream has a pending event-wait (for the forward
pass to finish), the CUDA runtime throttles the CPU as kernels queue up
on a stream with unresolved dependencies, which blocked the forward
thread for ~38 ms on every store.

This commit moves the entire gather phase into _commit_after_gather,
which runs on a background thread via commit_executor. The forward
thread now only does lightweight preparation (prepare_store, buffer
allocation), then submits the work and returns immediately.

Background thread now:
1. Acquires copy stream context
2. Inserts event-level wait for forward completion
3. Launches gather_paged_kv_to_cpu()
4. Records gather_done event on copy stream
5. Adds gather_done to _inflight_gather_events (under lock)
6. Synchronizes gather_done (waits for GPU gather to finish)
7. Calls commit_store() and resolves the future

Also removes profiling remnants: import time, t00/t1/t2/t3/t4/t11
timing variables, Store Profiler logger.info calls, and the two
torch_dev.synchronize() calls that were added for profiling only.
Copilot AI changed the title from "[WIP] Move gather phase to background thread for submit_store method" to "perf(async_data): move gather phase off forward thread into background executor" Jun 16, 2026
Copilot AI requested a review from hlin99 June 16, 2026 06:33