perf(async_data): move gather phase off forward thread into background executor #370
Draft
Copilot wants to merge 2 commits into
Previously, submit_store performed the gather kernel launch (including _event.wait() and gather_paged_kv_to_cpu()) directly on the forward thread. When the copy stream has a pending event-wait (for the forward pass to finish), CUDA runtime throttles the CPU as kernels queue up on a stream with unresolved dependencies, blocking the forward thread for ~38ms on every store.

This commit moves the entire gather phase into the background _commit_after_gather thread via the commit_executor. The forward thread now only does lightweight preparation (prepare_store, buffer allocation) and immediately submits the work and returns.

Background thread now:
1. Acquires copy stream context
2. Inserts event-level wait for forward completion
3. Launches gather_paged_kv_to_cpu()
4. Records gather_done event on copy stream
5. Adds gather_done to _inflight_gather_events (under lock)
6. Synchronizes gather_done (waits for GPU gather to finish)
7. Calls commit_store() and resolves the future

Also removes profiling remnants: import time, t00/t1/t2/t3/t4/t11 timing variables, Store Profiler logger.info calls, and the two torch_dev.synchronize() calls that were added for profiling only.
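The hand-off described in the commit message can be sketched without a GPU. This is a minimal illustration, not the code in this PR: `FakeEvent` and `AsyncStoreSketch` are hypothetical stand-ins, `threading.Event` models CUDA event record/synchronize, a string assignment stands in for `gather_paged_kv_to_cpu()`, and the removal of `gather_done` from the set after synchronization is an assumption rather than something the commit message states.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeEvent:
    """Hypothetical stand-in for a CUDA event (record / synchronize)."""

    def __init__(self):
        self._flag = threading.Event()

    def record(self):
        self._flag.set()

    def synchronize(self):
        self._flag.wait()


class AsyncStoreSketch:
    """Hypothetical class; only the method and attribute names mirror the PR."""

    def __init__(self):
        self.commit_executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self._inflight_gather_events = set()
        self.committed = []

    def submit_store(self, key, forward_done):
        # Forward thread: lightweight preparation only (stands in for
        # prepare_store + buffer allocation), then submit and return.
        buffer = {"key": key, "data": None}
        return self.commit_executor.submit(
            self._commit_after_gather, buffer, forward_done
        )

    def _commit_after_gather(self, buffer, forward_done):
        # Steps 1-2: wait for the forward pass (modelled as a blocking wait).
        forward_done.synchronize()
        # Step 3: the gather itself (placeholder for gather_paged_kv_to_cpu()).
        buffer["data"] = f"kv:{buffer['key']}"
        # Step 4: record gather_done.
        gather_done = FakeEvent()
        gather_done.record()
        # Step 5: track it under the lock, after record(), before synchronize().
        with self._lock:
            self._inflight_gather_events.add(gather_done)
        # Step 6: wait for the gather to finish.
        gather_done.synchronize()
        # Assumption: the event is dropped from the set once it has completed.
        with self._lock:
            self._inflight_gather_events.discard(gather_done)
        # Step 7: commit; the executor's Future resolves when this returns.
        self.committed.append(buffer)
        return buffer["key"]


store = AsyncStoreSketch()
forward_done = FakeEvent()
future = store.submit_store("block-0", forward_done)
# submit_store has returned even though the forward pass is not finished.
returned_before_forward_finished = not future.done()
forward_done.record()  # the forward pass completes
result = future.result(timeout=5)
store.commit_executor.shutdown()
```

In this sketch the caller gets a `Future` back immediately, and all waiting happens on the executor thread, which is the property the commit is after.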
Copilot AI changed the title from "[WIP] Move gather phase to background thread for submit_store method" to "perf(async_data): move gather phase off forward thread into background executor" on Jun 16, 2026
`submit_store` was launching `gather_paged_kv_to_cpu` on the forward thread while the copy stream had a pending event-wait for the forward pass. CUDA runtime back-pressures the CPU when queuing kernels to a stream with unresolved dependencies, blocking the forward thread ~38ms per store.

Changes

- Move gather entirely to background thread: The `with torch_dev.stream(self._copy_stream):` block (`_event.wait()`, `gather_paged_kv_to_cpu()`, and `gather_done.record()`) is now inside `_commit_after_gather` running on `commit_executor`. `submit_store` only does `prepare_store` + buffer allocation, then submits and returns immediately.
- `_inflight_gather_events` tracking preserved: `gather_done` is added to the set in the background thread under lock, after `record()`, before `synchronize()`. `flush_inflight_gathers()` semantics are unchanged.
- Remove profiling remnants: Dropped `import time`, all `t00`/`t1`/.../`t11` timing variables, the two `torch_dev.synchronize()` calls, and all `[Store Profiler]` log lines that were left in from profiling sessions.

Before (forward thread blocked):
After (forward thread returns immediately):