Enable async scheduling for decode-only inference #4604
Draft
lmcafee-nvidia wants to merge 70 commits into NVIDIA:main from
Conversation
Plan 1: Add decode-only async scheduling for the GPT greedy path by gating eligible steps, replaying a captured sample/H2D/forward CUDA graph, chaining the next graph launch, and removing hidden timing synchronizations.

Commit 1: Add async decode scheduling scaffolding
- Added the config and command-line switch for async scheduling.
- Added context and controller state for decode-only async setup.
- Added focused tests for the new config and context behavior.

Commit 2: Add async scheduling decode e2e coverage
- Added end-to-end decode coverage with async scheduling enabled.
- Checked that the engine still produces the expected decode output.
- Added test hooks for the async controller path.

Commit 3: Require full-iteration graphs for async scheduling
- Blocked async scheduling unless full-iteration CUDA graphs are enabled.
- Kept non-graph decode steps on the normal synchronous path.
- Made the async eligibility check stricter and safer.

Commit 4: Capture async decode graph replay
- Captured the async decode work so it can be replayed later.
- Stored the replay state needed by the controller.
- Added tests for the captured replay path.

Commit 5: Launch chained async decode graphs
- Launched the next async decode graph earlier in the step.
- Reduced the host gap between sampling and the next forward pass.
- Added tests that cover chained async graph launches.

Commit 6: Avoid single-rank CUDA sync in inference timing
- Changed single-rank timing to avoid a CUDA synchronization.
- Kept benchmark timing from breaking GPU and CPU overlap.
- Preserved the existing synchronized timing behavior where needed.
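The step gating in Commits 1 and 3 can be sketched as a single predicate. This is a minimal illustration only, not this PR's code; `StepState`, `step_is_async_eligible`, and every field name are hypothetical stand-ins for the real config and controller state.

```python
from dataclasses import dataclass

@dataclass
class StepState:
    """Snapshot of controller state used to gate one decode step.
    All field names are hypothetical."""
    async_scheduling_enabled: bool   # the new command-line switch
    decode_only: bool                # no prefill work in this step
    full_iteration_cuda_graph: bool  # whole step captured as one graph
    greedy_sampling: bool            # Plan 1 covers the greedy path only

def step_is_async_eligible(state: StepState) -> bool:
    """A step may replay the captured sample/H2D/forward graph and chain
    the next launch only when every gate below holds."""
    return (
        state.async_scheduling_enabled
        and state.decode_only
        and state.full_iteration_cuda_graph
        and state.greedy_sampling
    )

# Prefill steps stay on the normal synchronous path:
print(step_is_async_eligible(StepState(True, False, True, True)))  # False
# Decode-only greedy steps with a full-iteration graph may go async:
print(step_is_async_eligible(StepState(True, True, True, True)))   # True
```

Requiring the full-iteration graph up front (Commit 3) keeps ineligible decode steps on the existing path rather than half-entering the async one.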
Plan 2: Extend async scheduling to hybrid MTP decode-only runs by sharing metadata preparation, enabling hybrid/Mamba plus MoE/EP graph-safe paths, and adding MTP async parity coverage.

Commit 1: Share decode metadata preparation
- Pulled shared decode metadata setup into reusable context code.
- Let sync and async decode use the same preparation logic.
- Removed duplicate metadata-building logic from the context path.

Commit 2: Enable async decode for hybrid and MoE
- Allowed async decode to run for hybrid and MoE model paths.
- Added the metadata handling those model paths need during decode.
- Added tests for hybrid and MoE async decode eligibility.

Commit 3: Enable async decode for MTP
- Added async decode handling for MTP generation.
- Wired hybrid model graph support needed by the MTP path.
- Added context and engine tests for MTP async decode behavior.
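The shared-preparation refactor in Commit 1 amounts to both decode paths calling one helper instead of duplicating the build logic. A rough sketch, with an entirely illustrative context layout and helper name:

```python
def prepare_decode_metadata(context):
    """Shared metadata preparation used by both decode paths
    (hypothetical dict layout; the real metadata is richer)."""
    return {
        "active_request_ids": list(context["active"]),
        "seq_lens": {r: context["lengths"][r] for r in context["active"]},
    }

def sync_decode_step(context):
    meta = prepare_decode_metadata(context)   # same helper...
    return ("sync", meta)

def async_decode_step(context):
    meta = prepare_decode_metadata(context)   # ...no duplicated logic
    return ("async", meta)

ctx = {"active": [0, 1], "lengths": {0: 5, 1: 9}}
# Both paths see identical metadata by construction:
print(sync_decode_step(ctx)[1] == async_decode_step(ctx)[1])  # True
```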
Plan 3: Tighten the current F->S->H2D->F async path by avoiding sample-copy waits and moving the MTP launch-critical tail toward a prequeued graph packet without attempting full F->S->F.

Commit 1: Add async decode sample buffer ring
- Added a small ring of sample buffers for async decode.
- Avoided waiting on the previous sample copy before launching more work.
- Added tests that cover sample buffer rotation.

Commit 2: Capture async MTP forward graphs
- Captured MTP forward work for async graph replay.
- Reused captured MTP graphs in the async decode path.
- Added tests for async MTP graph capture and replay.

Commit 3: Defer MTP block release after async launch
- Delayed MTP block release until after the async launch uses needed state.
- Prevented async MTP work from losing blocks too early.
- Added test coverage for the deferred release path.
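Commit 1's buffer ring can be sketched in a few lines. The class name, depth, and CPU-side lists are all stand-ins; the point is only the rotation pattern that lets the controller hand out a fresh buffer instead of waiting for the previous sample copy to drain.

```python
class SampleBufferRing:
    """Toy rotating ring of sample buffers (hypothetical shape; the
    real buffers live on the GPU). With depth >= 2, the next launch
    writes into a different buffer than the in-flight copy reads."""

    def __init__(self, depth: int, batch_size: int):
        self.buffers = [[0] * batch_size for _ in range(depth)]
        self.index = 0

    def next_buffer(self):
        """Return the current buffer and advance the rotation."""
        buf = self.buffers[self.index]
        self.index = (self.index + 1) % len(self.buffers)
        return buf

ring = SampleBufferRing(depth=2, batch_size=4)
a = ring.next_buffer()
b = ring.next_buffer()
c = ring.next_buffer()
# Consecutive steps get distinct buffers, then the ring wraps around:
print(a is not b, a is c)
```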
lmcafee-nvidia force-pushed from ad82a9c to 88770c5.
Plan 4: Extend async scheduling across decode-only boundary conditions by speculating safe request finishes and KV-cache transitions, then reconciling CPU state after the speculative launch.

Commit 1: Preserve async forward rows across finishes
- Recorded the active request order for each speculative async forward.
- Sampled row-mapped pending logits correctly after finished requests were compacted.
- Added GPT finish-boundary coverage that compares async output to serial output.

Commit 2: Reserve KV blocks for async decode boundaries
- Reserved one-step-ahead KV blocks before speculative decode forwards that cross a block boundary.
- Adopted reserved blocks for continuing requests and deferred unused block release until the forward retired.
- Added context and GPT engine tests for boundary reservation, adoption, and deferred release.

Commit 3: Cover async boundary reconciliation cases
- Exhaustively tested two-request boundary adoption and deferred release combinations.
- Added staggered request-admission coverage while async forwards are pending.
- Confirmed boundary and admission behavior still matches serial decode outputs.
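Commit 2's reserve-then-reconcile flow can be modeled with a toy free-list allocator. Everything here is illustrative (block size, method names, single-block reservation); the sketch only shows reserving a block ahead of a speculative forward and deciding its fate once the forward retires.

```python
BLOCK_SIZE = 4  # hypothetical KV block size in tokens

def crosses_block_boundary(seq_len: int) -> bool:
    """The next decoded token needs a fresh KV block exactly when the
    current length fills the last block."""
    return seq_len % BLOCK_SIZE == 0

class KVCache:
    """Toy free-list model of one-step-ahead block reservation."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.reserved = {}  # request id -> reserved block
        self.owned = {}     # request id -> adopted blocks

    def reserve_for_speculative_step(self, req, seq_len):
        """Reserve a block before the speculative forward launches."""
        if crosses_block_boundary(seq_len) and req not in self.reserved:
            self.reserved[req] = self.free.pop()

    def reconcile(self, req, finished):
        """After the forward retires: adopt the block for continuing
        requests, or release the unused block for finished ones."""
        block = self.reserved.pop(req, None)
        if block is None:
            return
        if finished:
            self.free.append(block)
        else:
            self.owned.setdefault(req, []).append(block)

cache = KVCache(num_blocks=2)
cache.reserve_for_speculative_step(req=0, seq_len=8)  # 8 % 4 == 0
cache.reconcile(req=0, finished=False)                # continuing: adopt
print(cache.owned)  # {0: [1]}
```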
Plan 5: Complete decode-only async lifecycle boundary reconciliation for add, pause, evict, and GPT/MTP finish without invalidating in-flight speculative forwards.

Commit 1: Add async lifecycle test harness
- Add reusable lifecycle counters for finish, MTP finish, pause, evict, and add deferral.
- Add baseline tests that pin the current pause and MTP finish disable behavior.
- Add parity and pending-forward harness coverage for later lifecycle commits.

Commit 2: Defer request admission across pending async forwards
- Hold new waiting requests out of the context until the pending async forward is consumed or rejected.
- Preserve the speculative forward for the old active decode set when staggered arrivals occur.
- Resume normal admission immediately after async reconciliation completes.

Commit 3: Complete GPT finish reconciliation
- Accept pending async forwards when GPT requests finish by length or termination id.
- Discard extra sampled tokens for finished rows while preserving continuing rows.
- Cover mixed continue/finish row maps and all-finished cleanup.

Commit 4: Reconcile MTP finish boundaries
- Reuse row-mapped pending MTP forwards when finished rows leave the active set.
- Allow MTP length and termination finishes without disabling async scheduling.
- Trim accepted speculative tokens at true termination boundaries.

Commit 5: Reconcile speculative pause boundaries
- Keep async scheduling enabled for active rows while paused rows remain in the context.
- Prepare decode-only async metadata from the active slice after paused requests move left.
- Cover GPT and hybrid MTP pause token preservation and async continuation.

Commit 6: Reconcile speculative eviction boundaries
- Clear stale paused token buffers when eviction drains the paused set.
- Preserve async pending forwards when evicted rows disappear from the active request order.
- Cover GPT and hybrid MTP eviction boundaries, KV release, and Mamba slot release.

Commit 7: Cover lifecycle boundary interactions
- Combine add, pause, evict, finish, and KV block transitions in focused interaction tests.
- Verify GPT and hybrid MTP async outputs match serial outputs across staggered arrivals and memory pressure.
- Assert async scheduling remains active across all supported lifecycle boundaries.

Commit 8: Skip idle MTP termination sync
- Avoid copying accepted MTP tokens to CPU when all active requests have token-id termination disabled.
- Keep accepted speculative-token termination handling for requests that do use a termination id.
- Add focused tests that guard the no-sync fast path and the termination-hit correctness path.

Commit 9: Gate async add barriers by capacity
- Keep async decode chained while waiting requests cannot fit in the context.
- Only break the async chain when a waiting request can actually be admitted.
- Add focused coverage for the full-context waiting-request case.
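Commit 9's capacity gate reduces to a small decision: a waiting request only warrants breaking the async chain if it could actually be admitted. A sketch under that reading, with a hypothetical signature and a simple slot-count capacity model:

```python
def should_break_async_chain(waiting, active, max_requests):
    """Break the async decode chain only when a waiting request can
    actually be admitted. While the context is full, waiting requests
    keep decode chained instead of forcing a barrier every step.
    (Hypothetical helper; the real capacity check also involves
    KV-cache and slot availability.)"""
    has_capacity = len(active) < max_requests
    return bool(waiting) and has_capacity

# Context full: keep chaining even though a request is waiting.
print(should_break_async_chain(["r3"], ["r0", "r1"], max_requests=2))  # False
# A slot opened up: break the chain so the new request can be admitted.
print(should_break_async_chain(["r3"], ["r0"], max_requests=2))        # True
# Nothing waiting: never break the chain.
print(should_break_async_chain([], ["r0"], max_requests=2))            # False
```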
Plan 8: Support generated-token logprobs and top-n logprobs without disabling async decode scheduling.

Commit 6: Avoid generated-logprob metadata scans on requests that never asked for logprobs
- Request admission records whether any request needs generated logprob bookkeeping.
- The async decode hot path skips the extra metadata scan when no logprob request has been seen.
- A focused test covers the no-logprob fast gate and verifies logprob requests still enable the scan.

Commit 7: Skip sampling/logprob bookkeeping on async steps unless generated logprobs need it
- Async and pending-forward steps without logprob requests now keep the original fast path.
- Generated-logprob requests still collect bookkeeping when logits are available for calculation.
- Focused tests cover the no-logprob skip path, launched-sample skip path, and generated-logprob collect path.
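The fast gate in Commits 6 and 7 is essentially a sticky flag set at admission time. A minimal sketch, assuming a dict-shaped request and hypothetical names throughout:

```python
class LogprobGate:
    """Sticky flag sketch: admission records whether any request has
    asked for generated logprobs; the async hot path skips metadata
    scans until the flag is set. (Names are illustrative.)"""

    def __init__(self):
        self.any_logprob_request_seen = False

    def on_admit(self, request):
        """Called once per admitted request; dict key is hypothetical."""
        if request.get("return_logprobs"):
            self.any_logprob_request_seen = True

    def needs_logprob_scan(self):
        """Hot-path check: cheap flag read instead of a metadata scan."""
        return self.any_logprob_request_seen

gate = LogprobGate()
gate.on_admit({"prompt": "a"})                           # no logprobs asked
print(gate.needs_logprob_scan())                          # False: fast path
gate.on_admit({"prompt": "b", "return_logprobs": True})
print(gate.needs_logprob_scan())                          # True: scan enabled
```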
Plan 9: Support async decode for non-greedy sampling modes while preserving deterministic reconciliation.

Commit 1: Generalize async sampling without enabling non-greedy requests yet
- Added a shared helper that samples into the next-step input buffer.
- Kept the greedy fast path for existing async decode graph launches.
- Forwarded async decode graph test config so graph-focused tests exercise the intended path.

Commit 2: Allow top-k requests to use async forward overlap
- Removed the top-k eligibility block for decode-only async scheduling.
- Kept top-k off the greedy captured decode graph so random sampling uses eager sampling plus async forward.
- Added focused GPT tests that compare serial and async top-k output.

Commit 3: Allow top-p requests to use async forward overlap
- Removed the top-p eligibility block for decode-only async scheduling.
- Routed top-p sampling through eager sampling and the existing async forward path.
- Added focused GPT tests that compare serial and async top-p output.

Commit 4: Make dynamic non-greedy RNG independent of batch scheduling
- Added per-request CUDA generators for dynamic non-greedy sampling.
- Reused each request's generator across top-k, top-p, and speculative sampling helpers.
- Added a staggered-add test that catches RNG drift when async defers admission.

Commit 5: Validate non-greedy async scheduling for hybrid MTP decode
- Added hybrid MTP top-k and top-p parity tests for async decode.
- Checked that MTP async eligibility allows non-greedy sampling.
- Confirmed hybrid MTP uses async forward overlap without the greedy decode graph.
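Commit 4's per-request generators can be illustrated with `random.Random` standing in for per-request CUDA generators. Seeding each request's stream independently means its token draws are identical no matter when other requests are admitted, which is exactly the drift the staggered-add test guards against. All names here are hypothetical.

```python
import random

class RequestRNG:
    """Per-request generator registry; `random.Random` stands in for
    a per-request CUDA generator in this sketch."""

    def __init__(self):
        self.generators = {}

    def for_request(self, req_id, seed):
        """Create the request's generator lazily, then reuse it across
        top-k, top-p, and speculative sampling."""
        if req_id not in self.generators:
            self.generators[req_id] = random.Random(seed)
        return self.generators[req_id]

# Request 7 draws the same sequence regardless of batch scheduling:
rngs = RequestRNG()
first = [rngs.for_request(7, seed=1234).random() for _ in range(3)]

rngs2 = RequestRNG()
rngs2.for_request(99, seed=5)  # an earlier staggered arrival changes nothing
second = [rngs2.for_request(7, seed=1234).random() for _ in range(3)]

print(first == second)  # True: no RNG drift when async defers admission
```

A shared batch-level generator would fail this property: interleaving another request's draws shifts everyone else's stream.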
Plan 10: Make prompt and echo logprob requests compatible with async decode after prefill.

Commit 1: Stop treating completed prompt logprob work as a decode-time async blocker
- Removed the decode-only async gate that rejected prompt logprob metadata after prefill.
- Kept non-decode steps ineligible so prompt logprobs are still collected during prefill.
- Updated the eligibility matrix for GPT and hybrid MTP prompt logprob requests.

Commit 2: Validate async decode resumes after prompt logprob prefill
- Added a GPT parity test with prompt and generated logprobs enabled.
- Verified prompt logprobs are collected before decode-only async scheduling starts.
- Confirmed generated tokens and logprobs match the serial path after async resume.

Commit 3: Validate prompt top-N logprob compatibility with async decode
- Added async GPT parity coverage for prompt and generated top-N logprobs.
- Verified prompt top-N results are collected during prefill before async decode resumes.
- Tightened the existing dynamic top-N test to avoid stochastic EOS and sampling mismatches.

Commit 4: Validate generate API prompt logprob compatibility with async decode
- Added generate API parity coverage for prompt and generated logprobs with top-N results.
- Let the generate API test helper build a full-prompt-logit context when prompt logprobs are requested.
- Confirmed async decode resumes after prompt logprob prefill without using the unsupported decode graph path.
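Commit 1's relaxed gate can be read as: prompt logprob work is confined to prefill, so once prefill is done it no longer blocks async decode. A sketch under that reading, with a hypothetical helper name and boolean inputs:

```python
def async_eligible_with_prompt_logprobs(step_is_decode_only,
                                        prompt_logprobs_requested,
                                        prefill_done):
    """Hypothetical eligibility check: non-decode steps stay
    ineligible (prompt logprobs are collected there), but a finished
    prompt-logprob request no longer blocks decode-time async."""
    if not step_is_decode_only:
        return False          # prefill stays on the synchronous path
    if prompt_logprobs_requested and not prefill_done:
        return False          # prompt logprobs still being collected
    return True

# Prefill step with prompt logprobs: synchronous, logprobs collected.
print(async_eligible_with_prompt_logprobs(False, True, False))  # False
# Decode step after prefill: async resumes despite the logprob request.
print(async_eligible_with_prompt_logprobs(True, True, True))    # True
```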
Plan 11: Allow nonzero inference logging intervals to use async scheduling by making logged steps explicit pipeline barriers instead of a global async disable.

Commit 1: Add a controller-level barrier for temporary async scheduling breaks
- Added set and clear helpers for per-step async scheduling barriers.
- Stopped treating every nonzero logging interval as a global async disable.
- Added focused eligibility tests for logging intervals and barrier diagnostics.

Commit 2: Wire logging steps into temporary async scheduling barriers
- Added engine-side barriers before and during timed logging steps.
- Let async scheduling resume on decode steps outside the logging barrier window.
- Added focused GPT, hybrid MTP, and interval-one tests for logging-enabled async decode.
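The set/clear barrier helpers of Commit 1 can be sketched as a small piece of controller state. Method and attribute names are hypothetical; the point is that a logged step sets a one-shot barrier instead of a nonzero logging interval globally disabling async scheduling.

```python
class AsyncController:
    """Per-step barrier sketch (names illustrative): a barrier makes
    one step synchronous; clearing it lets async decode resume."""

    def __init__(self):
        self.barrier_reason = None

    def set_step_barrier(self, reason):
        """Mark the coming step as an explicit pipeline barrier,
        recording why for diagnostics."""
        self.barrier_reason = reason

    def clear_step_barrier(self):
        self.barrier_reason = None

    def step_is_async(self):
        return self.barrier_reason is None

ctrl = AsyncController()
ctrl.set_step_barrier("timed logging step")
print(ctrl.step_is_async())   # False: this step runs synchronously
ctrl.clear_step_barrier()
print(ctrl.step_is_async())   # True: async resumes on the next decode step
```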