
Enable async scheduling for decode-only inference #4604

Draft
lmcafee-nvidia wants to merge 70 commits into NVIDIA:main from
lmcafee-nvidia:context-cpu-async-schedule-weekend

Conversation

@lmcafee-nvidia
Contributor

Summary

  • Add decode-only async scheduling support with full-iteration CUDA graph chaining.
  • Extend async decode coverage across GPT, hybrid, MoE/EP, and MTP paths.
  • Preserve graph-backed hybrid MTP hidden states so async CUDA graph replay can feed MTP sampling.

Tests

  • python -m py_compile megatron/core/inference/contexts/dynamic_context.py megatron/core/inference/text_generation_controllers/text_generation_controller.py megatron/core/models/hybrid/hybrid_model.py tests/unit_tests/inference/contexts/test_dynamic_context.py tests/unit_tests/inference/engines/test_dynamic_engine.py
  • git diff --check
  • /opt/venv/bin/python -m torch.distributed.run --nproc-per-node 8 -m pytest -q tests/unit_tests/inference/contexts/test_dynamic_context.py::TestDynamicContext::test_prepare_async_decode_next_step_preserves_request_source_of_truth
  • /opt/venv/bin/python -m torch.distributed.run --nproc-per-node 8 -m pytest -q -vv --tb=short tests/unit_tests/inference/engines/test_dynamic_engine.py::TestDynamicInferenceEngine::test_async_scheduling_decode_only_mtp_cuda_graph_e2e[gpt]
  • /opt/venv/bin/python -m torch.distributed.run --nproc-per-node 8 -m pytest -q -vv --tb=short tests/unit_tests/inference/engines/test_dynamic_engine.py::TestDynamicInferenceEngine::test_async_scheduling_decode_only_mtp_cuda_graph_e2e[hybrid]
  • /opt/venv/bin/python -m torch.distributed.run --nproc-per-node 8 -m pytest -q -vv --tb=short tests/unit_tests/inference/engines/test_dynamic_engine.py::TestDynamicInferenceEngineParallel::test_async_scheduling_decode_only_hybrid_moe_ep_mtp_cuda_graph_e2e
  • /opt/venv/bin/python -m torch.distributed.run --nproc-per-node 8 -m pytest -q -vv --tb=short tests/unit_tests/inference/engines/test_dynamic_engine.py::TestDynamicInferenceEngine::test_async_scheduling_decode_only_cuda_graph_e2e
  • /opt/venv/bin/python -m torch.distributed.run --nproc-per-node 8 -m pytest -q -vv --tb=short tests/unit_tests/inference/engines/test_dynamic_engine.py::TestDynamicInferenceEngineParallel::test_async_scheduling_decode_only_moe_ep_cuda_graph_e2e
  • ENABLE_ASYNC_SCHEDULING=1 /opt/venv/bin/python lawrence/benchmark.py --model nano-v3-mtp-300m --num-unique-prompts 1 --num-duplicates 1 --prompt synth 4 4 --num-tokens-to-generate 4 --cuda-graphs --num-speculative-tokens 2 --no-nsys --buffer-size-gb 1 --mamba-buffer-size-gb 1 --max-requests 8 --max-tokens 128

Plan 1: Add decode-only async scheduling for the GPT greedy path by gating eligible steps, replaying a captured sample/H2D/forward CUDA graph, chaining the next graph launch, and removing hidden timing synchronizations.

Commit 1: Add async decode scheduling scaffolding
- Added the config and command-line switch for async scheduling.
- Added context and controller state for decode-only async setup.
- Added focused tests for the new config and context behavior.
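The config-and-switch scaffolding described above can be sketched roughly as follows. The flag name, field name, and config class here are illustrative stand-ins, not the actual Megatron-LM API:

```python
import argparse
from dataclasses import dataclass


@dataclass
class InferenceConfig:
    # Hypothetical config field: decode-only async scheduling is off by default.
    enable_async_scheduling: bool = False


def parse_inference_args(argv):
    # Hypothetical command-line switch mirroring the config field.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--enable-async-scheduling",
        action="store_true",
        help="Overlap CPU scheduling with decode-only CUDA graph replay.",
    )
    args = parser.parse_args(argv)
    return InferenceConfig(enable_async_scheduling=args.enable_async_scheduling)
```

The switch defaults off so existing synchronous decode behavior is unchanged unless explicitly requested.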

Commit 2: Add async scheduling decode e2e coverage
- Added end-to-end decode coverage with async scheduling enabled.
- Checked that the engine still produces the expected decode output.
- Added test hooks for the async controller path.

Commit 3: Require full-iteration graphs for async scheduling
- Blocked async scheduling unless full-iteration CUDA graphs are enabled.
- Kept non-graph decode steps on the normal synchronous path.
- Tightened the async eligibility check so unsupported configurations fall back to the synchronous path.
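The stricter gate amounts to a predicate like the one below. This is a minimal sketch with assumed inputs, not the actual eligibility logic in the controller:

```python
def is_async_decode_eligible(
    is_decode_only: bool,
    uses_full_iteration_cuda_graph: bool,
    has_pending_prefill: bool,
) -> bool:
    # Async chaining is only safe when the entire iteration is replayed
    # from a captured full-iteration CUDA graph on a pure decode step;
    # anything else stays on the normal synchronous path.
    return (
        is_decode_only
        and uses_full_iteration_cuda_graph
        and not has_pending_prefill
    )
```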

Commit 4: Capture async decode graph replay
- Captured the async decode work so it can be replayed later.
- Stored the replay state needed by the controller.
- Added tests for the captured replay path.

Commit 5: Launch chained async decode graphs
- Launched the next async decode graph earlier in the step.
- Reduced the host gap between sampling and the next forward pass.
- Added tests that cover chained async graph launches.
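The chaining idea can be illustrated with a toy host-side simulation: the graph for step i+1 is enqueued before the host finishes reconciling step i, so the GPU never idles waiting on CPU bookkeeping. The graph objects here are stand-ins, not CUDA graphs:

```python
class FakeGraph:
    """Stand-in for a captured full-iteration decode graph."""
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def replay(self):
        self.log.append(f"replay:{self.name}")


def run_chained_decode(graphs, log):
    # Enqueue each step's graph first; host reconciliation of the previous
    # step happens *after* the next launch, overlapping with GPU work.
    pending = None
    for g in graphs:
        g.replay()
        if pending is not None:
            log.append(f"reconcile:{pending.name}")
        pending = g
    if pending is not None:
        log.append(f"reconcile:{pending.name}")
```

In the resulting event order, `replay:g1` precedes `reconcile:g0`, which is the host gap reduction this commit targets.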

Commit 6: Avoid single-rank CUDA sync in inference timing
- Changed single-rank timing to avoid a CUDA synchronization.
- Kept benchmark timing from breaking GPU and CPU overlap.
- Preserved the existing synchronized timing behavior where needed.
Plan 2: Extend async scheduling to hybrid MTP decode-only runs by sharing metadata preparation, enabling hybrid/Mamba plus MoE/EP graph-safe paths, and adding MTP async parity coverage.

Commit 1: Share decode metadata preparation
- Pulled shared decode metadata setup into reusable context code.
- Let sync and async decode use the same preparation logic.
- Removed duplicate metadata-building logic from the context path.

Commit 2: Enable async decode for hybrid and MoE
- Allowed async decode to run for hybrid and MoE model paths.
- Added the metadata handling those model paths need during decode.
- Added tests for hybrid and MoE async decode eligibility.

Commit 3: Enable async decode for MTP
- Added async decode handling for MTP generation.
- Wired hybrid model graph support needed by the MTP path.
- Added context and engine tests for MTP async decode behavior.
@copy-pr-bot

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Plan 3: Tighten the current F->S->H2D->F async path by avoiding sample-copy waits and moving the MTP launch-critical tail toward a prequeued graph packet without attempting full F->S->F.

Commit 1: Add async decode sample buffer ring
- Added a small ring of sample buffers for async decode.
- Avoided waiting on the previous sample copy before launching more work.
- Added tests that cover sample buffer rotation.
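The buffer ring can be sketched as below, with plain lists standing in for pinned/device sample tensors. The class name and sizing are illustrative:

```python
class SampleBufferRing:
    """Small ring of sample buffers so a new async decode step can launch
    without waiting on the previous sample copy to drain."""

    def __init__(self, num_buffers: int, buffer_size: int):
        self.buffers = [[0] * buffer_size for _ in range(num_buffers)]
        self.index = 0

    def next_buffer(self):
        # Hand out buffers round-robin; by the time a buffer is reused,
        # its earlier copy is assumed to have completed.
        buf = self.buffers[self.index]
        self.index = (self.index + 1) % len(self.buffers)
        return buf
```

A ring of even two buffers removes the wait, provided copies retire within one rotation.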

Commit 2: Capture async MTP forward graphs
- Captured MTP forward work for async graph replay.
- Reused captured MTP graphs in the async decode path.
- Added tests for async MTP graph capture and replay.

Commit 3: Defer MTP block release after async launch
- Delayed MTP block release until after the async launch uses needed state.
- Prevented async MTP work from losing blocks too early.
- Added test coverage for the deferred release path.
@lmcafee-nvidia force-pushed the context-cpu-async-schedule-weekend branch 4 times, most recently from ad82a9c to 88770c5 on May 4, 2026 at 18:23
Plan 4: Extend async scheduling across decode-only boundary conditions by speculating safe request finishes and KV-cache transitions, then reconciling CPU state after the speculative launch.

Commit 1: Preserve async forward rows across finishes
- Recorded the active request order for each speculative async forward.
- Sampled row-mapped pending logits correctly after finished requests were compacted.
- Added GPT finish-boundary coverage that compares async output to serial output.

Commit 2: Reserve KV blocks for async decode boundaries
- Reserved one-step-ahead KV blocks before speculative decode forwards that cross a block boundary.
- Adopted reserved blocks for continuing requests and deferred unused block release until the forward retired.
- Added context and GPT engine tests for boundary reservation, adoption, and deferred release.
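The reserve/adopt/deferred-release lifecycle can be sketched with a toy block pool. Method and field names are hypothetical, not the Megatron-LM KV-cache API:

```python
class KVBlockPool:
    """Toy pool illustrating one-step-ahead KV block reservation around a
    speculative decode forward that may cross a block boundary."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.reserved = {}  # request_id -> block reserved before launch

    def reserve_for(self, request_id: int) -> int:
        # Reserve a block *before* launching the speculative forward.
        block = self.free.pop()
        self.reserved[request_id] = block
        return block

    def adopt(self, request_id: int) -> int:
        # The request crossed the boundary: keep the reserved block.
        return self.reserved.pop(request_id)

    def release_unused(self, request_id: int) -> None:
        # The forward retired without crossing the boundary: return the
        # block only now, never while the forward is still in flight.
        self.free.append(self.reserved.pop(request_id))
```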

Commit 3: Cover async boundary reconciliation cases
- Exhaustively tested two-request boundary adoption and deferred release combinations.
- Added staggered request-admission coverage while async forwards are pending.
- Confirmed boundary and admission behavior still matches serial decode outputs.
Plan 5: Complete decode-only async lifecycle boundary reconciliation for add, pause, evict, and GPT/MTP finish without invalidating in-flight speculative forwards.

Commit 1: Add async lifecycle test harness.
- Add reusable lifecycle counters for finish, MTP finish, pause, evict, and add deferral.
- Add baseline tests that pin the current pause and MTP finish disable behavior.
- Add parity and pending-forward harness coverage for later lifecycle commits.

Commit 2: Defer request admission across pending async forwards.
- Hold new waiting requests out of the context until the pending async forward is consumed or rejected.
- Preserve the speculative forward for the old active decode set when staggered arrivals occur.
- Resume normal admission immediately after async reconciliation completes.
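The deferral rule reduces to: never mutate the active set while a speculative forward for the old active set is pending. A minimal sketch with assumed names:

```python
def admit_waiting(waiting, active, pending_async_forward: bool, capacity: int):
    """Admit waiting requests into the active set, unless a speculative
    async forward is still in flight for the current active set."""
    if pending_async_forward:
        # Defer: the in-flight forward was launched against the old active
        # set and must be consumed or rejected before the set changes.
        return
    while waiting and len(active) < capacity:
        active.append(waiting.pop(0))
```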

Commit 3: Complete GPT finish reconciliation.
- Accept pending async forwards when GPT requests finish by length or termination id.
- Discard extra sampled tokens for finished rows while preserving continuing rows.
- Cover mixed continue/finish row maps and all-finished cleanup.

Commit 4: Reconcile MTP finish boundaries.
- Reuse row-mapped pending MTP forwards when finished rows leave the active set.
- Allow MTP length and termination finishes without disabling async scheduling.
- Trim accepted speculative tokens at true termination boundaries.

Commit 5: Reconcile speculative pause boundaries.
- Keep async scheduling enabled for active rows while paused rows remain in the context.
- Prepare decode-only async metadata from the active slice after paused requests move left.
- Cover GPT and hybrid MTP pause token preservation and async continuation.

Commit 6: Reconcile speculative eviction boundaries.
- Clear stale paused token buffers when eviction drains the paused set.
- Preserve async pending forwards when evicted rows disappear from the active request order.
- Cover GPT and hybrid MTP eviction boundaries, KV release, and Mamba slot release.

Commit 7: Cover lifecycle boundary interactions.
- Combine add, pause, evict, finish, and KV block transitions in focused interaction tests.
- Verify GPT and hybrid MTP async outputs match serial outputs across staggered arrivals and memory pressure.
- Assert async scheduling remains active across all supported lifecycle boundaries.

Commit 8: Skip idle MTP termination sync.
- Avoid copying accepted MTP tokens to CPU when all active requests have token-id termination disabled.
- Keep accepted speculative-token termination handling for requests that do use a termination id.
- Add focused tests that guard the no-sync fast path and the termination-hit correctness path.

Commit 9: Gate async add barriers by capacity.
- Keep async decode chained while waiting requests cannot fit in the context.
- Only break the async chain when a waiting request can actually be admitted.
- Add focused coverage for the full-context waiting-request case.
Plan 8: Support generated-token logprobs and top-n logprobs without disabling async decode scheduling.

Commit 6: Avoid generated-logprob metadata scans on requests that never asked for logprobs.
- Request admission records whether any request needs generated logprob bookkeeping.
- The async decode hot path skips the extra metadata scan when no logprob request has been seen.
- A focused test covers the no-logprob fast gate and verifies logprob requests still enable the scan.

Commit 7: Skip sampling/logprob bookkeeping on async steps unless generated logprobs need it.
- Async and pending-forward steps without logprob requests now keep the original fast path.
- Generated-logprob requests still collect bookkeeping when logits are available for calculation.
- Focused tests cover the no-logprob skip path, launched-sample skip path, and generated-logprob collect path.
Plan 9: Support async decode for non-greedy sampling modes while preserving deterministic reconciliation.

Commit 1: Generalize async sampling without enabling non-greedy requests yet.
- Added a shared helper that samples into the next-step input buffer.
- Kept the greedy fast path for existing async decode graph launches.
- Forwarded async decode graph test config so graph-focused tests exercise the intended path.

Commit 2: Allow top-k requests to use async forward overlap.
- Removed the top-k eligibility block for decode-only async scheduling.
- Kept top-k off the greedy captured decode graph so random sampling uses eager sampling plus async forward.
- Added focused GPT tests that compare serial and async top-k output.
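Eager top-k sampling of the kind kept off the captured greedy graph can be sketched in a few lines. This is a generic reference implementation, not the project's sampling helper:

```python
import math
import random


def sample_top_k(logits, k, gen):
    # Keep the k highest logits, softmax over just those, then draw one
    # vocabulary index using the request's own generator (eager sampling
    # on the host path, followed by the async forward launch).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return gen.choices(top, weights=weights, k=1)[0]
```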

Commit 3: Allow top-p requests to use async forward overlap.
- Removed the top-p eligibility block for decode-only async scheduling.
- Routed top-p sampling through eager sampling and the existing async forward path.
- Added focused GPT tests that compare serial and async top-p output.

Commit 4: Make dynamic non-greedy RNG independent of batch scheduling.
- Added per-request CUDA generators for dynamic non-greedy sampling.
- Reused each request's generator across top-k, top-p, and speculative sampling helpers.
- Added a staggered-add test that catches RNG drift when async defers admission.
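The determinism property can be demonstrated with plain Python generators standing in for per-request CUDA generators: each request's draw sequence depends only on its own seed, not on how draws from different requests interleave. Names are illustrative:

```python
import random


class PerRequestSampler:
    """One seeded generator per request, so non-greedy sampling is
    reproducible regardless of batch composition or deferred admission."""

    def __init__(self, base_seed: int = 0):
        self.base_seed = base_seed
        self.generators = {}

    def sample(self, request_id: int, num_choices: int) -> int:
        # Derive a stable per-request seed; advancing one request's
        # generator never perturbs another's sequence.
        gen = self.generators.setdefault(
            request_id, random.Random(self.base_seed * 1_000_003 + request_id)
        )
        return gen.randrange(num_choices)
```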

Commit 5: Validate non-greedy async scheduling for hybrid MTP decode.
- Added hybrid MTP top-k and top-p parity tests for async decode.
- Checked that MTP async eligibility allows non-greedy sampling.
- Confirmed hybrid MTP uses async forward overlap without the greedy decode graph.
Plan 10: Make prompt and echo logprob requests compatible with async decode after prefill.

Commit 1: Stop treating completed prompt logprob work as a decode-time async blocker.
- Removed the decode-only async gate that rejected prompt logprob metadata after prefill.
- Kept non-decode steps ineligible so prompt logprobs are still collected during prefill.
- Updated the eligibility matrix for GPT and hybrid MTP prompt logprob requests.

Commit 2: Validate async decode resumes after prompt logprob prefill.
- Added a GPT parity test with prompt and generated logprobs enabled.
- Verified prompt logprobs are collected before decode-only async scheduling starts.
- Confirmed generated tokens and logprobs match the serial path after async resume.

Commit 3: Validate prompt top-N logprob compatibility with async decode.
- Added async GPT parity coverage for prompt and generated top-N logprobs.
- Verified prompt top-N results are collected during prefill before async decode resumes.
- Tightened the existing dynamic top-N test to avoid stochastic EOS and sampling mismatches.

Commit 4: Validate generate API prompt logprob compatibility with async decode.
- Added generate API parity coverage for prompt and generated logprobs with top-N results.
- Let the generate API test helper build a full-prompt-logit context when prompt logprobs are requested.
- Confirmed async decode resumes after prompt logprob prefill without using the unsupported decode graph path.
Plan 11: Allow nonzero inference logging intervals to use async scheduling by making logged steps explicit pipeline barriers instead of a global async disable.

Commit 1: Add a controller-level barrier for temporary async scheduling breaks.
- Added set and clear helpers for per-step async scheduling barriers.
- Stopped treating every nonzero logging interval as a global async disable.
- Added focused eligibility tests for logging intervals and barrier diagnostics.
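The set/clear barrier helpers amount to a small gate like the following sketch (class and method names assumed): a barrier breaks the async chain for the current step only, rather than flipping a global disable:

```python
class AsyncScheduleGate:
    """Per-step barrier for temporary async scheduling breaks, e.g. timed
    logging steps that must observe a synchronized pipeline."""

    def __init__(self, async_enabled: bool):
        self.async_enabled = async_enabled
        self._barrier = False

    def set_step_barrier(self):
        self._barrier = True

    def clear_step_barrier(self):
        self._barrier = False

    def can_run_async(self) -> bool:
        # Async resumes automatically once the barrier is cleared; a
        # nonzero logging interval no longer disables async globally.
        return self.async_enabled and not self._barrier
```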

Commit 2: Wire logging steps into temporary async scheduling barriers.
- Added engine-side barriers before and during timed logging steps.
- Let async scheduling resume on decode steps outside the logging barrier window.
- Added focused GPT, hybrid MTP, and interval-one tests for logging-enabled async decode.
