fix(perf): halve MBS on b300/gb300 MoE perf configs to fit HybridEP IB QP cap #3768
Closed
ko3n1g wants to merge 1 commit into
Conversation
NVIDIA/Megatron-LM PR #4094 (commit a08e259f32) added a Python-side guardrail in
megatron/core/transformer/moe/fused_a2a.py that rejects HybridEP dispatch when
3*num_tokens + 1 >= 65536 (the InfiniBand RDMA QP-depth hardware limit). The 9
b300/gb300 perf tests below were tripping it because their per-rank num_tokens
landed at 32768 (cap is 21844).
Halving micro_batch_size brings num_tokens down to 16384 — comfortably below the
cap and matching the geometry the gb200 variants already use (which still pass).
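To make the arithmetic concrete, here is a minimal sketch (plain Python, using only the formula and constants quoted above, not the Megatron-LM code) comparing the old and new per-rank token counts against the cap:

```python
# Guardrail condition from Megatron-LM PR #4094: dispatch is rejected when
# 3 * num_tokens + 1 >= 65536 (the IB RDMA QP-depth limit).
QP_DEPTH_LIMIT = 65536

for num_tokens in (32768, 16384):  # before / after halving MBS
    tx_depth = 3 * num_tokens + 1
    verdict = "ok" if tx_depth < QP_DEPTH_LIMIT else "rejected"
    print(f"num_tokens={num_tokens}: tx_depth={tx_depth} -> {verdict}")

# num_tokens=32768: tx_depth=98305 -> rejected
# num_tokens=16384: tx_depth=49153 -> ok
```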
Affected presets:
- NEMOTRON_3_NANO_PRETRAIN_CONFIG_{GB300,B300}_{BF16,FP8_MX,NVFP4}_V1 MBS 4 -> 2
- QWEN3_30B_A3B_PRETRAIN_CONFIG_{GB300,B300}_{BF16,FP8_CS,FP8_MX}_V1 MBS 8 -> 4
- QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_{BF16,FP8_CS,FP8_MX} MBS 8 -> 4
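As a purely hypothetical illustration of what the change amounts to (the real preset objects' structure isn't shown in this PR, so the dataclass below is an assumption, not the actual config API):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PerfPreset:
    micro_batch_size: int
    # ...other fields elided; this is a stand-in, not the real preset class.

# Hypothetical stand-in for QWEN3_30B_A3B_PRETRAIN_CONFIG_GB300_BF16_V1
qwen3_gb300_bf16 = PerfPreset(micro_batch_size=8)

# The fix in this PR halves MBS on each affected preset (8 -> 4 here,
# 4 -> 2 for the nemotron_3_nano presets):
patched = replace(qwen3_gb300_bf16, micro_batch_size=4)
assert patched.micro_batch_size == 4
```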
Note: golden values for the corresponding nemo-ci tests will need to be
re-baselined once these new configs run cleanly — that's a follow-up MR in
nemo-ci, not part of this PR.
Signed-off-by: oliver könig <okoenig@nvidia.com>
Light Review: LGTM — straightforward MBS reduction to stay under the HybridEP IB QP depth cap.
Minor nit (PR description only, not code): the table in the PR body lists qwen3_30b_a3b_64gpu_b300 and qwen3_30b_a3b_64gpu_gb300, but the actual configs have num_gpus=8. Looks like a copy-paste from the 235B configs.

Suggested test cases: all affected configs need golden-value re-baselining since MBS changed.
Claude summary
What

Halve micro_batch_size on the b300 / gb300 MoE perf base configs for nemotron_3_nano, qwen3_30b_a3b, and qwen3_vl_30b_a3b.

Why
NVIDIA/Megatron-LM PR #4094 (commit a08e259f32, 2026-05-08) added a deliberate Python-side guardrail in megatron/core/transformer/moe/fused_a2a.py.
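A minimal sketch of that guardrail, reconstructed from the condition stated in this PR (function and variable names are assumptions, not the actual Megatron-LM code):

```python
# Sketch of the HybridEP dispatch guardrail described above; the real check
# lives in megatron/core/transformer/moe/fused_a2a.py and may differ in shape.
IB_RDMA_QP_DEPTH_LIMIT = 65536  # constant quoted in this PR

def check_hybrid_ep_dispatch(num_tokens: int) -> None:
    # tx_depth must stay below the IB RDMA QP-depth hardware limit.
    tx_depth = 3 * num_tokens + 1
    if tx_depth >= IB_RDMA_QP_DEPTH_LIMIT:
        raise ValueError(
            f"HybridEP dispatch rejected: 3*{num_tokens}+1 = {tx_depth} "
            f">= {IB_RDMA_QP_DEPTH_LIMIT} (IB RDMA QP-depth limit)"
        )

check_hybrid_ep_dispatch(16384)  # passes
# check_hybrid_ep_dispatch(32768) would raise ValueError
```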
This means num_tokens ≤ 21844 per rank is now a hard constraint: 3*21845 + 1 = 65536 already trips the check, so the largest allowed value is (65536 - 2) // 3 = 21844.

Nine perf tests in our nightly fall on the wrong side of that cap — all with num_tokens = 32768 (tx_depth = 98305):

| Test | MBS (current) | seq_len | TP | CP | num_tokens |
| --- | --- | --- | --- | --- | --- |
| qwen3_30b_a3b_64gpu_b300_{bf16,fp8_mx} | 8 | – | – | – | 32768 |
| qwen3_30b_a3b_64gpu_gb300_{bf16,fp8_mx} | 8 | – | – | – | 32768 |
| qwen3_vl_30b_a3b_8gpu_gb300_bf16 | 8 | – | – | – | 32768 |
| nemotron_3_nano_8gpu_b300_{bf16,fp8_mx} | 4 | – | – | – | 32768 |
| nemotron_3_nano_8gpu_gb300_{bf16,fp8_mx} | 4 | – | – | – | 32768 |
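The table's columns suggest how num_tokens is derived from the batch geometry; the relation below is an assumption inferred from those columns (MBS, seq_len, TP, CP), not something the source states:

```python
def per_rank_num_tokens(mbs: int, seq_len: int, tp: int, cp: int) -> int:
    # Assumed relation only: tokens per rank scale with micro-batch size and
    # sequence length and are divided across TP and CP ranks. Under this
    # reading, halving MBS halves num_tokens (32768 -> 16384), which is
    # exactly the knob this PR turns.
    return (mbs * seq_len) // (tp * cp)
```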
gb200/b200variants of these models already use the lower MBS and are unaffected by the new check — that's the proof point that the lower-MBS geometry works for these recipes.Alternatives considered
Knobs that would bring num_tokens under the cap:
- Lower MBS: ✅ (this PR)
- TP=2
- CP=2

Test plan
- Run tests label triggers MBridge unit tests (--mbridge-ref); confirm they no longer trip the IB cap

Affected presets

See the list in the PR description above.
Related

- dl/joc/nemo-ci!2324
- dl/joc/nemo-ci#4037