
fix(perf): halve MBS on b300/gb300 MoE perf configs to fit HybridEP IB QP cap #3768

Closed
ko3n1g wants to merge 1 commit into main from ko3n1g/fix/hybridep-ib-cap-mbs-halve

Conversation

Contributor

@ko3n1g ko3n1g commented May 10, 2026

Claude summary

What

Halve micro_batch_size on the b300 / gb300 MoE perf base configs for nemotron_3_nano, qwen3_30b_a3b, and qwen3_vl_30b_a3b.

Why

NVIDIA/Megatron-LM PR #4094 (commit a08e259f32, 2026-05-08) added a deliberate Python-side guardrail in megatron/core/transformer/moe/fused_a2a.py:

# DeepEP calculates tx_depth = 3 * num_tokens + 1.
# InfiniBand strictly asserts tx_depth < 65536.
tx_depth = 3 * num_tokens + 1
if tx_depth >= 65536:
    raise ValueError(
        f"HybridEP RDMA Queue Pair depth ({tx_depth}) exceeds the InfiniBand "
        f"hardware limit of 65535. ..."
    )

This means num_tokens ≤ 21844 per rank is now a hard constraint ((65536 - 2) // 3 = 21844).
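
A minimal sanity-check sketch of that constraint. The per-rank token formula below is an assumption (micro_batch_size * seq_len / (TP * CP)) that matches the table below; the cap arithmetic itself mirrors the upstream check:

IB_TX_DEPTH_LIMIT = 65536                      # upstream asserts tx_depth < 65536
MAX_NUM_TOKENS = (IB_TX_DEPTH_LIMIT - 2) // 3  # 21844

def tokens_per_rank(micro_batch_size, seq_len, tp=1, cp=1):
    # Assumed per-rank token count; matches the values in the table below.
    return micro_batch_size * seq_len // (tp * cp)

def fits_ib_qp_cap(num_tokens):
    return 3 * num_tokens + 1 < IB_TX_DEPTH_LIMIT

assert not fits_ib_qp_cap(tokens_per_rank(8, 4096))  # 32768 tokens -> tx_depth 98305, trips the guardrail
assert fits_ib_qp_cap(tokens_per_rank(4, 4096))      # 16384 tokens -> tx_depth 49153, fine
assert MAX_NUM_TOKENS == 21844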

Nine perf tests in our nightly fall on the wrong side of that cap — all with num_tokens = 32768 (tx_depth = 98305):

Config                                     MBS (current)  seq_len  TP  CP  num_tokens  After fix
qwen3_30b_a3b_64gpu_b300_{bf16,fp8_mx}     8              4096     1   1   32768       16384
qwen3_30b_a3b_64gpu_gb300_{bf16,fp8_mx}    8              4096     1   1   32768       16384
qwen3_vl_30b_a3b_8gpu_gb300_bf16           8              4096     1   1   32768       16384
nemotron_3_nano_8gpu_b300_{bf16,fp8_mx}    4              8192     1   1   32768       16384
nemotron_3_nano_8gpu_gb300_{bf16,fp8_mx}   4              8192     1   1   32768       16384

The corresponding gb200 / b200 variants of these models already use the lower MBS and are unaffected by the new check — that's the proof point that the lower-MBS geometry works for these recipes.

Alternatives considered

All three options land at num_tokens = 16384; the trade-offs differ:

  • Halve MBS ✅ (this PR): smallest diff; preserves the EP=8/TP=1 perf-tuning point chosen for these configs; matches what the gb200 variants already do.
  • Set TP=2: only crosses the NVLink domain above 8 GPUs, so it would fit these single-node runs, but it shifts the workload geometry; less appropriate for an MoE recipe deliberately tuned at TP=1.
  • Set CP=2: a pure sequence-axis split, but a less-validated path for these models; risk of regression.

Test plan

  • The Run tests label triggers the MBridge unit tests
  • Re-run the affected internal nightly tests against this branch (via --mbridge-ref); confirm they no longer trip the IB cap
  • Re-baseline golden values for the 9 affected tests in nemo-ci (separate MR there)

Affected presets

nemotron_3_workload_base_configs.py:
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_BF16_V1     MBS 4 -> 2
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_FP8_MX_V1   (alias)
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_NVFP4_V1    (alias)
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_BF16_V1      MBS 4 -> 2
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_FP8_MX_V1    (alias)
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_NVFP4_V1     (alias)

qwen3_workload_base_configs.py:
  QWEN3_30B_A3B_PRETRAIN_CONFIG_GB300_BF16_V1       MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_CS_V1     MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_MX_V1     (alias)
  QWEN3_30B_A3B_PRETRAIN_CONFIG_B300_BF16_V1        MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_B300_FP8_CS_V1      MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_B300_FP8_MX_V1      (alias)

qwen3_vl_workload_base_configs.py:
  QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_BF16       MBS 8 -> 4
  QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_CS     MBS 8 -> 4
  QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_MX     (alias)

Related

  • Upstream: NVIDIA/Megatron-LM #4094 (the guardrail)
  • Internal triage MR: dl/joc/nemo-ci!2324
  • Internal bucket issue: dl/joc/nemo-ci#4037

fix(perf): halve MBS on b300/gb300 MoE perf configs to fit HybridEP IB QP cap

NVIDIA/Megatron-LM PR #4094 (commit a08e259f32) added a Python-side guardrail in
megatron/core/transformer/moe/fused_a2a.py that rejects HybridEP dispatch when
3*num_tokens + 1 >= 65536 (the InfiniBand RDMA QP-depth hardware limit). The 9
b300/gb300 perf tests below were tripping it because their per-rank num_tokens
landed at 32768 (cap is 21844).

Halving micro_batch_size brings num_tokens down to 16384 — comfortably below the
cap and matching the geometry the gb200 variants already use (which still pass).

Affected presets:
  - NEMOTRON_3_NANO_PRETRAIN_CONFIG_{GB300,B300}_{BF16,FP8_MX,NVFP4}_V1   MBS 4 -> 2
  - QWEN3_30B_A3B_PRETRAIN_CONFIG_{GB300,B300}_{BF16,FP8_CS,FP8_MX}_V1   MBS 8 -> 4
  - QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_{BF16,FP8_CS,FP8_MX}          MBS 8 -> 4

Note: golden values for the corresponding nemo-ci tests will need to be
re-baselined once these new configs run cleanly — that's a follow-up MR in
nemo-ci, not part of this PR.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Contributor

claude Bot commented May 10, 2026

Light Review

LGTM — straightforward MBS reduction to stay under the HybridEP IB QP depth cap.

Verified:

  • All alias configs (FP8_MX, NVFP4, FP8_CS) correctly inherit the new MBS via assignment — no stale values.
  • GB200/B200 variants already use MBS=2 (nemotron_3_nano) and MBS=4 (qwen3_30b_a3b), confirming the target values are valid.
  • Post-fix num_tokens = MBS * seq_len = 16384 gives tx_depth = 49153, well under the 65535 cap.
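
A quick arithmetic check of those numbers for both affected geometries, assuming num_tokens = MBS * seq_len at TP=1/CP=1 as in the configs above:

for mbs, seq_len in [(4, 4096), (2, 8192)]:  # qwen3(_vl)_30b_a3b and nemotron_3_nano after the fix
    num_tokens = mbs * seq_len               # 16384 in both cases
    tx_depth = 3 * num_tokens + 1            # 49153
    assert tx_depth < 65536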

Minor nit (PR description only, not code): The table in the PR body lists qwen3_30b_a3b_64gpu_b300 and qwen3_30b_a3b_64gpu_gb300, but the actual configs have num_gpus=8. Looks like a copy-paste from the 235B configs.

Suggested test cases:

All affected configs need golden-value re-baselining since MBS changed:

  • nemotron_3_nano_8gpu_gb300_bf16_perf
  • nemotron_3_nano_8gpu_gb300_fp8_mx_perf
  • nemotron_3_nano_8gpu_gb300_nvfp4_perf
  • nemotron_3_nano_8gpu_b300_bf16_perf
  • nemotron_3_nano_8gpu_b300_fp8_mx_perf
  • nemotron_3_nano_8gpu_b300_nvfp4_perf
  • qwen3_30b_a3b_8gpu_gb300_bf16_perf
  • qwen3_30b_a3b_8gpu_gb300_fp8_cs_perf
  • qwen3_30b_a3b_8gpu_gb300_fp8_mx_perf
  • qwen3_30b_a3b_8gpu_b300_bf16_perf
  • qwen3_30b_a3b_8gpu_b300_fp8_cs_perf
  • qwen3_30b_a3b_8gpu_b300_fp8_mx_perf
  • qwen3_vl_30b_a3b_8gpu_gb300_bf16_perf
  • qwen3_vl_30b_a3b_8gpu_gb300_fp8_cs_perf
  • qwen3_vl_30b_a3b_8gpu_gb300_fp8_mx_perf
