
fix(perf): halve MBS on b300/gb300 MoE perf configs to fit HybridEP IB QP cap #3768

Closed
ko3n1g wants to merge 1 commit into main from ko3n1g/fix/hybridep-ib-cap-mbs-halve

Conversation

Contributor

@ko3n1g ko3n1g commented May 10, 2026

Claude summary

What

Halve micro_batch_size on the b300 / gb300 MoE perf base configs for nemotron_3_nano, qwen3_30b_a3b, and qwen3_vl_30b_a3b.

Why

NVIDIA/Megatron-LM PR #4094 (commit a08e259f32, 2026-05-08) added a deliberate Python-side guardrail in megatron/core/transformer/moe/fused_a2a.py:

# DeepEP calculates tx_depth = 3 * num_tokens + 1.
# InfiniBand strictly asserts tx_depth < 65536.
tx_depth = 3 * num_tokens + 1
if tx_depth >= 65536:
    raise ValueError(
        f"HybridEP RDMA Queue Pair depth ({tx_depth}) exceeds the InfiniBand "
        f"hardware limit of 65535. ..."
    )

This means num_tokens ≤ 21844 per rank is now a hard constraint ((65536 - 2) // 3 = 21844).
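
A minimal sanity-check sketch of that constraint. The per-rank token formula below is an assumption (micro_batch_size * seq_len / (TP * CP)) that matches the table below; the cap arithmetic itself mirrors the upstream check:

IB_TX_DEPTH_LIMIT = 65536                      # upstream asserts tx_depth < 65536
MAX_NUM_TOKENS = (IB_TX_DEPTH_LIMIT - 2) // 3  # 21844

def tokens_per_rank(micro_batch_size, seq_len, tp=1, cp=1):
    # Assumed per-rank token count; matches the values in the table below.
    return micro_batch_size * seq_len // (tp * cp)

def fits_ib_qp_cap(num_tokens):
    return 3 * num_tokens + 1 < IB_TX_DEPTH_LIMIT

assert not fits_ib_qp_cap(tokens_per_rank(8, 4096))  # 32768 tokens -> tx_depth 98305, trips the guardrail
assert fits_ib_qp_cap(tokens_per_rank(4, 4096))      # 16384 tokens -> tx_depth 49153, fine
assert MAX_NUM_TOKENS == 21844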

Nine perf tests in our nightly fall on the wrong side of that cap — all with num_tokens = 32768 (tx_depth = 98305):

Config                                     MBS (current)  seq_len  TP  CP  num_tokens  After fix
qwen3_30b_a3b_64gpu_b300_{bf16,fp8_mx}     8              4096     1   1   32768       16384
qwen3_30b_a3b_64gpu_gb300_{bf16,fp8_mx}    8              4096     1   1   32768       16384
qwen3_vl_30b_a3b_8gpu_gb300_bf16           8              4096     1   1   32768       16384
nemotron_3_nano_8gpu_b300_{bf16,fp8_mx}    4              8192     1   1   32768       16384
nemotron_3_nano_8gpu_gb300_{bf16,fp8_mx}   4              8192     1   1   32768       16384

The corresponding gb200 / b200 variants of these models already use the lower MBS and are unaffected by the new check — that's the proof point that the lower-MBS geometry works for these recipes.

Alternatives considered

All three options land at num_tokens = 16384; the trade-offs differ:

  • Halve MBS ✅ (this PR): smallest diff; preserves the EP=8/TP=1 perf-tuning point chosen for these configs; matches what the gb200 variants already do.
  • Set TP=2: only crosses the NVLink domain above 8 GPUs, so it would fit these single-node runs, but it shifts the workload geometry; less appropriate for an MoE recipe deliberately tuned at TP=1.
  • Set CP=2: a pure sequence-axis split, but a less-validated path for these models; risk of regression.

Test plan

  • The Run tests label triggers the MBridge unit tests
  • Re-run the affected internal nightly tests against this branch (via --mbridge-ref); confirm they no longer trip the IB cap
  • Re-baseline golden values for the 9 affected tests in nemo-ci (separate MR there)

Affected presets

nemotron_3_workload_base_configs.py:
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_BF16_V1     MBS 4 -> 2
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_FP8_MX_V1   (alias)
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_GB300_NVFP4_V1    (alias)
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_BF16_V1      MBS 4 -> 2
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_FP8_MX_V1    (alias)
  NEMOTRON_3_NANO_PRETRAIN_CONFIG_B300_NVFP4_V1     (alias)

qwen3_workload_base_configs.py:
  QWEN3_30B_A3B_PRETRAIN_CONFIG_GB300_BF16_V1       MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_CS_V1     MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_MX_V1     (alias)
  QWEN3_30B_A3B_PRETRAIN_CONFIG_B300_BF16_V1        MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_B300_FP8_CS_V1      MBS 8 -> 4
  QWEN3_30B_A3B_PRETRAIN_CONFIG_B300_FP8_MX_V1      (alias)

qwen3_vl_workload_base_configs.py:
  QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_BF16       MBS 8 -> 4
  QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_CS     MBS 8 -> 4
  QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_FP8_MX     (alias)

Related

  • Upstream: NVIDIA/Megatron-LM #4094 (the guardrail)
  • Internal triage MR: dl/joc/nemo-ci!2324
  • Internal bucket issue: dl/joc/nemo-ci#4037

fix(perf): halve MBS on b300/gb300 MoE perf configs to fit HybridEP IB QP cap

NVIDIA/Megatron-LM PR #4094 (commit a08e259f32) added a Python-side guardrail in
megatron/core/transformer/moe/fused_a2a.py that rejects HybridEP dispatch when
3*num_tokens + 1 >= 65536 (the InfiniBand RDMA QP-depth hardware limit). The 9
b300/gb300 perf tests below were tripping it because their per-rank num_tokens
landed at 32768 (cap is 21844).

Halving micro_batch_size brings num_tokens down to 16384 — comfortably below the
cap and matching the geometry the gb200 variants already use (which still pass).

Affected presets:
  - NEMOTRON_3_NANO_PRETRAIN_CONFIG_{GB300,B300}_{BF16,FP8_MX,NVFP4}_V1   MBS 4 -> 2
  - QWEN3_30B_A3B_PRETRAIN_CONFIG_{GB300,B300}_{BF16,FP8_CS,FP8_MX}_V1   MBS 8 -> 4
  - QWEN3_VL_30B_A3B_PRETRAIN_CONFIG_GB300_{BF16,FP8_CS,FP8_MX}          MBS 8 -> 4

Note: golden values for the corresponding nemo-ci tests will need to be
re-baselined once these new configs run cleanly — that's a follow-up MR in
nemo-ci, not part of this PR.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Contributor

claude Bot commented May 10, 2026

Light Review

LGTM — straightforward MBS reduction to stay under the HybridEP IB QP depth cap.

Verified:

  • All alias configs (FP8_MX, NVFP4, FP8_CS) correctly inherit the new MBS via assignment — no stale values.
  • GB200/B200 variants already use MBS=2 (nemotron_3_nano) and MBS=4 (qwen3_30b_a3b), confirming the target values are valid.
  • Post-fix num_tokens = MBS * seq_len = 16384 gives tx_depth = 49153, well under the 65535 cap.
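
A quick arithmetic check of those numbers for both affected geometries, assuming num_tokens = MBS * seq_len at TP=1/CP=1 as in the configs above:

for mbs, seq_len in [(4, 4096), (2, 8192)]:  # qwen3(_vl)_30b_a3b and nemotron_3_nano after the fix
    num_tokens = mbs * seq_len               # 16384 in both cases
    tx_depth = 3 * num_tokens + 1            # 49153
    assert tx_depth < 65536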

Minor nit (PR description only, not code): The table in the PR body lists qwen3_30b_a3b_64gpu_b300 and qwen3_30b_a3b_64gpu_gb300, but the actual configs have num_gpus=8. Looks like a copy-paste from the 235B configs.

Suggested test cases:

All affected configs need golden-value re-baselining since MBS changed:

  • nemotron_3_nano_8gpu_gb300_bf16_perf
  • nemotron_3_nano_8gpu_gb300_fp8_mx_perf
  • nemotron_3_nano_8gpu_gb300_nvfp4_perf
  • nemotron_3_nano_8gpu_b300_bf16_perf
  • nemotron_3_nano_8gpu_b300_fp8_mx_perf
  • nemotron_3_nano_8gpu_b300_nvfp4_perf
  • qwen3_30b_a3b_8gpu_gb300_bf16_perf
  • qwen3_30b_a3b_8gpu_gb300_fp8_cs_perf
  • qwen3_30b_a3b_8gpu_gb300_fp8_mx_perf
  • qwen3_30b_a3b_8gpu_b300_bf16_perf
  • qwen3_30b_a3b_8gpu_b300_fp8_cs_perf
  • qwen3_30b_a3b_8gpu_b300_fp8_mx_perf
  • qwen3_vl_30b_a3b_8gpu_gb300_bf16_perf
  • qwen3_vl_30b_a3b_8gpu_gb300_fp8_cs_perf
  • qwen3_vl_30b_a3b_8gpu_gb300_fp8_mx_perf
