
llama31 405b gb200 nvfp4 no pg overlap#3713

Merged
malay-nagda merged 2 commits into main from malay/llama31_405b_gb200_nvfp4_no_pg_overlap_20260506
May 11, 2026

Conversation

@malay-nagda
Contributor

What does this PR do ?

WAR for Llama3.1 405B GB200 NVFP4 to avoid OOM

Changelog

    cfg.ddp.overlap_param_gather = False
    cfg.optimizer.overlap_param_gather = False
    cfg.comm_overlap.overlap_param_gather = False
    cfg.comm_overlap.align_param_gather = False

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@malay-nagda malay-nagda added performance/release Performance items related with NeMo release area:perf Performance optimizations and benchmarking r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. 26.04.01 docs-only With great power comes great responsibility. labels May 6, 2026
@claude
Contributor

claude Bot commented May 6, 2026

Light Code Review

  • Clean, targeted OOM workaround for Llama 3.1 405B GB200 NVFP4 V2. The disable_param_gather_overlap helper is reusable and placed at the right scope.
  • One potential issue: missing VR200 coverage. LLAMA31_405B_PRETRAIN_CONFIG_VR200_NVFP4_V2 is aliased to LLAMA31_405B_PRETRAIN_CONFIG_GB200_NVFP4_V2 (same 256-GPU / PP=16 / GBS=1536 parallelism config), so the memory profile is identical. However, llama31_405b_pretrain_config_vr200() does not call disable_param_gather_overlap for nvfp4 + v2, which means VR200 NVFP4 V2 runs will still OOM. If VR200 shares the OOM behavior, the same guard should be added there (lines 154-156 of llama31_llm_pretrain.py).
  • Minor note: setting cfg.ddp.overlap_param_gather and cfg.optimizer.overlap_param_gather in disable_param_gather_overlap is technically redundant, since comm_overlap.setup() overwrites both. Informational only.
  • Suggested test cases: llama31_405b_256gpu_gb200_nvfp4_v2_perf, llama31_405b_256gpu_vr200_nvfp4_v2_perf, llama31_405b_128gpu_gb200_nvfp4_v1_perf

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda force-pushed the malay/llama31_405b_gb200_nvfp4_no_pg_overlap_20260506 branch from 8c4c7fe to 151d3dc Compare May 6, 2026 17:28
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@claude
Contributor

claude Bot commented May 6, 2026

Light Code Review

Observations

  • Clean, targeted OOM workaround for Llama 3.1 405B GB200 NVFP4 V2. The disable_param_gather_overlap helper is reusable and placed at the right scope.
  • Missing VR200 coverage: LLAMA31_405B_PRETRAIN_CONFIG_VR200_NVFP4_V2 is aliased to LLAMA31_405B_PRETRAIN_CONFIG_GB200_NVFP4_V2 (same 256-GPU, PP=16, GBS=1536 config). However, llama31_405b_pretrain_config_vr200() does not call disable_param_gather_overlap for nvfp4+v2, so VR200 NVFP4 V2 runs will still OOM. Consider adding the same guard there (lines 154-156 of llama31_llm_pretrain.py).
  • Minor note: setting cfg.ddp.overlap_param_gather and cfg.optimizer.overlap_param_gather in disable_param_gather_overlap is technically redundant, since comm_overlap.setup() overwrites both from the comm_overlap values (comm_overlap.py:628-630). Informational only.

Suggested test cases

  • llama31_405b_256gpu_gb200_nvfp4_v2_perf (directly impacted)
  • llama31_405b_256gpu_vr200_nvfp4_v2_perf (aliased config, confirm VR200 needs same fix)
  • llama31_405b_128gpu_gb200_nvfp4_v1_perf (V1 regression check)

@malay-nagda malay-nagda changed the title Malay/llama31 405b gb200 nvfp4 no pg overlap 20260506 llama31 405b gb200 nvfp4 no pg overlap May 6, 2026
@malay-nagda malay-nagda requested review from dingqingy-nv and ko3n1g May 6, 2026 17:33
@malay-nagda
Contributor Author

/ok to test 151d3dc

@yaoyu-33 yaoyu-33 added the needs-review PR is ready for code review and waiting on a reviewer label May 7, 2026
    cfg.comm_overlap.tp_comm_overlap_cfg = comm_overlap_cfg
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap
    if precision == "nvfp4" and config_variant.lower() == "v2":
        disable_param_gather_overlap(cfg)
Contributor


@malay-nagda can you comment why need disable?

Contributor Author

@malay-nagda malay-nagda May 11, 2026


Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda requested a review from yaoyu-33 May 11, 2026 08:44
@yaoyu-33 yaoyu-33 added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels May 11, 2026
@yaoyu-33
Contributor

/ok to test a8a6ed4

@malay-nagda malay-nagda merged commit 8339cbb into main May 11, 2026
39 checks passed
@malay-nagda malay-nagda deleted the malay/llama31_405b_gb200_nvfp4_no_pg_overlap_20260506 branch May 11, 2026 18:14
