
llama31 405b gb200 nvfp4 no pg overlap#3713

Merged
malay-nagda merged 2 commits into main from malay/llama31_405b_gb200_nvfp4_no_pg_overlap_20260506
May 11, 2026

Conversation

@malay-nagda
Contributor

What does this PR do ?

WAR for Llama3.1 405B GB200 NVFP4 to avoid OOM

Changelog

    cfg.ddp.overlap_param_gather = False
    cfg.optimizer.overlap_param_gather = False
    cfg.comm_overlap.overlap_param_gather = False
    cfg.comm_overlap.align_param_gather = False

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@malay-nagda malay-nagda added performance/release Performance items related with NeMo release area:perf Performance optimizations and benchmarking r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. 26.04.01 docs-only With great power comes great responsibility. labels May 6, 2026
@claude
Contributor

claude Bot commented May 6, 2026

Light Code Review

  • Clean, targeted OOM workaround for Llama 3.1 405B GB200 NVFP4 V2. The disable_param_gather_overlap helper is reusable and placed at the right scope.
  • One potential issue: missing VR200 coverage. LLAMA31_405B_PRETRAIN_CONFIG_VR200_NVFP4_V2 is aliased to LLAMA31_405B_PRETRAIN_CONFIG_GB200_NVFP4_V2 (same 256-GPU / PP=16 / GBS=1536 parallelism config), so the memory profile is identical. However, llama31_405b_pretrain_config_vr200() does not call disable_param_gather_overlap for nvfp4 + v2, which means VR200 NVFP4 V2 runs will still OOM. If VR200 shares the OOM behavior, the same guard should be added there (lines 154-156 of llama31_llm_pretrain.py).
  • Minor note: setting cfg.ddp.overlap_param_gather and cfg.optimizer.overlap_param_gather in disable_param_gather_overlap is technically redundant, since comm_overlap.setup() overwrites both. Informational only.
  • Suggested test cases: llama31_405b_256gpu_gb200_nvfp4_v2_perf, llama31_405b_256gpu_vr200_nvfp4_v2_perf, llama31_405b_128gpu_gb200_nvfp4_v1_perf

Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda force-pushed the malay/llama31_405b_gb200_nvfp4_no_pg_overlap_20260506 branch from 8c4c7fe to 151d3dc Compare May 6, 2026 17:28
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@claude
Contributor

claude Bot commented May 6, 2026

Light Code Review

Observations

  • Clean, targeted OOM workaround for Llama 3.1 405B GB200 NVFP4 V2. The disable_param_gather_overlap helper is reusable and placed at the right scope.
  • Missing VR200 coverage: LLAMA31_405B_PRETRAIN_CONFIG_VR200_NVFP4_V2 is aliased to LLAMA31_405B_PRETRAIN_CONFIG_GB200_NVFP4_V2 (same 256-GPU, PP=16, GBS=1536 config). However, llama31_405b_pretrain_config_vr200() does not call disable_param_gather_overlap for nvfp4+v2, so VR200 NVFP4 V2 runs will still OOM. Consider adding the same guard there (lines 154-156 of llama31_llm_pretrain.py).
  • Minor note: setting cfg.ddp.overlap_param_gather and cfg.optimizer.overlap_param_gather in disable_param_gather_overlap is technically redundant, since comm_overlap.setup() overwrites both from the comm_overlap values (comm_overlap.py:628-630). Informational only.

Suggested test cases

  • llama31_405b_256gpu_gb200_nvfp4_v2_perf (directly impacted)
  • llama31_405b_256gpu_vr200_nvfp4_v2_perf (aliased config, confirm VR200 needs same fix)
  • llama31_405b_128gpu_gb200_nvfp4_v1_perf (V1 regression check)

@malay-nagda malay-nagda changed the title Malay/llama31 405b gb200 nvfp4 no pg overlap 20260506 llama31 405b gb200 nvfp4 no pg overlap May 6, 2026
@malay-nagda malay-nagda requested review from dingqingy-nv and ko3n1g May 6, 2026 17:33
@malay-nagda
Contributor Author

/ok to test 151d3dc

@yaoyu-33 yaoyu-33 added the needs-review PR is ready for code review and waiting on a reviewer label May 7, 2026
    cfg.comm_overlap.tp_comm_overlap_cfg = comm_overlap_cfg
    cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap
    if precision == "nvfp4" and config_variant.lower() == "v2":
        disable_param_gather_overlap(cfg)
Contributor


@malay-nagda can you comment why need disable?

Contributor Author

@malay-nagda malay-nagda May 11, 2026


Signed-off-by: Malay Nagda <malayn@nvidia.com>
@malay-nagda malay-nagda requested a review from yaoyu-33 May 11, 2026 08:44
@yaoyu-33 yaoyu-33 added ready-to-merge PR is approved, current, and only waiting for CI to pass before merge and removed needs-review PR is ready for code review and waiting on a reviewer labels May 11, 2026
@yaoyu-33
Contributor

/ok to test a8a6ed4

@malay-nagda malay-nagda merged commit 8339cbb into main May 11, 2026
39 checks passed
@malay-nagda malay-nagda deleted the malay/llama31_405b_gb200_nvfp4_no_pg_overlap_20260506 branch May 11, 2026 18:14
