llama31 405b gb200 nvfp4 no pg overlap #3713
Conversation
Light Code Review

- Clean, targeted OOM workaround for Llama 3.1 405B GB200 NVFP4 V2. The disable_param_gather_overlap helper is reusable and placed at the right scope.
- One potential issue: missing VR200 coverage. LLAMA31_405B_PRETRAIN_CONFIG_VR200_NVFP4_V2 is aliased to LLAMA31_405B_PRETRAIN_CONFIG_GB200_NVFP4_V2 (same 256-GPU / PP=16 / GBS=1536 parallelism config), so the memory profile is identical. However, llama31_405b_pretrain_config_vr200() does not call disable_param_gather_overlap for nvfp4 + v2, which means VR200 NVFP4 V2 runs will still OOM. If VR200 shares the OOM behavior, the same guard should be added there (lines 154-156 of llama31_llm_pretrain.py).
- Minor note: setting cfg.ddp.overlap_param_gather and cfg.optimizer.overlap_param_gather in disable_param_gather_overlap is technically redundant, since comm_overlap.setup() overwrites both. Informational only.
- Suggested test cases: llama31_405b_256gpu_gb200_nvfp4_v2_perf, llama31_405b_256gpu_vr200_nvfp4_v2_perf, llama31_405b_128gpu_gb200_nvfp4_v1_perf
Signed-off-by: Malay Nagda <malayn@nvidia.com>
Force-pushed 8c4c7fe to 151d3dc
Light Code Review

Observations

- Clean, targeted OOM workaround for Llama 3.1 405B GB200 NVFP4 V2. The disable_param_gather_overlap helper is reusable and placed at the right scope.
- Missing VR200 coverage: LLAMA31_405B_PRETRAIN_CONFIG_VR200_NVFP4_V2 is aliased to LLAMA31_405B_PRETRAIN_CONFIG_GB200_NVFP4_V2 (same 256-GPU, PP=16, GBS=1536 config). However, llama31_405b_pretrain_config_vr200() does not call disable_param_gather_overlap for nvfp4 + v2, so VR200 NVFP4 V2 runs will still OOM. Consider adding the same guard there (lines 154-156 of llama31_llm_pretrain.py).
- Minor note: setting cfg.ddp.overlap_param_gather and cfg.optimizer.overlap_param_gather in disable_param_gather_overlap is technically redundant, since comm_overlap.setup() overwrites both from the comm_overlap values (comm_overlap.py:628-630). Informational only.

Suggested test cases

- llama31_405b_256gpu_gb200_nvfp4_v2_perf (directly impacted)
- llama31_405b_256gpu_vr200_nvfp4_v2_perf (aliased config; confirm VR200 needs the same fix)
- llama31_405b_128gpu_gb200_nvfp4_v1_perf (V1 regression check)
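To make the redundancy note concrete, here is a hypothetical sketch of what a disable_param_gather_overlap helper of this shape could look like. The field names (cfg.comm_overlap, cfg.ddp, cfg.optimizer) mirror those mentioned in the review, but the actual recipe code in the PR may differ; the SimpleNamespace config is a stand-in for illustration only.

```python
from types import SimpleNamespace


def disable_param_gather_overlap(cfg):
    """Turn off distributed-optimizer parameter-gather overlap everywhere it
    appears in the config, so the run does not keep the extra gathered-parameter
    buffers that can trigger the OOM."""
    # Per the review, comm_overlap.setup() later overwrites the ddp/optimizer
    # flags from the comm_overlap values, so this is the field that matters.
    cfg.comm_overlap.overlap_param_gather = False
    # Technically redundant (see the minor note above), but harmless and explicit.
    cfg.ddp.overlap_param_gather = False
    cfg.optimizer.overlap_param_gather = False
    return cfg


# Minimal stand-in config to exercise the helper.
cfg = SimpleNamespace(
    comm_overlap=SimpleNamespace(overlap_param_gather=True),
    ddp=SimpleNamespace(overlap_param_gather=True),
    optimizer=SimpleNamespace(overlap_param_gather=True),
)
disable_param_gather_overlap(cfg)
print(cfg.comm_overlap.overlap_param_gather)  # False
```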
/ok to test 151d3dc
```python
cfg.comm_overlap.tp_comm_overlap_cfg = comm_overlap_cfg
cfg.comm_overlap.tp_comm_overlap = False if precision == "nvfp4" else cfg.comm_overlap.tp_comm_overlap
if precision == "nvfp4" and config_variant.lower() == "v2":
    disable_param_gather_overlap(cfg)
```
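The VR200 path would need the same conditional guard. A hedged sketch of that guard, factored into a small helper so both the GB200 and VR200 config functions can share it; the function name and the cfg layout here are assumptions based on the snippet above, not the actual file contents.

```python
from types import SimpleNamespace


def apply_nvfp4_v2_guard(cfg, precision, config_variant):
    """Apply the param-gather-overlap OOM workaround when the config is
    NVFP4 V2. Returns True if the workaround was applied."""
    if precision == "nvfp4" and config_variant.lower() == "v2":
        # Same workaround as the GB200 path: the VR200 config is aliased to
        # the same 256-GPU / PP=16 / GBS=1536 layout, so it needs it too.
        cfg.comm_overlap.overlap_param_gather = False
        return True
    return False


# Stand-in config demonstrating the guard.
cfg = SimpleNamespace(comm_overlap=SimpleNamespace(overlap_param_gather=True))
applied = apply_nvfp4_v2_guard(cfg, "nvfp4", "v2")
print(applied)  # True
```

A V1 config passed through the same helper is left untouched, which is what the suggested llama31_405b_128gpu_gb200_nvfp4_v1_perf regression check would confirm.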
Signed-off-by: Malay Nagda <malayn@nvidia.com>
|
/ok to test a8a6ed4
What does this PR do ?
Workaround (WAR) for Llama 3.1 405B GB200 NVFP4 to avoid OOM.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information