[performance] feat: add 405B B200/B300 V2 aliases + 405B GB200 NVFP4 256x gpu scale expendable segments addition due to CUDA OOM issue #3759

Merged
ko3n1g merged 3 commits into main from rsalagame_b300_b200_init_gb200_405b_nvfp4_256gpus
May 11, 2026

Conversation

@rsalagame-nvidia
Contributor

Summary

Bring 405B B200/B300 V2 pretrain configs to parity with GB200/GB300/H100/VR200 on main. Currently:

  • LLAMA31_405B_*_GB200/GB300/H100_*_V2 aliases exist in llama31_workload_base_configs.py and are re-exported.
  • LLAMA31_405B_*_B200/B300_*_V2 aliases don't exist at all on main, even though the runtime lookup getattr(configs.llama, "LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2") is performed for B200/B300 hosts. The lookup returns None, silently falls back to V1, and produces NaN gradients on B200/B300 405B runs.

(Note: LLAMA3_70B_*_B200/B300_*_V2 is fine on main — full V1/V2 coverage. The gap is purely at 405B.)
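The failure mode described above can be sketched as follows. This is a hypothetical simplification of the lookup helper (the real one lives in utils/utils.py); the class and return values here are illustrative stand-ins, not the actual configs.

```python
# Hypothetical sketch of the V2 lookup fallback described above.
# _Configs stands in for the configs.llama module namespace.
class _Configs:
    LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS = "v1-config"
    # Before this PR, the matching *_V2 attribute did not exist for B200/B300.

def get_workload_base_config(configs, name, variant):
    suffix = "" if variant == "v1" else f"_{variant.upper()}"
    cfg = getattr(configs, name + suffix, None)
    if cfg is None:
        # Silent fallback to V1 -- the behavior this PR fixes.
        cfg = getattr(configs, name)
    return cfg

cfg = get_workload_base_config(
    _Configs, "LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS", "v2"
)
print(cfg)  # falls back to "v1-config" because the V2 attribute is missing
```

Because nothing raises, the V1 config is used on B200/B300 hosts without any warning, which is why the symptom surfaced only as NaN gradients at runtime.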

Changes

This is a 2-file change on main, mirroring how VR200 V2 is already done in this file and matching what llmb-r0.4.0 already has:

  • scripts/performance/configs/llama/llama31_workload_base_configs.py: define 8 V2 aliases (B*_V2 = GB*_V2) after the existing VR200 V2 block, and add 8 matching strings to that file's __all__.
  • scripts/performance/configs/llama/__init__.py: import the 8 V2 names interleaved alphabetically with the V1 entries, and add 8 matching strings to __all__ between GB300_NVFP4_V2 and H100_BF16_V2.

No changes to llama3_workload_base_configs.py, lookup logic in utils/utils.py, or any 70B/8B entries.

Why "B*_V2 = GB*_V2" aliases (not independent definitions)?

Same pattern already used for VR200 V2 in this file. B200/B300 share their tuning targets with GB200/GB300 V2 at this scale (num_gpus=256, GBS=1536); a thin alias keeps the config surface in lock-step. This also matches what landed on llmb-r0.4.0 for the same gap.

Test plan

  • Both files byte-compile cleanly (python -m py_compile)
  • AST scan confirms all 8 expected V2 names are defined and listed in __all__ in llama31_workload_base_configs.py, and imported and listed in __all__ in __init__.py
  • On a CUDA host: python -c "from configs.llama import LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2; print(LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2)" from scripts/performance/ prints a WorkloadBaseConfig (the GB200 V2 alias) instead of ImportError
  • 405B B200 FP8_CS V2 pretrain run no longer silently falls back to V1 / produces NaN gradients
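The AST scan in the test plan can be done along these lines (a self-contained sketch, not the exact script used; the inline source string stands in for reading the real file):

```python
import ast

# Stand-in for open("llama31_workload_base_configs.py").read().
source = '''
LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2 = LLAMA31_405B_PRETRAIN_CONFIG_GB200_FP8_CS_V2
__all__ = ["LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2"]
'''

defined, exported = set(), set()
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name):
                if target.id == "__all__":
                    # __all__ is a list literal of string constants.
                    exported.update(
                        elt.value
                        for elt in node.value.elts
                        if isinstance(elt, ast.Constant)
                    )
                else:
                    defined.add(target.id)

expected = {"LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2"}
print(expected <= defined and expected <= exported)  # True
```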

… configs.llama

Signed-off-by: Rahul Salagame <rsalagame@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…raints on GB200. Performance remains same.

Signed-off-by: Rahul Salagame <rsalagame@nvidia.com>
@rsalagame-nvidia changed the title from "[performance] feat: add 405B B200/B300 V2 aliases + re-export them in…" to "[performance] feat: add 405B B200/B300 V2 aliases + 405B GB200 NVFP4 256x gpu scale expendable segments addition due to CUDA OOM issue" on May 8, 2026
@claude
Contributor

claude Bot commented May 8, 2026

LGTM - clean minimal change adding 8 V2 aliases for B200/B300 following the existing VR200 pattern. No bugs or issues found. No perf tests impacted. Consider extending test_llama31_405b_perf_config_instantiation to cover B200 and B300 with config_variant v2.

@claude
Contributor

claude Bot commented May 8, 2026

Detailed notes: The 8 new aliases (B200_V2 = GB200_V2, B300_V2 = GB300_V2) mirror the VR200 V2 pattern on lines 291-293 of llama31_workload_base_configs.py. Both __all__ lists and the __init__.py imports are consistent. The existing llama31_405b_pretrain_config_b200/b300 functions already accept config_variant="v2" via the getattr lookup in get_workload_base_config, so this fix ensures that lookup returns the correct config instead of None (which caused the silent V1 fallback and NaN gradients). Suggested test cases: no perf tests are impacted, since only aliases were added. However, test_llama31_405b_perf_config_instantiation only covers H100 today; extending it to B200/B300 V2 would catch this class of regression.

Contributor

@ko3n1g ko3n1g left a comment


@malay-nagda these are not tracked configs? Will we need to adjust our internal CI to make sure it continues running v1?

Comment thread on scripts/performance/perf_plugins.py (outdated)
@malay-nagda
Contributor

@malay-nagda these are not tracked configs? Will we need to adjust our internal CI to make sure it continues running v1?

We do not run anything, v1 or v2, for 405B on either B200 or B300 in CI. So, no need to change anything in CI.

Signed-off-by: Rahul Salagame <rsalagame@nvidia.com>
@ko3n1g added the docs-only label ("With great power comes great responsibility.") May 11, 2026
@ko3n1g
Contributor

ko3n1g commented May 11, 2026

/ok to test 1aec6cd

@ko3n1g ko3n1g merged commit 2461340 into main May 11, 2026
38 checks passed
@ko3n1g ko3n1g deleted the rsalagame_b300_b200_init_gb200_405b_nvfp4_256gpus branch May 11, 2026 09:26
