Conversation
… configs.llama Signed-off-by: Rahul Salagame <rsalagame@nvidia.com>
…raints on GB200. Performance remains same. Signed-off-by: Rahul Salagame <rsalagame@nvidia.com>
LGTM - clean, minimal change adding 8 V2 aliases for B200/B300, following the existing VR200 pattern. No bugs or issues found. No perf tests impacted. Consider extending `test_llama31_405b_perf_config_instantiation` to cover B200 and B300 with `config_variant="v2"`.
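A hedged sketch of the suggested test extension. The test name comes from the comment above; the config-function names, signatures, and return values are hypothetical stand-ins, not the repository's actual test code:

```python
# Hypothetical stand-ins for the real config builders under
# scripts/performance; the real ones return a WorkloadBaseConfig.
def llama31_405b_pretrain_config_b200(config_variant="v1"):
    return {"platform": "b200", "variant": config_variant}

def llama31_405b_pretrain_config_b300(config_variant="v1"):
    return {"platform": "b300", "variant": config_variant}

def test_llama31_405b_perf_config_instantiation():
    # Covering both platforms and both variants would catch a missing V2
    # alias, which surfaces as a None config (the silent V1 fallback).
    for config_fn in (llama31_405b_pretrain_config_b200,
                      llama31_405b_pretrain_config_b300):
        for config_variant in ("v1", "v2"):
            cfg = config_fn(config_variant=config_variant)
            assert cfg is not None, (
                f"{config_fn.__name__} returned None for {config_variant}"
            )

test_llama31_405b_perf_config_instantiation()
```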
Detailed notes: The 8 new aliases (`B200_V2 = GB200_V2`, `B300_V2 = GB300_V2`) mirror the VR200 V2 pattern on lines 291-293 of `llama31_workload_base_configs.py`. Both `__all__` lists and the `__init__.py` imports are consistent. The existing `llama31_405b_pretrain_config_b200/b300` functions already accept `config_variant="v2"` via the `get_workload_base_config` `getattr` lookup, so this fix ensures that lookup returns the correct config instead of `None` (which caused a silent V1 fallback and NaN gradients). Suggested test cases: no perf tests are impacted since only aliases were added. However, `test_llama31_405b_perf_config_instantiation` only covers H100 today; extending it to B200/B300 V2 would catch this class of regression.
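The silent-fallback mechanism described in these notes can be sketched as follows. The attribute names come from the PR; the lookup helper is a simplified stand-in, not the actual `utils/utils.py` code:

```python
import types

# Stand-in for the configs.llama module on main: V1 present, V2 alias absent.
llama = types.SimpleNamespace(
    LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS="v1-config",
    # LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2 is NOT defined on main.
)

def get_workload_base_config(module, name, config_variant):
    """Simplified lookup: getattr with a None default hides the miss."""
    attr = name if config_variant == "v1" else f"{name}_{config_variant.upper()}"
    return getattr(module, attr, None)  # None when the alias is absent

cfg = get_workload_base_config(
    llama, "LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS", "v2"
)
if cfg is None:
    # Silent fallback to V1: the run proceeds with the wrong tuning,
    # which is what produced NaN gradients on B200/B300 405B runs.
    cfg = get_workload_base_config(
        llama, "LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS", "v1"
    )
print(cfg)  # prints "v1-config" even though v2 was requested
```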
ko3n1g
left a comment
@malay-nagda these are not tracked configs? Will we need to adjust our internal CI to make sure it continues running v1?
We do not run anything, V1 or V2, for 405B on either B200 or B300 in CI, so there is no need to change anything in CI.
Signed-off-by: Rahul Salagame <rsalagame@nvidia.com>
/ok to test 1aec6cd
Summary
Bring 405B B200/B300 V2 pretrain configs to parity with GB200/GB300/H100/VR200 on `main`. Currently:

- `LLAMA31_405B_*_GB200/GB300/H100_*_V2` aliases exist in `llama31_workload_base_configs.py` and are re-exported.
- `LLAMA31_405B_*_B200/B300_*_V2` aliases don't exist at all on `main`, even though the runtime lookup `getattr(configs.llama, "LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2")` is performed for B200/B300 hosts. The lookup returns `None`, silently falls back to V1, and produces NaN gradients on B200/B300 405B runs.

(Note:
`LLAMA3_70B_*_B200/B300_*_V2` is fine on `main`: full V1/V2 coverage. The gap is purely at 405B.)

Changes
This is a 2-file change on `main`, mirroring how VR200 V2 is already done in this file and matching what `llmb-r0.4.0` already has:

- `scripts/performance/configs/llama/llama31_workload_base_configs.py`: define 8 V2 aliases (`B*_V2 = GB*_V2`) after the existing VR200 V2 block, and add 8 matching strings to that file's `__all__`.
- `scripts/performance/configs/llama/__init__.py`: import the 8 V2 names interleaved alphabetically with the V1 entries, and add 8 matching strings to `__all__` between `GB300_NVFP4_V2` and `H100_BF16_V2`.

No changes to
`llama3_workload_base_configs.py`, the lookup logic in `utils/utils.py`, or any 70B/8B entries.

Why "B*_V2 = GB*_V2" aliases (not independent definitions)?
Same pattern already used for VR200 V2 in this file. B200/B300 share their tuning targets with GB200/GB300 V2 at this scale (`num_gpus=256`, `GBS=1536`); a thin alias keeps the config surface in lock-step. This also matches what landed on `llmb-r0.4.0` for the same gap.

Test plan
- Both modified files compile (`python -m py_compile`)
- The 8 new aliases are defined and listed in `__all__` in `llama31_workload_base_configs.py`, and imported and listed in `__all__` in `__init__.py`
- `python -c "from configs.llama import LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2; print(LLAMA31_405B_PRETRAIN_CONFIG_B200_FP8_CS_V2)"` from `scripts/performance/` prints a `WorkloadBaseConfig` (the GB200 V2 alias) instead of `ImportError`