[MXFP8] Update param buffer before AG in eval #3727
Conversation
Signed-off-by: qiyuw <qiyuw@nvidia.com>
/ok to test 1f1353f

@WanZzzzzz feels like it could use some tests to make sure the behavior is correct
```python
if energy_monitor is not None:
    energy_monitor.pause()
timers("interval-time").stop()
if config.optimizer.reuse_grad_buf_for_mxfp8_param_ag and config.ddp.overlap_param_gather:
```
codecov is failing. Please add a test.
One example for a recipe test https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/tests/functional_tests/test_groups/recipes/test_deepseek_recipes_pretrain.py#L55-L56
@yaoyu-33 @gautham-kollu The new MCore tests explicitly simulate train -> eval -> train with overlap_param_gather enabled and compare both train and eval loss against the non-param-gather reference: MXFP8: tests/unit_tests/test_fp8_param.py::TestFP8Param::test_mxfp8_eval_transition

/ok to test f71fe72

Training code coverage is tested in MCore.

Test code is here:
What does this PR do?
Since we do a forced-sync param AG in eval, we need to copy the main params into the param buffer before the AG to ensure we gather the updated weights.
A related fix in Megatron-LM (NVIDIA/Megatron-LM#4563): also in eval, after the param AG, we need to copy the values in the param buffer back into the model weights (param.data) to update them. This is required for mxfp8 + fp8-param-gather + reuse-grad-buf-for-mxfp8-param-ag. If we do not do this, the eval step uses stale weights (those from the last iteration).
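The before/after ordering described above can be sketched with a toy, self-contained example; all class and function names here are hypothetical stand-ins, not the actual Megatron-LM API, and plain Python lists stand in for distributed tensors.

```python
# Hypothetical sketch of the eval-time ordering this PR fixes.
# Names are illustrative; lists stand in for distributed tensors.

class TinyDistributedParam:
    """Optimizer-updated main params next to a (stale) param buffer
    and (stale) model weights, mimicking the reuse-grad-buf setup."""
    def __init__(self, updated_main_params):
        self.main_params = updated_main_params            # updated by optimizer
        self.param_buffer = [0.0] * len(updated_main_params)   # stale
        self.model_weights = [0.0] * len(updated_main_params)  # stale

    def copy_main_to_buffer(self):
        # Step 1 (this PR): refresh the buffer BEFORE the AG, so the
        # forced-sync gather sees the updated weights.
        self.param_buffer = list(self.main_params)

    def all_gather(self):
        # Step 2: forced-sync param AG; a no-op stand-in here, since
        # this toy "rank" already holds the full buffer.
        pass

    def copy_buffer_to_weights(self):
        # Step 3 (Megatron-LM #4563): copy gathered values back into
        # param.data, otherwise eval runs with last iteration's weights.
        self.model_weights = list(self.param_buffer)

def eval_param_sync(p):
    p.copy_main_to_buffer()
    p.all_gather()
    p.copy_buffer_to_weights()

p = TinyDistributedParam([1.5, -0.25])
eval_param_sync(p)
print(p.model_weights)  # updated values, not the stale zeros
```

Skipping either the pre-AG copy or the post-AG copy-back would leave `model_weights` at the stale zeros, which is exactly the eval-loss mismatch the tests check for.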
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information