
[MXFP8] Update param buffer before AG in eval #3727

Merged
cuichenx merged 3 commits into NVIDIA-NeMo:main from WanZzzzzz:mxfp8-eval-fix on May 8, 2026
Conversation

@WanZzzzzz
Contributor

What does this PR do?

Since we do a forced-sync param all-gather (AG) in eval, we need to copy the main params into the param buffer before the AG to ensure we gather the updated weights.
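
A minimal sketch of that ordering, assuming hypothetical helper names (copy_main_params_to_param_buffer and the force_sync keyword are illustrative stand-ins, not the exact MBridge/MCore API):

def sync_params_for_eval(optimizer, ddp_model):
    # Sketch only: helper names below are assumed, not the actual API.
    # Copy the freshly updated main (high-precision) params into the
    # param buffer that the all-gather reads from; otherwise the forced
    # sync below would gather last iteration's weights.
    optimizer.copy_main_params_to_param_buffer()  # hypothetical helper
    # Forced synchronous param all-gather before evaluation starts.
    ddp_model.start_param_sync(force_sync=True)  # assumed signature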

A related fix in Megatron-LM (NVIDIA/Megatron-LM#4563): also in eval, after the param AG, we need to copy the values in the param buffer back into the model weights (param.data) to update them. This is required for mxfp8 + fp8-param-gather + reuse-grad-buf-for-mxfp8-param-ag; if we do not, the eval step uses stale weights (those from the last iteration).
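
And a sketch of the Megatron-LM side of the fix; the param_to_buffer_view mapping is a hypothetical stand-in for however the buffer shards are tracked, not the actual #4563 diff:

# Sketch only: after the eval-time param AG, copy the gathered values
# back into the model weights. With mxfp8 + fp8-param-gather +
# reuse_grad_buf_for_mxfp8_param_ag, skipping this leaves param.data
# holding the previous iteration's weights.
for param, buffer_view in param_to_buffer_view.items():  # assumed mapping
    param.data.copy_(buffer_view.view_as(param.data))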

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: qiyuw <qiyuw@nvidia.com>
@Phlip79
Member

Phlip79 commented May 7, 2026

/ok to test 1f1353f

@Phlip79 Phlip79 requested a review from cuichenx May 7, 2026 01:52
@yaoyu-33 yaoyu-33 added the area:training and needs-review labels May 7, 2026
@yaoyu-33
Contributor

yaoyu-33 commented May 7, 2026

@WanZzzzzz feels like it could use some tests to make sure the behavior is correct

@yaoyu-33 yaoyu-33 added the waiting-on-customer label and removed the needs-review label May 7, 2026
if energy_monitor is not None:
    energy_monitor.pause()
timers("interval-time").stop()
if config.optimizer.reuse_grad_buf_for_mxfp8_param_ag and config.ddp.overlap_param_gather:
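    # Body elided in this excerpt. Per the PR description, this branch
    # copies the updated main params into the param buffer before the
    # forced-sync param all-gather, so eval gathers the current weights.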

@WanZzzzzz
Contributor Author

@yaoyu-33 @gautham-kollu
I added coverage in this MCore PR instead of an MBridge recipe test because the behavior being fixed is in the MCore post-AG processing path; this fix in MBridge just ensures the MCore code path works correctly. I also looked at the existing MBridge recipe tests, and they don't currently expose the DDP knobs needed to exercise this bug (specifically overlap_param_gather, fp8_param_gather/fp4_param_gather, and reuse_grad_buf_for_mxfp8_param_ag) in a train -> eval -> train flow.

The new MCore tests explicitly simulate train -> eval -> train with overlap_param_gather enabled and compare both train and eval loss against the non-param-gather reference:

  • MXFP8: tests/unit_tests/test_fp8_param.py::TestFP8Param::test_mxfp8_eval_transition
  • NVFP4: tests/unit_tests/test_fp4_param.py::TestFP4Param::test_nvfp4_eval_transition
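
For reference, a rough outline of what such a transition test asserts; build_model_and_optimizer and run_step are hypothetical stand-ins for the MCore test utilities, not the actual test code:

import torch

# Sketch of a train -> eval -> train check against a reference model that
# does not overlap the param gather; helper names are assumed.
def test_eval_transition_matches_reference():
    ref_model, ref_opt = build_model_and_optimizer(overlap_param_gather=False)
    model, opt = build_model_and_optimizer(
        overlap_param_gather=True,
        fp8_param_gather=True,
        reuse_grad_buf_for_mxfp8_param_ag=True,
    )
    for _ in range(2):
        assert torch.allclose(run_step(model, opt, train=True),
                              run_step(ref_model, ref_opt, train=True))
        # The eval step right after an optimizer update is where stale
        # weights would surface if the param buffer were not refreshed.
        assert torch.allclose(run_step(model, opt, train=False),
                              run_step(ref_model, ref_opt, train=False))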

@gautham-kollu
Contributor

/ok to test f71fe72

@cuichenx cuichenx added the ready-to-merge label May 8, 2026
@cuichenx cuichenx enabled auto-merge (squash) May 8, 2026 18:28
@cuichenx cuichenx disabled auto-merge May 8, 2026 23:00
@cuichenx
Contributor

cuichenx commented May 8, 2026

Training code coverage is tested in MCore.

@cuichenx cuichenx merged commit 71c63da into NVIDIA-NeMo:main May 8, 2026
94 of 95 checks passed
@WanZzzzzz
Contributor Author

Test code is here:

NVIDIA/Megatron-LM#4563
