
[MXFP8] Update param buffer before AG in eval #3727

Merged
cuichenx merged 3 commits into NVIDIA-NeMo:main from WanZzzzzz:mxfp8-eval-fix on May 8, 2026
Conversation

@WanZzzzzz
Contributor

What does this PR do?

Since we do a forced-sync param all-gather (AG) in eval, we need to copy the main params into the param buffer before the AG to ensure we gather the updated weights.
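
A minimal sketch of that ordering, assuming hypothetical helper names (copy_main_params_to_param_buffer and the force_sync keyword are illustrative stand-ins, not the exact MBridge/MCore API):

def sync_params_for_eval(optimizer, ddp_model):
    # Sketch only: helper names below are assumed, not the actual API.
    # Copy the freshly updated main (high-precision) params into the
    # param buffer that the all-gather reads from; otherwise the forced
    # sync below would gather last iteration's weights.
    optimizer.copy_main_params_to_param_buffer()  # hypothetical helper
    # Forced synchronous param all-gather before evaluation starts.
    ddp_model.start_param_sync(force_sync=True)  # assumed signature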

A related fix in Megatron-LM (NVIDIA/Megatron-LM#4563): also in eval, after the param AG, we need to copy the values in the param buffer back into the model weights (param.data) to update them. This is required for mxfp8 + fp8-param-gather + reuse-grad-buf-for-mxfp8-param-ag; if we do not, the eval step uses stale weights (those from the last iteration).
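
And a sketch of the Megatron-LM side of the fix; the param_to_buffer_view mapping is a hypothetical stand-in for however the buffer shards are tracked, not the actual #4563 diff:

# Sketch only: after the eval-time param AG, copy the gathered values
# back into the model weights. With mxfp8 + fp8-param-gather +
# reuse_grad_buf_for_mxfp8_param_ag, skipping this leaves param.data
# holding the previous iteration's weights.
for param, buffer_view in param_to_buffer_view.items():  # assumed mapping
    param.data.copy_(buffer_view.view_as(param.data))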

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: qiyuw <qiyuw@nvidia.com>
@Phlip79
Member

Phlip79 commented May 7, 2026

/ok to test 1f1353f

@Phlip79 Phlip79 requested a review from cuichenx May 7, 2026 01:52
@yaoyu-33 yaoyu-33 added the area:training and needs-review labels May 7, 2026
@yaoyu-33
Contributor

yaoyu-33 commented May 7, 2026

@WanZzzzzz feels like it could use some tests to make sure the behavior is correct

@yaoyu-33 yaoyu-33 added the waiting-on-customer label and removed the needs-review label May 7, 2026
if energy_monitor is not None:
    energy_monitor.pause()
timers("interval-time").stop()
if config.optimizer.reuse_grad_buf_for_mxfp8_param_ag and config.ddp.overlap_param_gather:
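    # Body elided in this excerpt. Per the PR description, this branch
    # copies the updated main params into the param buffer before the
    # forced-sync param all-gather, so eval gathers the current weights.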

@WanZzzzzz
Contributor Author

@yaoyu-33 @gautham-kollu
I added coverage in this MCore PR instead of an MBridge recipe test because the behavior being fixed is in the MCore post-AG processing path; this fix in MBridge just ensures the MCore code path works correctly. I also looked at the existing MBridge recipe tests, and they don't currently expose the DDP knobs needed to exercise this bug (specifically overlap_param_gather, fp8_param_gather/fp4_param_gather, and reuse_grad_buf_for_mxfp8_param_ag) in a train -> eval -> train flow.

The new MCore tests explicitly simulate train -> eval -> train with overlap_param_gather enabled and compare both train and eval loss against the non-param-gather reference:

  • MXFP8: tests/unit_tests/test_fp8_param.py::TestFP8Param::test_mxfp8_eval_transition
  • NVFP4: tests/unit_tests/test_fp4_param.py::TestFP4Param::test_nvfp4_eval_transition
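
For reference, a rough outline of what such a transition test asserts; build_model_and_optimizer and run_step are hypothetical stand-ins for the MCore test utilities, not the actual test code:

import torch

# Sketch of a train -> eval -> train check against a reference model that
# does not overlap the param gather; helper names are assumed.
def test_eval_transition_matches_reference():
    ref_model, ref_opt = build_model_and_optimizer(overlap_param_gather=False)
    model, opt = build_model_and_optimizer(
        overlap_param_gather=True,
        fp8_param_gather=True,
        reuse_grad_buf_for_mxfp8_param_ag=True,
    )
    for _ in range(2):
        assert torch.allclose(run_step(model, opt, train=True),
                              run_step(ref_model, ref_opt, train=True))
        # The eval step right after an optimizer update is where stale
        # weights would surface if the param buffer were not refreshed.
        assert torch.allclose(run_step(model, opt, train=False),
                              run_step(ref_model, ref_opt, train=False))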

@gautham-kollu
Contributor

/ok to test f71fe72

@cuichenx cuichenx added the ready-to-merge label May 8, 2026
@cuichenx cuichenx enabled auto-merge (squash) May 8, 2026 18:28
@cuichenx cuichenx disabled auto-merge May 8, 2026 23:00
@cuichenx
Contributor

cuichenx commented May 8, 2026

Training code coverage is tested in MCore.

@cuichenx cuichenx merged commit 71c63da into NVIDIA-NeMo:main May 8, 2026
94 of 95 checks passed
@WanZzzzzz
Contributor Author

Test code is here:

NVIDIA/Megatron-LM#4563
