fix(fsdp): recognize legacy GDN TP metadata #4664

Open

Glitchfix wants to merge 4 commits into NVIDIA:main from Glitchfix:fix-4553-gdn-fsdp-dtensor

Conversation


@Glitchfix Glitchfix commented May 7, 2026

What does this PR do?

Fix Megatron-FSDP DTensor checkpoint saving for GDN fused tensors whose copied meta tensors only carry legacy Megatron tensor-parallel metadata.

GDN conv1d parameters are annotated with tensor_model_parallel and partition_dim. handle_gdn_in_state_dict() copies those attributes onto meta tensors before calling make_fsdp_dtensor(), but the FSDP tensor-parallel detection path only recognized _tensor_parallel_mode. As a result, split GDN checkpoint tensors could lose their TP placement and be validated against a DP-only mesh.

This change makes Megatron-FSDP recognize the legacy TP attributes documented by make_fsdp_dtensor().
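
For illustration, a minimal sketch of the legacy-metadata fallback; the helper name is hypothetical, and only the attribute names and the copy behavior of handle_gdn_in_state_dict() come from the actual code path:

```python
# Hedged sketch, not the actual Megatron-FSDP source; the helper name is
# hypothetical, the attribute names are the ones this PR handles.
from typing import Optional

import torch


def legacy_tp_partition_dim(tensor: torch.Tensor) -> Optional[int]:
    """Shard dim from legacy Megatron TP metadata, or None if replicated."""
    # GDN conv1d parameters carry these attributes, and
    # handle_gdn_in_state_dict() copies them onto the meta tensors that
    # make_fsdp_dtensor() receives.
    if getattr(tensor, "tensor_model_parallel", False):
        return getattr(tensor, "partition_dim", 0)
    return None


# Mimic a copied meta tensor carrying the legacy metadata.
weight = torch.empty(8, 4, device="meta")
weight.tensor_model_parallel = True
weight.partition_dim = 0
assert legacy_tp_partition_dim(weight) == 0
assert legacy_tp_partition_dim(torch.empty(8, 4, device="meta")) is None
```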

Issue tracking

Linked issue: Fixes #4553

Validation

  • tools/autoformat.sh via uv run --with black --with isort --with pylint --with ruff --with mypy tools/autoformat.sh
    • black, isort, pylint, and ruff passed; no files were modified.
    • mypy reported environment missing-import/type issues in the checked files, and the script ignores mypy failures by design.
  • pytest -q tests/unit_tests/transformer/test_fsdp_dtensor_checkpoint.py (58 passed)
  • git diff --check
  • Focused distributed reproducer for the FSDP DTensor metadata path with TP=2 and EP=4 style scaling passed locally (a simplified sketch of that path follows this list). Since the fix preserves the copied TP partition metadata instead of hard-coding any parallel size, it should also apply to the reported TP=4 and EP=8 configuration.
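
For reference, a single-process sketch of the metadata path the reproducer exercises, without the actual TP/EP process groups; copy_legacy_tp_attrs is a hypothetical stand-in for what handle_gdn_in_state_dict() does before make_fsdp_dtensor() is called:

```python
import torch


def copy_legacy_tp_attrs(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Mirror the attribute copy performed on meta tensors (illustrative)."""
    for attr in ("tensor_model_parallel", "partition_dim"):
        if hasattr(src, attr):
            setattr(dst, attr, getattr(src, attr))
    return dst


# Stand-in for a GDN conv1d weight sharded on dim 0 under TP.
conv1d_weight = torch.empty(16, 4)
conv1d_weight.tensor_model_parallel = True
conv1d_weight.partition_dim = 0

meta_copy = copy_legacy_tp_attrs(
    conv1d_weight, torch.empty_like(conv1d_weight, device="meta")
)
# Before the fix, the detection path ignored these copied attributes, so the
# tensor fell through to a DP-only placement; after it, they are honored.
assert meta_copy.tensor_model_parallel and meta_copy.partition_dim == 0
```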

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR


copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Glitchfix Glitchfix force-pushed the fix-4553-gdn-fsdp-dtensor branch from 966d3b8 to 26e76ec on May 7, 2026 00:10
@Glitchfix Glitchfix marked this pull request as ready for review May 7, 2026 00:18
@Glitchfix Glitchfix requested review from a team as code owners May 7, 2026 00:18
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 7, 2026 00:18
@cspades cspades added the module: megatron-fsdp and Expert Review [deprecated] labels May 7, 2026
Member

@cspades cspades left a comment


LGTM, I think a while back we swapped to the new TP labels quite quickly, I think this should be fine. cc @shjwudp @xuwchen

@cspades cspades added the Final Review label and removed the Expert Review [deprecated] label May 7, 2026
GDN conv1d parameters use Megatron's legacy tensor_model_parallel and
partition_dim attributes. The FSDP DTensor checkpoint splitter copies
those attributes to meta tensors before calling make_fsdp_dtensor, but
the FSDP TP detection path only handled _tensor_parallel_mode.

Honor the legacy metadata so split GDN checkpoint tensors keep the TP
placement and validate against the full FSDP plus TP mesh.

Add regression coverage for copied legacy TP attributes and replicated
attributes.

Fixes NVIDIA#4553.

Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com>
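
An illustrative pytest-style sketch of the regression coverage the commit message describes; the test names and inline helper are hypothetical, not the actual tests in tests/unit_tests/transformer/test_fsdp_dtensor_checkpoint.py:

```python
import torch


def _partition_dim_from_legacy_attrs(t: torch.Tensor):
    # Sharded iff the legacy flag is truthy; replicated tensors yield None.
    if getattr(t, "tensor_model_parallel", False):
        return getattr(t, "partition_dim", 0)
    return None


def test_copied_legacy_tp_attrs_are_recognized():
    t = torch.empty(4, 2, device="meta")
    t.tensor_model_parallel = True
    t.partition_dim = 1
    assert _partition_dim_from_legacy_attrs(t) == 1


def test_replicated_attrs_yield_no_partition_dim():
    t = torch.empty(4, 2, device="meta")
    t.tensor_model_parallel = False
    assert _partition_dim_from_legacy_attrs(t) is None
```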
@Glitchfix Glitchfix force-pushed the fix-4553-gdn-fsdp-dtensor branch from 26e76ec to 5bd5fec Compare May 7, 2026 19:05
@cspades
Member

cspades commented May 7, 2026

/ok to test e7c49ea

@cspades
Member

cspades commented May 8, 2026

/ok to test 118a63e

@cspades
Member

cspades commented May 8, 2026

@Glitchfix FYI don't rebase on main since it resets all the tests, pretty sure this PR can go in merge queue cleanly without testing directly on main branch. (But if there are errors in the CI/CD def fix those.)

@Glitchfix
Author

> @Glitchfix FYI don't rebase on main since it resets all the tests, pretty sure this PR can go in merge queue cleanly without testing directly on main branch. (But if there are errors in the CI/CD def fix those.)

gotcha, will be careful for upcoming PRs


Labels

Final Review, module: megatron-fsdp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Megatron-Fsdp checkpoint fails on Qwen3.5 with tensor and expert parallelism

2 participants