fix(fsdp): recognize legacy GDN TP metadata #4664

Open

Glitchfix wants to merge 4 commits into NVIDIA:main from Glitchfix:fix-4553-gdn-fsdp-dtensor

Conversation


@Glitchfix Glitchfix commented May 7, 2026

What does this PR do?

Fix Megatron-FSDP DTensor checkpoint saving for GDN fused tensors whose copied meta tensors only carry legacy Megatron tensor-parallel metadata.

GDN conv1d parameters are annotated with tensor_model_parallel and partition_dim. handle_gdn_in_state_dict() copies those attributes onto meta tensors before calling make_fsdp_dtensor(), but the FSDP tensor-parallel detection path only recognized _tensor_parallel_mode. As a result, split GDN checkpoint tensors could lose their TP placement and be validated against a DP-only mesh.

This change makes Megatron-FSDP recognize the legacy TP attributes documented by make_fsdp_dtensor().
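
For illustration, a minimal sketch of the legacy-metadata fallback; the helper name is hypothetical, and only the attribute names and the copy behavior of handle_gdn_in_state_dict() come from the actual code path:

```python
# Hedged sketch, not the actual Megatron-FSDP source; the helper name is
# hypothetical, the attribute names are the ones this PR handles.
from typing import Optional

import torch


def legacy_tp_partition_dim(tensor: torch.Tensor) -> Optional[int]:
    """Shard dim from legacy Megatron TP metadata, or None if replicated."""
    # GDN conv1d parameters carry these attributes, and
    # handle_gdn_in_state_dict() copies them onto the meta tensors that
    # make_fsdp_dtensor() receives.
    if getattr(tensor, "tensor_model_parallel", False):
        return getattr(tensor, "partition_dim", 0)
    return None


# Mimic a copied meta tensor carrying the legacy metadata.
weight = torch.empty(8, 4, device="meta")
weight.tensor_model_parallel = True
weight.partition_dim = 0
assert legacy_tp_partition_dim(weight) == 0
assert legacy_tp_partition_dim(torch.empty(8, 4, device="meta")) is None
```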

Issue tracking

Linked issue: Fixes #4553

Validation

  • tools/autoformat.sh via uv run --with black --with isort --with pylint --with ruff --with mypy tools/autoformat.sh
    • black, isort, pylint, and ruff passed; no files were modified.
    • mypy reported environment missing-import/type issues in the checked files, and the script ignores mypy failures by design.
  • pytest -q tests/unit_tests/transformer/test_fsdp_dtensor_checkpoint.py (58 passed)
  • git diff --check
  • Focused distributed reproducer for the FSDP DTensor metadata path with TP=2 and EP=4 style scaling passed locally (a simplified sketch of that path follows this list). Since the fix preserves the copied TP partition metadata instead of hard-coding any parallel size, it should also apply to the reported TP=4 and EP=8 configuration.
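
For reference, a single-process sketch of the metadata path the reproducer exercises, without the actual TP/EP process groups; copy_legacy_tp_attrs is a hypothetical stand-in for what handle_gdn_in_state_dict() does before make_fsdp_dtensor() is called:

```python
import torch


def copy_legacy_tp_attrs(src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
    """Mirror the attribute copy performed on meta tensors (illustrative)."""
    for attr in ("tensor_model_parallel", "partition_dim"):
        if hasattr(src, attr):
            setattr(dst, attr, getattr(src, attr))
    return dst


# Stand-in for a GDN conv1d weight sharded on dim 0 under TP.
conv1d_weight = torch.empty(16, 4)
conv1d_weight.tensor_model_parallel = True
conv1d_weight.partition_dim = 0

meta_copy = copy_legacy_tp_attrs(
    conv1d_weight, torch.empty_like(conv1d_weight, device="meta")
)
# Before the fix, the detection path ignored these copied attributes, so the
# tensor fell through to a DP-only placement; after it, they are honored.
assert meta_copy.tensor_model_parallel and meta_copy.partition_dim == 0
```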

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR


copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Glitchfix Glitchfix force-pushed the fix-4553-gdn-fsdp-dtensor branch from 966d3b8 to 26e76ec on May 7, 2026 00:10
@Glitchfix Glitchfix marked this pull request as ready for review May 7, 2026 00:18
@Glitchfix Glitchfix requested review from a team as code owners May 7, 2026 00:18
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 7, 2026 00:18
@cspades cspades added the module: megatron-fsdp and Expert Review [deprecated] labels May 7, 2026
Member

@cspades cspades left a comment


LGTM, I think a while back we swapped to the new TP labels quite quickly, I think this should be fine. cc @shjwudp @xuwchen

@cspades cspades added the Final Review label and removed the Expert Review [deprecated] label May 7, 2026
GDN conv1d parameters use Megatron's legacy tensor_model_parallel and
partition_dim attributes. The FSDP DTensor checkpoint splitter copies
those attributes to meta tensors before calling make_fsdp_dtensor, but
the FSDP TP detection path only handled _tensor_parallel_mode.

Honor the legacy metadata so split GDN checkpoint tensors keep the TP
placement and validate against the full FSDP plus TP mesh.

Add regression coverage for copied legacy TP attributes and replicated
attributes.

Fixes NVIDIA#4553.

Signed-off-by: Shivanjan Chakravorty <shivanjanc@nvidia.com>
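
An illustrative pytest-style sketch of the regression coverage the commit message describes; the test names and inline helper are hypothetical, not the actual tests in tests/unit_tests/transformer/test_fsdp_dtensor_checkpoint.py:

```python
import torch


def _partition_dim_from_legacy_attrs(t: torch.Tensor):
    # Sharded iff the legacy flag is truthy; replicated tensors yield None.
    if getattr(t, "tensor_model_parallel", False):
        return getattr(t, "partition_dim", 0)
    return None


def test_copied_legacy_tp_attrs_are_recognized():
    t = torch.empty(4, 2, device="meta")
    t.tensor_model_parallel = True
    t.partition_dim = 1
    assert _partition_dim_from_legacy_attrs(t) == 1


def test_replicated_attrs_yield_no_partition_dim():
    t = torch.empty(4, 2, device="meta")
    t.tensor_model_parallel = False
    assert _partition_dim_from_legacy_attrs(t) is None
```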
@Glitchfix Glitchfix force-pushed the fix-4553-gdn-fsdp-dtensor branch from 26e76ec to 5bd5fec Compare May 7, 2026 19:05
@cspades
Member

cspades commented May 7, 2026

/ok to test e7c49ea

@cspades
Member

cspades commented May 8, 2026

/ok to test 118a63e

@cspades
Member

cspades commented May 8, 2026

@Glitchfix FYI don't rebase on main since it resets all the tests, pretty sure this PR can go in merge queue cleanly without testing directly on main branch. (But if there are errors in the CI/CD def fix those.)

@Glitchfix
Author

> @Glitchfix FYI don't rebase on main since it resets all the tests, pretty sure this PR can go in merge queue cleanly without testing directly on main branch. (But if there are errors in the CI/CD def fix those.)

gotcha, will be careful for upcoming PRs


Labels

Final Review, module: megatron-fsdp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Megatron-Fsdp checkpoint fails on Qwen3.5 with tensor and expert parallelism

2 participants