[ckpt] fix: Use DTensor split shapes for Megatron-FSDP TP loading #3746

conver334 wants to merge 1 commit into
Conversation
Signed-off-by: conver334 <conver334@gmail.com>
/ok to test 64bcc85
```python
model_provider.context_parallel_size = cp
model_provider.expert_model_parallel_size = ep
if cp > 1:
    model_provider.calculate_per_token_loss = True
```
this should not be here, it can be added in MBridge's TransformerConfigs finalize - but why is it needed here in conversion / generate?
yaoyu-33 left a comment
Thanks for tracking down the Qwen3 MoE fused-expert loading bug — the `FusedGatedExpertMapping` conditional is clearly the load-bearing fix. A design question on the broader change before approving:
Cost of the broadcast
`_broadcast_tp_split_shape` fires one `broadcast_object_list` per parallel parameter on every HF→Megatron load. For a large model that's thousands of small synchronous broadcasts on the TP group. Is this needed for the non-fused mappings?
Walking through each call site:
- `ColumnParallelMapping`, `RowParallelMapping`, `GatedMLPMapping`: the prior formulas (`target_param.shape[parallel_dim] // tp_size`) are the correct local TP shard if DTensor reports the unsharded logical shape — which is exactly the assumption the new `FusedGatedExpertMapping` branch relies on (`gate_full_shape = (target_shape[0] // 2, target_shape[1])`, no `* tp_size`). So either DTensor reports unsharded logical shapes (in which case the old formulas were fine and the broadcast is unnecessary), or it doesn't (in which case the new FusedGatedExpert branch is wrong). Both can't be true simultaneously; see the repro sketch after this list.
- `FusedGatedExpertMapping`: the old code double-sharded by `tp` when `target_shape` was already `[2*I, H]`. The new conditional is exactly right and should stay.
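For reference, a minimal repro of the DTensor shape semantics this argument hinges on: `DTensor.shape` reports the unsharded logical shape, while `to_local()` gives the per-rank shard. This sketch assumes torch >= 2.4 with a 2-rank gloo group; the tensor sizes are illustrative.

```python
# Run with: torchrun --nproc-per-node=2 repro.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

dist.init_process_group("gloo")
tp_size = dist.get_world_size()
mesh = init_device_mesh("cpu", (tp_size,))

dt = distribute_tensor(torch.randn(8, 4), mesh, placements=[Shard(0)])

# DTensor.shape is the unsharded logical shape ...
assert tuple(dt.shape) == (8, 4)
# ... so the old `shape[parallel_dim] // tp_size` formula recovers the local shard.
assert dt.to_local().shape[0] == dt.shape[0] // tp_size

dist.destroy_process_group()
```

If `DTensor.shape` instead reported the local shard under Megatron-FSDP, the fused branch's `// 2` on the logical shape would be the broken one — that's the contradiction above.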
Suggestion

- Keep the `FusedGatedExpertMapping` conditional — that's the real fix.
- Revert `ColumnParallelMapping`, `RowParallelMapping`, and `GatedMLPMapping` to the previous DTensor branches.
- Drop `_broadcast_tp_split_shape` and the `Optional[output_shape]` plumbing entirely.
If there's a specific non-fused param where `target_param.shape // tp_size` actually fails under Megatron-FSDP, could you share the placement spec and observed shape? That would either justify the broadcast or point to a cleaner fix (e.g., normalizing the logical shape at DTensor-wrap time).
If a broadcast is truly unavoidable, consider one batched `broadcast_object_list({name: shape, ...})` at the start of the HF→Megatron load instead of one per param, along the lines of the sketch below.
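A minimal sketch of that batched variant, assuming `torch.distributed`; `param_mappings` and its `split_shape` accessor are hypothetical names for whatever the loader already tracks per parameter:

```python
import torch.distributed as dist

def broadcast_all_tp_split_shapes(param_mappings, tp_group):
    """One broadcast_object_list for the whole load instead of one per parameter."""
    src = dist.get_global_rank(tp_group, 0)
    if dist.get_rank(tp_group) == 0:
        # TP rank 0 computes every split shape up front (hypothetical accessor).
        payload = [{name: tuple(m.split_shape) for name, m in param_mappings.items()}]
    else:
        payload = [None]
    dist.broadcast_object_list(payload, src=src, group=tp_group)
    return payload[0]  # {param_name: split_shape}, now available on every TP rank
```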
Unrelated change
The `calculate_per_token_loss = True` when `cp > 1` in the two example scripts isn't part of the DTensor shape fix. Worth splitting out or at least calling out separately in the changelog so the bisect story stays clean.
```diff
-            output_shape = [target_param.shape[0] // self.tp_size, *target_param.shape[1:]]
-        else:
-            output_shape = target_param.shape
+        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
```
Reverting this hunk avoids one broadcast per column-parallel param. The old formula is correct under the same DTensor-shape semantic the new `FusedGatedExpertMapping` branch assumes.
Suggested change:

```diff
-        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
+        if isinstance(target_param, DTensor):
+            output_shape = (target_param.shape[0] // self.tp_size, *target_param.shape[1:])
+        else:
+            output_shape = target_param.shape
```
```diff
-            output_shape = [target_param.shape[0], target_param.shape[1] // self.tp_size, *target_param.shape[2:]]
-        else:
-            output_shape = target_param.shape
+        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
```
Same as the `ColumnParallelMapping` comment — the prior DTensor branch is correct and avoids a per-param broadcast.
Suggested change:

```diff
-        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
+        if isinstance(target_param, DTensor) and hf_weights.ndim != 1:
+            output_shape = (target_param.shape[0], target_param.shape[1] // self.tp_size, *target_param.shape[2:])
+        else:
+            output_shape = target_param.shape
```
What does this PR do?
Fix Megatron-FSDP Hugging Face weight loading for tensor-parallel DTensor parameters, including Qwen3 MoE fused expert gate/up weights.
When loading HF weights into Megatron-FSDP models with tensor parallelism, DTensor parameters can expose a local target shape that is not sufficient to derive the receive buffer for the HF tensor split. This PR broadcasts the actual split tensor shape before scatter and uses that shape for DTensor receive buffers.
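As a rough illustration of that mechanism (not the exact patch), the per-parameter broadcast could look like the sketch below; `tp_group` and the assumption that TP rank 0 holds the split tensor are illustrative:

```python
import torch.distributed as dist

def _broadcast_tp_split_shape(split_tensor, tp_group):
    """Broadcast the actual split shape from TP rank 0 so every rank can
    allocate a correctly sized receive buffer before the scatter."""
    src = dist.get_global_rank(tp_group, 0)
    obj = [tuple(split_tensor.shape) if dist.get_rank(tp_group) == 0 else None]
    dist.broadcast_object_list(obj, src=src, group=tp_group)
    return obj[0]
```

Each rank can then allocate its receive buffer with `torch.empty(shape, ...)` from the broadcast shape, instead of deriving it from the DTensor's local target shape.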
Changelog
- Broadcast the actual TP split shape and use it for DTensor receive buffers in `ColumnParallelMapping`, `RowParallelMapping`, and `GatedMLPMapping`.
- Treat `FusedGatedExpertMapping` DTensor target shapes as logical fused gate/up shapes instead of scaling them by TP again (see the sketch below).
- Enable `calculate_per_token_loss` in the Megatron-FSDP conversion examples when `cp > 1`, matching the Qwen3-VL context-parallel setup requirement.
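A minimal sketch of the fused gate/up shape handling, assuming the fused expert weight stacks the gate and up halves along dim 0 as `[2*I, H]` and that `target_shape` is the DTensor's logical (unsharded) shape; names are illustrative, not the exact patch:

```python
def split_fused_gate_up(target_shape):
    # Each half is the full logical intermediate size I. No extra // tp_size:
    # DTensor already reports the unsharded shape, so dividing by tp again
    # would double-shard the gate/up halves.
    gate_full_shape = (target_shape[0] // 2, target_shape[1])
    up_full_shape = gate_full_shape
    return gate_full_shape, up_full_shape
```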
GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Local validation:
Megatron-FSDP HF load/export roundtrip validation:
- `TP=2 CP=1 EP=4`, `TP=1 CP=1 EP=8`, `TP=2 CP=1 EP=1`, `TP=4 CP=1 EP=2`, `TP=2 CP=2 EP=2` passed
- `TP=2 CP=1 EP=1` passed
- `TP=2 CP=1 EP=4` passed

Before your PR is "Ready for review"
Pre checks:
`test_param_mapping.py` coverage was rerun successfully, and the fix was validated with Megatron-FSDP HF load/export roundtrip jobs across dense and MoE Qwen models.

If you haven't finished some of the above items you can still open a "Draft" PR.
Additional Information
64bcc85b Fix Megatron-FSDP DTensor TP loading

`hf_to_megatron_fsdp_generate_text.py` still does not fully validate CP generation for Qwen3.5. Enabling `calculate_per_token_loss` gets past model construction, but the current generation path needs separate CP+MTP-aware handling. This PR keeps the CP-required config setting in the example and does not add a local generation workaround.