[ckpt] fix: Use DTensor split shapes for Megatron-FSDP TP loading #3746

Open

conver334 wants to merge 1 commit into NVIDIA-NeMo:main from conver334:codex/mfsdp-dtensor-qwen35-fix

Conversation

@conver334 Contributor

What does this PR do?

Fix Megatron-FSDP Hugging Face weight loading for tensor-parallel DTensor parameters, including Qwen3 MoE fused expert gate/up weights.

When loading HF weights into Megatron-FSDP models with tensor parallelism, DTensor parameters can expose a local target shape that is not sufficient to derive the receive buffer for the HF tensor split. This PR broadcasts the actual split tensor shape before scatter and uses that shape for DTensor receive buffers.
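
For context, here is a minimal sketch of the shape-broadcast-then-scatter idea described above. The helper name scatter_hf_split, the tp_group / tp_rank arguments, and the src-rank handling are illustrative assumptions, not the PR's actual code:

```python
import torch
import torch.distributed as dist

def scatter_hf_split(hf_tensor, target_param, tp_group, tp_rank, dim=0):
    # Sketch only: TP rank 0 holds the full HF tensor and knows the true
    # per-rank split shape; the other ranks may only see a DTensor local
    # shape that is not enough to size the receive buffer on its own.
    tp_size = dist.get_world_size(tp_group)
    if tp_rank == 0:
        splits = list(torch.chunk(hf_tensor, tp_size, dim=dim))
        shape_box = [tuple(splits[0].shape)]
    else:
        splits = None
        shape_box = [None]
    # Broadcast the actual split shape from TP rank 0 before the scatter,
    # then allocate the receive buffer from that shape.
    dist.broadcast_object_list(shape_box, src=0, group=tp_group)
    recv = torch.empty(shape_box[0], dtype=target_param.dtype, device=target_param.device)
    dist.scatter(recv, scatter_list=splits, src=0, group=tp_group)
    return recv
```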

Changelog

  • Use the actual HF tensor split shape, broadcast from TP rank 0.
  • Apply the DTensor split-shape path to ColumnParallelMapping, RowParallelMapping, and GatedMLPMapping.
  • Treat FusedGatedExpertMapping DTensor target shapes as logical fused gate/up shapes instead of scaling them by TP again.
  • Enable calculate_per_token_loss in the Megatron-FSDP conversion examples when cp > 1, matching the Qwen3-VL context-parallel setup requirement.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Local validation (Megatron-FSDP HF load/export roundtrip):

  • Qwen3.5-35B-A3B:
    • TP=2 CP=1 EP=4, TP=1 CP=1 EP=8, TP=2 CP=1 EP=1, TP=4 CP=1 EP=2, TP=2 CP=2 EP=2 passed
  • Qwen2.5-7B:
    • TP=2 CP=1 EP=1 passed
  • Qwen3-30B-A3B:
    • TP=2 CP=1 EP=4 passed

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
    • No new unit test changes are included in this PR. Existing test_param_mapping.py coverage was rerun successfully, and the fix was validated with Megatron-FSDP HF load/export roundtrip jobs across dense and MoE Qwen models.
  • [ ] Did you add or update any necessary documentation?
    • No documentation update is required; this fixes conversion behavior and updates conversion examples only.
  • [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • No new optional dependency or import path is introduced.
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open a "Draft" PR.

Additional Information

  • Related issue: N/A
  • Branch head: 64bcc85b Fix Megatron-FSDP DTensor TP loading
  • Notes: hf_to_megatron_fsdp_generate_text.py still does not fully validate CP generation for Qwen3.5. Enabling calculate_per_token_loss gets past model construction, but the current generation path needs separate CP+MTP-aware handling. This PR keeps the CP-required config setting in the example and does not add a local generation workaround.

Signed-off-by: conver334 <conver334@gmail.com>
@copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@conver334 Contributor Author

/ok to test 64bcc85

@conver334 conver334 requested review from HollowMan6 and cspades and removed request for HollowMan6 May 11, 2026 10:00
@yaoyu-33 yaoyu-33 added the area:ckpt (Checkpoint conversion, loading, export, and save paths), bug (Something isn't working), and needs-review (PR is ready for code review and waiting on a reviewer) labels May 11, 2026
model_provider.context_parallel_size = cp
model_provider.expert_model_parallel_size = ep
if cp > 1:
    model_provider.calculate_per_token_loss = True

Contributor

This should not be here; it can be added in MBridge's TransformerConfigs finalize. But why is it needed here in conversion / generate?

@yaoyu-33 yaoyu-33 (Contributor) left a comment

Thanks for tracking down the Qwen3 MoE fused-expert loading bug — the FusedGatedExpertMapping conditional is clearly the load-bearing fix. A design question on the broader change before approving:

Cost of the broadcast

_broadcast_tp_split_shape fires one broadcast_object_list per parallel parameter on every HF→Megatron load. For a large model that's thousands of small synchronous broadcasts on the TP group. Is this needed for the non-fused mappings?

Walking through each call site:

  • ColumnParallelMapping, RowParallelMapping, GatedMLPMapping: the prior formulas (target_param.shape[parallel_dim] // tp_size) are the correct local TP shard if DTensor reports the unsharded logical shape — which is exactly the assumption the new FusedGatedExpertMapping branch relies on (gate_full_shape = (target_shape[0] // 2, target_shape[1]), no * tp_size). So either DTensor reports unsharded logical shapes (in which case the old formulas were fine and the broadcast is unnecessary), or it doesn't (in which case the new FusedGatedExpert branch is wrong). Both can't be true simultaneously.
  • FusedGatedExpertMapping: the old code scaled the shape by TP a second time when target_shape was already the logical [2*I, H]. The new conditional is exactly right and should stay; a short shape sketch follows below.
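
To make the shape arithmetic concrete, here is a small illustrative example under the "DTensor reports the unsharded logical shape" assumption; tp_size, I, and H below are made-up example values, not taken from the PR:

```python
# Illustrative shape math only; not code from the PR.
tp_size = 2
I, H = 4096, 1024                 # example intermediate and hidden sizes
target_shape = (2 * I, H)         # fused [2*I, H] gate/up weight, as the logical DTensor shape

# Column/row/gated-MLP style local shard under the previous formula:
col_local_shape = (target_shape[0] // tp_size, *target_shape[1:])        # (4096, 1024)

# FusedGatedExpertMapping: split the fused dim into gate/up without scaling by TP again.
gate_full_shape = (target_shape[0] // 2, target_shape[1])                # (4096, 1024) logical gate weight
old_double_scaled = (target_shape[0] // 2 * tp_size, target_shape[1])    # (8192, 1024) -- too large
```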

Suggestion

  1. Keep the FusedGatedExpertMapping conditional — that's the real fix.
  2. Revert ColumnParallelMapping, RowParallelMapping, GatedMLPMapping to the previous DTensor branches.
  3. Drop _broadcast_tp_split_shape and the Optional[output_shape] plumbing entirely.

If there's a specific non-fused param where target_param.shape // tp_size actually fails under Megatron-FSDP, could you share the placement spec and observed shape? That would either justify the broadcast or point to a cleaner fix (e.g., normalizing the logical shape at DTensor-wrap time).

If a broadcast is truly unavoidable, consider one batched broadcast_object_list({name: shape, ...}) at the start of the HF→Megatron load instead of one-per-param.
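
If the broadcast does turn out to be unavoidable, here is a rough sketch of the batched variant suggested above (the helper name broadcast_all_split_shapes and the dict payload are hypothetical, not an existing NeMo/Megatron API):

```python
import torch.distributed as dist

def broadcast_all_split_shapes(param_names, rank0_split_shapes, tp_group, tp_rank):
    # One collective for the whole HF->Megatron load instead of one per parameter.
    # rank0_split_shapes is only meaningful on TP rank 0; other ranks may pass None.
    payload = [dict(zip(param_names, rank0_split_shapes)) if tp_rank == 0 else None]
    dist.broadcast_object_list(payload, src=0, group=tp_group)
    return payload[0]  # {param_name: split_shape}, now available on every TP rank
```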

Unrelated change

The calculate_per_token_loss = True when cp > 1 in the two example scripts isn't part of the DTensor shape fix. Worth splitting out or at least calling out separately in the changelog so the bisect story stays clean.

-    output_shape = [target_param.shape[0] // self.tp_size, *target_param.shape[1:]]
-else:
-    output_shape = target_param.shape
+output_shape = None if isinstance(target_param, DTensor) else target_param.shape

Contributor

Reverting this hunk avoids one broadcast per column-parallel param. The old formula is correct under the same DTensor-shape semantic the new FusedGatedExpertMapping branch assumes.

Suggested change

-output_shape = None if isinstance(target_param, DTensor) else target_param.shape
+if isinstance(target_param, DTensor):
+    output_shape = (target_param.shape[0] // self.tp_size, *target_param.shape[1:])
+else:
+    output_shape = target_param.shape

-    output_shape = [target_param.shape[0], target_param.shape[1] // self.tp_size, *target_param.shape[2:]]
-else:
-    output_shape = target_param.shape
+output_shape = None if isinstance(target_param, DTensor) else target_param.shape

Contributor

Same as the ColumnParallelMapping comment — the prior DTensor branch is correct and avoids a per-param broadcast.

Suggested change

-output_shape = None if isinstance(target_param, DTensor) else target_param.shape
+if isinstance(target_param, DTensor) and hf_weights.ndim != 1:
+    output_shape = (target_param.shape[0], target_param.shape[1] // self.tp_size, *target_param.shape[2:])
+else:
+    output_shape = target_param.shape

@yaoyu-33 yaoyu-33 added the needs-author (Author action is required before review or merge can continue) label and removed the needs-review (PR is ready for code review and waiting on a reviewer) label May 11, 2026