[ckpt] fix: Use DTensor split shapes for Megatron-FSDP TP loading #3746

conver334 wants to merge 1 commit into
Conversation
Signed-off-by: conver334 <conver334@gmail.com>
/ok to test 64bcc85
```python
model_provider.context_parallel_size = cp
model_provider.expert_model_parallel_size = ep
if cp > 1:
    model_provider.calculate_per_token_loss = True
```
this should not be here, it can be added in MBridge's TransformerConfigs finalize - but why is it needed here in conversion / generate?
yaoyu-33 left a comment
Thanks for tracking down the Qwen3 MoE fused-expert loading bug — the `FusedGatedExpertMapping` conditional is clearly the load-bearing fix. A design question on the broader change before approving:
Cost of the broadcast
`_broadcast_tp_split_shape` fires one `broadcast_object_list` per parallel parameter on every HF→Megatron load. For a large model that's thousands of small synchronous broadcasts on the TP group. Is this needed for the non-fused mappings?
Walking through each call site:
- `ColumnParallelMapping`, `RowParallelMapping`, `GatedMLPMapping`: the prior formulas (`target_param.shape[parallel_dim] // tp_size`) are the correct local TP shard if DTensor reports the unsharded logical shape — which is exactly the assumption the new `FusedGatedExpertMapping` branch relies on (`gate_full_shape = (target_shape[0] // 2, target_shape[1])`, no `* tp_size`). So either DTensor reports unsharded logical shapes (in which case the old formulas were fine and the broadcast is unnecessary), or it doesn't (in which case the new FusedGatedExpert branch is wrong). Both can't be true simultaneously; see the repro sketch after this list.
- `FusedGatedExpertMapping`: the old code double-sharded by `tp` when `target_shape` was already `[2*I, H]`. The new conditional is exactly right and should stay.
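For reference, a minimal repro of the DTensor shape semantics this argument hinges on: `DTensor.shape` reports the unsharded logical shape, while `to_local()` gives the per-rank shard. This sketch assumes torch >= 2.4 with a 2-rank gloo group; the tensor sizes are illustrative.

```python
# Run with: torchrun --nproc-per-node=2 repro.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

dist.init_process_group("gloo")
tp_size = dist.get_world_size()
mesh = init_device_mesh("cpu", (tp_size,))

dt = distribute_tensor(torch.randn(8, 4), mesh, placements=[Shard(0)])

# DTensor.shape is the unsharded logical shape ...
assert tuple(dt.shape) == (8, 4)
# ... so the old `shape[parallel_dim] // tp_size` formula recovers the local shard.
assert dt.to_local().shape[0] == dt.shape[0] // tp_size

dist.destroy_process_group()
```

If `DTensor.shape` instead reported the local shard under Megatron-FSDP, the fused branch's `// 2` on the logical shape would be the broken one — that's the contradiction above.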
Suggestion

- Keep the `FusedGatedExpertMapping` conditional — that's the real fix.
- Revert `ColumnParallelMapping`, `RowParallelMapping`, and `GatedMLPMapping` to the previous DTensor branches.
- Drop `_broadcast_tp_split_shape` and the `Optional[output_shape]` plumbing entirely.
If there's a specific non-fused param where `target_param.shape // tp_size` actually fails under Megatron-FSDP, could you share the placement spec and observed shape? That would either justify the broadcast or point to a cleaner fix (e.g., normalizing the logical shape at DTensor-wrap time).
If a broadcast is truly unavoidable, consider one batched `broadcast_object_list({name: shape, ...})` at the start of the HF→Megatron load instead of one per param, along the lines of the sketch below.
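A minimal sketch of that batched variant, assuming `torch.distributed`; `param_mappings` and its `split_shape` accessor are hypothetical names for whatever the loader already tracks per parameter:

```python
import torch.distributed as dist

def broadcast_all_tp_split_shapes(param_mappings, tp_group):
    """One broadcast_object_list for the whole load instead of one per parameter."""
    src = dist.get_global_rank(tp_group, 0)
    if dist.get_rank(tp_group) == 0:
        # TP rank 0 computes every split shape up front (hypothetical accessor).
        payload = [{name: tuple(m.split_shape) for name, m in param_mappings.items()}]
    else:
        payload = [None]
    dist.broadcast_object_list(payload, src=src, group=tp_group)
    return payload[0]  # {param_name: split_shape}, now available on every TP rank
```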
Unrelated change
The `calculate_per_token_loss = True` when `cp > 1` in the two example scripts isn't part of the DTensor shape fix. Worth splitting out or at least calling out separately in the changelog so the bisect story stays clean.
```diff
-            output_shape = [target_param.shape[0] // self.tp_size, *target_param.shape[1:]]
-        else:
-            output_shape = target_param.shape
+        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
```
Reverting this hunk avoids one broadcast per column-parallel param. The old formula is correct under the same DTensor-shape semantic the new `FusedGatedExpertMapping` branch assumes.
Suggested change:

```diff
-        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
+        if isinstance(target_param, DTensor):
+            output_shape = (target_param.shape[0] // self.tp_size, *target_param.shape[1:])
+        else:
+            output_shape = target_param.shape
```
```diff
-            output_shape = [target_param.shape[0], target_param.shape[1] // self.tp_size, *target_param.shape[2:]]
-        else:
-            output_shape = target_param.shape
+        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
```
Same as the `ColumnParallelMapping` comment — the prior DTensor branch is correct and avoids a per-param broadcast.
Suggested change:

```diff
-        output_shape = None if isinstance(target_param, DTensor) else target_param.shape
+        if isinstance(target_param, DTensor) and hf_weights.ndim != 1:
+            output_shape = (target_param.shape[0], target_param.shape[1] // self.tp_size, *target_param.shape[2:])
+        else:
+            output_shape = target_param.shape
```
What does this PR do?
Fix Megatron-FSDP Hugging Face weight loading for tensor-parallel DTensor parameters, including Qwen3 MoE fused expert gate/up weights.
When loading HF weights into Megatron-FSDP models with tensor parallelism, DTensor parameters can expose a local target shape that is not sufficient to derive the receive buffer for the HF tensor split. This PR broadcasts the actual split tensor shape before scatter and uses that shape for DTensor receive buffers.
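As a rough illustration of that mechanism (not the exact patch), the per-parameter broadcast could look like the sketch below; `tp_group` and the assumption that TP rank 0 holds the split tensor are illustrative:

```python
import torch.distributed as dist

def _broadcast_tp_split_shape(split_tensor, tp_group):
    """Broadcast the actual split shape from TP rank 0 so every rank can
    allocate a correctly sized receive buffer before the scatter."""
    src = dist.get_global_rank(tp_group, 0)
    obj = [tuple(split_tensor.shape) if dist.get_rank(tp_group) == 0 else None]
    dist.broadcast_object_list(obj, src=src, group=tp_group)
    return obj[0]
```

Each rank can then allocate its receive buffer with `torch.empty(shape, ...)` from the broadcast shape, instead of deriving it from the DTensor's local target shape.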
Changelog
- Broadcast the actual TP split shape and use it for DTensor receive buffers in `ColumnParallelMapping`, `RowParallelMapping`, and `GatedMLPMapping`.
- Treat `FusedGatedExpertMapping` DTensor target shapes as logical fused gate/up shapes instead of scaling them by TP again (see the sketch below).
- Enable `calculate_per_token_loss` in the Megatron-FSDP conversion examples when `cp > 1`, matching the Qwen3-VL context-parallel setup requirement.
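A minimal sketch of the fused gate/up shape handling, assuming the fused expert weight stacks the gate and up halves along dim 0 as `[2*I, H]` and that `target_shape` is the DTensor's logical (unsharded) shape; names are illustrative, not the exact patch:

```python
def split_fused_gate_up(target_shape):
    # Each half is the full logical intermediate size I. No extra // tp_size:
    # DTensor already reports the unsharded shape, so dividing by tp again
    # would double-shard the gate/up halves.
    gate_full_shape = (target_shape[0] // 2, target_shape[1])
    up_full_shape = gate_full_shape
    return gate_full_shape, up_full_shape
```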
GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Local validation:
Megatron-FSDP HF load/export roundtrip validation:
- `TP=2 CP=1 EP=4`, `TP=1 CP=1 EP=8`, `TP=2 CP=1 EP=1`, `TP=4 CP=1 EP=2`, `TP=2 CP=2 EP=2` passed
- `TP=2 CP=1 EP=1` passed
- `TP=2 CP=1 EP=4` passed

Before your PR is "Ready for review"
Pre checks:
`test_param_mapping.py` coverage was rerun successfully, and the fix was validated with Megatron-FSDP HF load/export roundtrip jobs across dense and MoE Qwen models.

If you haven't finished some of the above items you can still open a "Draft" PR.
Additional Information
64bcc85b Fix Megatron-FSDP DTensor TP loading

`hf_to_megatron_fsdp_generate_text.py` still does not fully validate CP generation for Qwen3.5. Enabling `calculate_per_token_loss` gets past model construction, but the current generation path needs separate CP+MTP-aware handling. This PR keeps the CP-required config setting in the example and does not add a local generation workaround.