[model, ckpt] fix: align GPT-OSS BF16 down_proj orientation on import (r0.4.0)#3753
Merged
… (r0.4.0)

Per-expert ``down_proj`` is square for GPT-OSS-20B/120B (hidden == intermediate), so the bridge cannot auto-detect orientation from shape alone. BF16 checkpoints (e.g. unsloth/gpt-oss-20b-BF16, and what ``transformers.GptOssForCausalLM`` produces at init) store it as [E, intermediate, hidden]; MXFP4-dequantized weights come out as [E, hidden, intermediate]. Megatron's TE RowParallelGroupedLinear expects per-expert (hidden, intermediate), so the BF16 path needs a transpose on import, while the MXFP4 path is already aligned.

Without the import transpose, BF16 imports silently store ``down_proj`` in the wrong orientation: a roundtrip against the same BF16 source still matches (import and export are symmetrically broken), but inference is broken — forward-pass cosine similarity vs HF drops to ~0.54 for gpt-oss-20b on a saved/reloaded BF16-imported Megatron checkpoint.

Fix the import side in ``maybe_modify_loaded_hf_weight``, and add a coordinated per-expert transpose in ``GPTOSSMLPDownProjMapping.megatron_to_hf`` so the grouped-export stack returns to HF's [E, intermediate, hidden] layout.

The shape detection in ``maybe_modify_loaded_hf_weight`` reads ``self.hf_pretrained.config``. On main this is already populated by ``MegatronModelBridge.build_conversion_tasks``; on r0.4.0 the decentralized-PG refactor (#3674) dropped that assignment, so this backport restores the one-line stash inside ``build_conversion_tasks`` to keep ``self.hf_pretrained`` available to subclass hooks. (No behavioral change beyond making the attribute reachable again.)

Verification on r0.4.0 with TP=1 PP=8 EP=1:
- BF16 import → forward cos sim vs HF: 0.999973
- MXFP4 import → forward cos sim vs HF: 0.999973
- BF16 import → reload → roundtrip vs BF16 HF: 411/411 ✅
- MXFP4 import → reload → roundtrip vs BF16 HF: 411/411 ✅

Signed-off-by: Chen Cui <chcui@nvidia.com>
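The orientation rule the import-side fix enforces can be sketched as follows. This is a hypothetical standalone helper, not the bridge's actual code: for non-square per-expert weights the shape-vs-config comparison disambiguates on its own, while for the square gpt-oss-20b/120b shapes the source format has to decide.

```python
import torch

def align_down_proj_for_import(w: torch.Tensor, hidden: int, intermediate: int,
                               source_is_bf16: bool) -> torch.Tensor:
    """Return a per-expert down_proj stack as [E, hidden, intermediate]."""
    _, d0, d1 = w.shape
    if d0 != d1:
        # Non-square: the shape alone identifies the layout.
        return w if (d0, d1) == (hidden, intermediate) else w.transpose(-2, -1)
    # Square (hidden == intermediate, as in gpt-oss-20b/120b): shape is
    # ambiguous, so fall back on provenance. BF16 checkpoints store
    # [E, intermediate, hidden]; MXFP4 dequant already emits
    # [E, hidden, intermediate].
    return w.transpose(-2, -1) if source_is_bf16 else w

# Rectangular BF16-style input gets flipped into the Megatron layout:
E, H, I = 2, 3, 5
bf16 = torch.randn(E, I, H)
assert align_down_proj_for_import(bf16, H, I, source_is_bf16=True).shape == (E, H, I)
```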
Builds on the BF16 import-side transpose by extending the GPT-OSS
``down_proj`` export to handle the EP-aggregated path, and rewrites the
toy conversion test to faithfully model both real checkpoint layouts.
Bridge change (``gpt_oss_bridge.py``)
- ``GPTOSSMLPDownProjMapping.megatron_to_hf`` now transposes the last two
dims of any ndim>=2 weight tensor, not only 2-D ones. Under EP the
parent ``gather_from_ep_ranks`` may concatenate the per-rank experts
before the per-expert export hook runs, producing a 3-D
``(ep_size, hidden, intermediate)`` tensor that the previous 2-D-only
guard skipped. Bias mappings (``hf_param`` ending in ``_bias``) are
passed through unchanged so per-expert biases that arrive 2-D under EP
are not flipped.
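The export-side rule above (transpose trailing dims of any ndim>=2 weight, pass biases through) can be sketched as a small standalone function; the name and signature are illustrative, not the mapping's real interface:

```python
import torch

def export_down_proj(tensor: torch.Tensor, hf_param: str) -> torch.Tensor:
    """Flip the trailing two dims of weight tensors on export; skip biases."""
    if hf_param.endswith("_bias") or tensor.ndim < 2:
        # Per-expert biases can arrive 2-D under EP and must not be flipped.
        return tensor
    return tensor.transpose(-2, -1).contiguous()

# Both the 2-D per-expert weight and the 3-D EP-gathered stack get flipped:
w2 = torch.randn(8, 16)        # (hidden, intermediate)
w3 = torch.randn(4, 8, 16)     # (ep_size, hidden, intermediate)
assert export_down_proj(w2, "down_proj").shape == (16, 8)
assert export_down_proj(w3, "down_proj").shape == (4, 16, 8)
assert export_down_proj(torch.randn(4, 8), "down_proj_bias").shape == (4, 8)
```

A 2-D-only guard would have silently skipped `w3`, which is exactly the EP-aggregated case described above.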
Toy test rewrite (``test_gpt_oss_conversion.py``)
- New fixture builds two toys from the same underlying weights:
* BF16 toy: faithful unsloth-style layout
(``gate_up_proj`` ``[E, hidden, 2*intermediate]``, ``down_proj``
``[E, intermediate, hidden]``).
* MXFP4 toy: ``*_blocks``/``*_scales`` whose ``_dequantize_mxfp4``
output equals the BF16 toy transposed per expert, matching the
``openai/gpt-oss-20b`` shipping layout.
- Test parametrizes over ``source ∈ {bf16, mxfp4}`` × ``{PP=2, EP=2}``.
BF16 runs the existing one-shot roundtrip; MXFP4 runs as a two-step
``convert_checkpoints_multi_gpu.py import`` then
``hf_megatron_roundtrip_multi_gpu.py --megatron-load-path`` against
the BF16 toy as the reference, since the verification table cannot
resolve ``down_proj``/``gate_up_proj`` keys in a quantized state
dict.
- ``hidden_size`` and ``intermediate_size`` are intentionally unequal so
that any wrong-direction transpose surfaces as a shape mismatch
(square real-model shapes silently mask layout bugs as wrong values).
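A minimal illustration of why the unequal dimensions matter (toy values, not the test's actual config): with rectangular shapes a missing or wrong-direction transpose is a shape error, while with the square real-model shapes both orientations "fit" and the bug degrades to silently wrong values.

```python
import torch

hidden, intermediate = 6, 10            # deliberately unequal, as in the toy
w = torch.randn(intermediate, hidden)   # HF BF16 per-expert down_proj layout
megatron_shape = (hidden, intermediate) # what the grouped GEMM expects

assert w.t().shape == megatron_shape    # correct transpose fits
assert w.shape != megatron_shape        # forgetting it fails loudly on shape

# Square dims (real gpt-oss-20b/120b): both orientations have the same shape,
# so a layout bug cannot surface as a shape mismatch.
sq = torch.randn(8, 8)
assert sq.shape == sq.t().shape
```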
Verification on this branch
- All 4 toy parametrizations pass:
``bf16-PP``, ``bf16-EP``, ``mxfp4-PP``, ``mxfp4-EP``.
- Real model (``unsloth/gpt-oss-20b-BF16`` HF reference, TP=1):
* BF16 import → forward cos sim vs HF: PP=8 0.999973, EP=8 0.999975.
* MXFP4 import → forward cos sim vs HF: PP=8 0.999973, EP=8 0.999975.
* Reload-roundtrip vs BF16 HF: 411/411 ✅ for all four
(BF16/MXFP4) × (PP=8/EP=8) combinations.
Signed-off-by: Chen Cui <chcui@nvidia.com>
Summary
Backports the GPT-OSS BF16 ``down_proj`` orientation fix from #3743 to the r0.4.0 release branch.

This is a silent inference-correctness bug: BF16 GPT-OSS checkpoints (e.g. unsloth/gpt-oss-20b-BF16, or anything ``transformers.GptOssForCausalLM`` produces at init) imported through the bridge ran inference with ``down_proj`` weights stored in the wrong orientation. Forward-pass cosine similarity vs HF dropped to ~0.54 on a saved/reloaded BF16-imported Megatron checkpoint, even though the in-memory roundtrip looked clean (import and export were symmetrically broken on the BF16 path).

Root cause
Per-expert ``down_proj`` is square for GPT-OSS-20B/120B (hidden == intermediate), so the bridge cannot auto-detect orientation from shape alone:

- BF16 checkpoints store it as [E, intermediate, hidden], mirroring ``gate_up_proj``'s [E, hidden, 2*intermediate] convention.
- MXFP4-dequantized weights come out as [E, hidden, intermediate].
- Megatron's TE ``RowParallelGroupedLinear`` expects per-expert (hidden, intermediate).
- ``gate_up_proj`` is non-square, so ``_align_expert_weight_to_shape`` already auto-detects and transposes; ``down_proj`` was passed straight through with no orientation alignment, so MXFP4 happened to land correct (matching the GEMM layout) and BF16 silently landed transposed.

Why prior PRs did not fully address this
- #3162 (fix gpt-oss down_proj weight handling): Removed an incorrect import-side transpose and restored the export-side transpose to feed vLLM the HF-convention layout. Resolved the immediate vLLM divergence but left the bridge-internal layout assumption ("Megatron expects [in, out]") incorrect — the asymmetric import/export silently corrupted any saved/reloaded checkpoint.
- #3250 (adaptive transpose_on_export for GPT-OSS expert weights): Identified that #3162's asymmetry corrupted save→reload cycles and tried to paper over it by adding an adaptive ``transpose_on_export = True`` wrapper. Worked for the cases it was tested against, but added a special case that the rest of the bridge didn't model uniformly.
- fix gpt oss export: Reverted #3250's adaptive wrapper and removed the export-side transpose entirely on the grounds that "the transpose is a NeMo-RL refit concern, not a bridge concern." Restored the symmetric-broken state where the BF16 in-memory roundtrip succeeds and inference is silently wrong.

What none of these caught is that Megatron's TE GroupedLinear expects per-expert (hidden, intermediate) — the standard PyTorch ``nn.Linear`` convention — not (intermediate, hidden). With a BF16 source, the import has to transpose and the export has to transpose back. With an MXFP4 source, dequantization already emits the right layout. A forward-pass cosine-similarity check against HF would have caught any of these regressions; it wasn't run on either prior fix's verification path.

Fix
- In ``GPTOSSBridge.maybe_modify_loaded_hf_weight``, transpose ``down_proj`` when loading from a non-quantized BF16 checkpoint. Disambiguation: when the per-expert shape is non-square, shape-vs-config uniquely identifies the layout; when square (gpt-oss-20b/120b), default to the ``transformers.GptOssForCausalLM`` init layout [E, intermediate, hidden].
- In ``GPTOSSMLPDownProjMapping.megatron_to_hf``, transpose the last two dims of each ndim>=2 weight tensor on the way out so the grouped-export stack reassembles in HF's [E, intermediate, hidden] layout. Under EP, ``gather_from_ep_ranks`` may have already concatenated per-rank experts into a 3-D (ep_size, hidden, intermediate) tensor, so the transpose runs unconditionally on the trailing two dims rather than only on 2-D inputs. Bias mappings are passed through untouched.
- Restore the one-line stash in ``MegatronModelBridge.build_conversion_tasks`` that sets ``self.hf_pretrained = hf_pretrained``. On main this assignment is already present; on r0.4.0 it was dropped by the decentralized-PG refactor in #3674 (support HF→Megatron conversion under decentralized PGs). The shape detection in ``maybe_modify_loaded_hf_weight`` reads ``self.hf_pretrained.config``, so the stash is restored to keep that hook self-contained. No behavioral change beyond making the attribute reachable again.
- The MXFP4 dequant branch is left as-is (it already produces the GEMM-correct layout).
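The (hidden, intermediate) convention the fix targets is simply ``torch.nn.Linear``'s weight layout, where the stored matrix is (out_features, in_features). A quick check (toy dimensions, nothing model-specific):

```python
import torch

intermediate, hidden = 10, 6
# A "down" projection maps intermediate -> hidden, so its Linear weight is
# (out_features, in_features) == (hidden, intermediate).
down = torch.nn.Linear(in_features=intermediate, out_features=hidden, bias=False)
assert down.weight.shape == (hidden, intermediate)

x = torch.randn(3, intermediate)
assert down(x).shape == (3, hidden)
```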
Test changes

``tests/functional_tests/test_groups/models/gpt_oss/test_gpt_oss_conversion.py`` is rewritten to build two faithful toys from the same underlying weights:

- BF16 toy: faithful unsloth-style layout (``gate_up_proj`` [E, hidden, 2*intermediate], ``down_proj`` [E, intermediate, hidden]).
- MXFP4 toy: ``*_blocks``/``*_scales`` whose ``_dequantize_mxfp4`` output equals the BF16 toy transposed per expert, matching ``openai/gpt-oss-20b``'s shipping layout.
- The test parametrizes over source ∈ {bf16, mxfp4} × {PP=2, EP=2}. MXFP4 runs as a two-step ``convert_checkpoints_multi_gpu.py import`` followed by ``hf_megatron_roundtrip_multi_gpu.py --megatron-load-path`` against the BF16 toy reference, since the verification table cannot resolve ``down_proj``/``gate_up_proj`` keys in a quantized state dict.
- ``hidden_size != intermediate_size`` is intentional so any wrong-direction transpose surfaces as a shape mismatch — the previous toy used the MXFP4-dequant orientation for ``down_proj``, hiding the bug behind a symmetric pass-through.

Verification

Real model on this branch (HF reference: unsloth/gpt-oss-20b-BF16, TP=1):

- BF16 import → forward cos sim vs HF: PP=8 0.999973, EP=8 0.999975.
- MXFP4 import → forward cos sim vs HF: PP=8 0.999973, EP=8 0.999975.
- Reload-roundtrip vs BF16 HF: 411/411 ✅ for all four (BF16/MXFP4) × (PP=8/EP=8) combinations.

Toy tests on this branch: all 4 parametrizations pass (``bf16-PP``, ``bf16-EP``, ``mxfp4-PP``, ``mxfp4-EP``).
Test plan

- ``examples/conversion/compare_hf_and_megatron/compare.py`` cos sim vs HF for both BF16 and MXFP4 imports under PP=8 and EP=8
- ``examples/conversion/hf_megatron_roundtrip_multi_gpu.py`` with ``--megatron-load-path`` from both BF16 and MXFP4 imports, compared against unsloth BF16 under PP=8 and EP=8
- ``tests/functional_tests/test_groups/models/gpt_oss/test_gpt_oss_conversion.py`` — all 4 parametrizations green
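The forward cosine-similarity metric used throughout the verification can be sketched as below; this is an illustrative standalone check, not ``compare.py``'s actual interface. Two models' outputs on the same batch are flattened and compared by direction.

```python
import torch
import torch.nn.functional as F

def forward_cos_sim(out_a: torch.Tensor, out_b: torch.Tensor) -> float:
    """Cosine similarity between two flattened forward-pass outputs."""
    return F.cosine_similarity(out_a.flatten(), out_b.flatten(), dim=0).item()

a = torch.randn(2, 8, 32)           # e.g. (batch, seq, vocab) logits
assert abs(forward_cos_sim(a, a) - 1.0) < 1e-5   # identical models agree
assert forward_cos_sim(a, 2.0 * a) > 0.999       # invariant to output scale
```

Because the metric is scale-invariant and aggregates the whole output tensor, it is a cheap end-to-end signal that would have flagged the ~0.54 regression any of the prior partial fixes produced.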