[model, ckpt] fix: align GPT-OSS BF16 down_proj orientation on import #3743

Merged

cuichenx merged 4 commits into main from chcui/gpt-oss-down-proj-import-fix on May 8, 2026

Conversation

Contributor

cuichenx commented May 7, 2026

Summary

Fixes a silent inference-correctness bug in the GPT-OSS bridge: BF16 GPT-OSS checkpoints (e.g. unsloth/gpt-oss-20b-BF16, or anything transformers.GptOssForCausalLM produces at init) were imported with down_proj weights stored in the wrong orientation, so the resulting Megatron model ran inference on transposed weights. Forward-pass cosine similarity vs HF dropped to ~0.54 on a saved/reloaded BF16-imported Megatron checkpoint, even though the in-memory roundtrip looked clean (import and export were symmetrically broken on the BF16 path).

Root cause

Per-expert down_proj is square for GPT-OSS-20B/120B (hidden == intermediate), so the bridge can't auto-detect orientation from shape alone:

  • BF16 checkpoints store it as [E, intermediate, hidden], mirroring gate_up_proj's [E, hidden, 2*intermediate] convention.
  • MXFP4-dequantized weights come out as [E, hidden, intermediate].
  • Megatron's TE RowParallelGroupedLinear expects per-expert (hidden, intermediate).

gate_up_proj is non-square, so _align_expert_weight_to_shape already auto-detects and transposes it; down_proj was passed straight through with no orientation alignment, so the MXFP4 path happened to land in the correct GEMM layout while the BF16 path silently landed transposed.
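
A minimal sketch of that shape-based alignment, assuming [E, a, b] expert tensors; the helper name and signature below are illustrative, not the bridge's actual _align_expert_weight_to_shape:

```python
import torch

def align_expert_weight(w: torch.Tensor, target: tuple) -> torch.Tensor:
    """Return w with its last two dims in `target` order, transposing only
    when the shape uniquely identifies the other orientation."""
    if tuple(w.shape[-2:]) == tuple(target):
        return w
    if tuple(w.shape[-2:]) == tuple(reversed(target)):
        return w.transpose(-2, -1).contiguous()
    raise ValueError(f"cannot align {tuple(w.shape)} to {target}")

# gate_up_proj: [E, hidden, 2*intermediate] and its transpose are
# distinguishable shapes, so the right transpose is picked automatically.
# down_proj on gpt-oss-20b/120b: hidden == intermediate, so both orientations
# share a shape and the first branch is taken whether or not it is correct.
```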

The toy GPT-OSS conversion test masked this because it manually transposed down_proj into the MXFP4-dequant orientation, making both import and export symmetrically broken on BF16 — the roundtrip "passed" while inference was wrong.

Why prior PRs didn't fully address this

This is the third attempt at this area. Each prior fix was guided by surface-level roundtrip results without forward-pass parity verification of the bridge's internal layout, so each addressed only a slice of the problem and was later partially reverted.

What none of these caught is that Megatron's TE GroupedLinear expects per-expert (hidden, intermediate) — the standard PyTorch nn.Linear convention — not (intermediate, hidden). With BF16 source, the import has to transpose, and the export has to transpose back. With MXFP4 source, dequantization already emits the right layout. A forward-pass cosine-similarity check against HF would have caught any of these regressions; it wasn't run on either prior fix's verification path.
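
A hedged sketch of the kind of forward-pass parity check meant here; the HF model handle and the mg_forward wrapper are placeholders, not the API of examples/conversion/compare_hf_and_megatron/compare.py:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_cos_sim(hf_model, mg_forward, input_ids: torch.Tensor) -> float:
    # Compare flattened logits from the HF reference and the Megatron model.
    hf_logits = hf_model(input_ids).logits.float().flatten()
    mg_logits = mg_forward(input_ids).float().flatten()
    return F.cosine_similarity(hf_logits, mg_logits, dim=0).item()

# A transposed down_proj survives a checkpoint roundtrip untouched but drags
# this number from ~0.9999 down to ~0.54 for gpt-oss-20b.
```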

Fix

  • Import side — in GPTOSSBridge.maybe_modify_loaded_hf_weight, transpose down_proj when loading from a non-quantized BF16 checkpoint. Disambiguation: when the per-expert shape is non-square, shape-vs-config uniquely identifies layout; when square (gpt-oss-20b/120b), default to the transformers.GptOssForCausalLM init layout [E, intermediate, hidden].
  • Export side — in GPTOSSMLPDownProjMapping.megatron_to_hf, transpose the last two dims of each ndim>=2 weight tensor on the way out so the grouped-export stack reassembles in HF's [E, intermediate, hidden] layout. Under EP, gather_from_ep_ranks may have already concatenated per-rank experts into a 3-D (ep_size, hidden, intermediate) tensor, so the transpose runs unconditionally on the trailing two dims rather than only on 2-D inputs. Bias mappings are passed through untouched.

The MXFP4 dequant branch is left as-is (already produces the GEMM-correct layout).
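
A condensed sketch of the two coordinated transposes under assumed [E, x, y] shapes; the real logic lives in GPTOSSBridge.maybe_modify_loaded_hf_weight and GPTOSSMLPDownProjMapping.megatron_to_hf and also covers the quantized source and bias handling described above:

```python
import torch

def import_bf16_down_proj(w: torch.Tensor, hidden: int, intermediate: int) -> torch.Tensor:
    """Align a non-quantized BF16 down_proj to Megatron's [E, hidden, intermediate]."""
    if hidden == intermediate:
        # Square (gpt-oss-20b/120b): shape is ambiguous, so assume the
        # GptOssForCausalLM init layout [E, intermediate, hidden] and transpose.
        return w.transpose(-2, -1).contiguous()
    if tuple(w.shape[-2:]) == (hidden, intermediate):
        return w  # already in the GEMM layout Megatron expects
    return w.transpose(-2, -1).contiguous()  # [E, intermediate, hidden] source

def export_down_proj(w: torch.Tensor) -> torch.Tensor:
    """Return to HF's [E, intermediate, hidden]; also handles the EP-gathered
    3-D tensor because only the trailing two dims are swapped."""
    return w.transpose(-2, -1).contiguous() if w.ndim >= 2 else w
```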

Test changes

  • tests/functional_tests/test_groups/models/gpt_oss/test_gpt_oss_conversion.py is rewritten to build two faithful toys from the same underlying weights:
    • BF16 toy with the unsloth-style [E, hidden, 2*intermediate] / [E, intermediate, hidden] layout.
    • MXFP4 toy with *_blocks/*_scales whose _dequantize_mxfp4 output equals the BF16 toy transposed per expert, matching openai/gpt-oss-20b's shipping layout.
  • Parametrized over source ∈ {bf16, mxfp4} × {PP=2, EP=2}. MXFP4 runs as a two-step convert_checkpoints_multi_gpu.py import followed by hf_megatron_roundtrip_multi_gpu.py --megatron-load-path against the BF16 toy reference, since the verification table cannot resolve down_proj/gate_up_proj keys in a quantized state dict.
  • hidden_size != intermediate_size is intentional so any wrong-direction transpose surfaces as a shape mismatch — the previous toy used the MXFP4-dequant orientation for down_proj, hiding the bug behind a symmetric pass-through.
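
To make the last point concrete, a tiny illustration with made-up sizes (not the toy config's actual values):

```python
import torch

E, hidden, intermediate = 2, 8, 24
hf_bf16_down_proj = torch.randn(E, intermediate, hidden)   # unsloth-style layout
megatron_down_proj = hf_bf16_down_proj.transpose(-2, -1)   # [E, hidden, intermediate]

assert megatron_down_proj.shape == (E, hidden, intermediate)
# Skipping (or double-applying) the transpose here would try to load an
# [E, 24, 8] tensor into an [E, 8, 24] parameter and fail loudly; with
# hidden == intermediate it would load fine and only be wrong numerically.
```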

Verification

Real model on this branch (HF reference: unsloth/gpt-oss-20b-BF16, TP=1):

| Check | Before | After |
| --- | --- | --- |
| BF16 import → forward cos sim vs HF (PP=8) | 0.540 ❌ | 0.999973 ✅ |
| BF16 import → forward cos sim vs HF (EP=8) | 0.540 ❌ | 0.999975 ✅ |
| MXFP4 import → forward cos sim vs HF (PP=8) | 0.999973 ✅ | 0.999973 ✅ |
| MXFP4 import → forward cos sim vs HF (EP=8) | 0.999975 ✅ | 0.999975 ✅ |
| BF16 import → reload → roundtrip vs BF16 HF (PP=8) | 411/411 ✅ | 411/411 ✅ |
| BF16 import → reload → roundtrip vs BF16 HF (EP=8) | 411/411 ✅ | 411/411 ✅ |
| MXFP4 import → reload → roundtrip vs BF16 HF (PP=8) | 24 ❌ (every layer's down_proj) | 411/411 ✅ |
| MXFP4 import → reload → roundtrip vs BF16 HF (EP=8) | 24 ❌ (every layer's down_proj) | 411/411 ✅ |

Toy tests on this branch:

| Source | Parallelism | Status |
| --- | --- | --- |
| BF16 | PP=2 | ✅ |
| BF16 | EP=2 | ✅ |
| MXFP4 | PP=2 | ✅ (two-step) |
| MXFP4 | EP=2 | ✅ (two-step) |

Test plan

  • examples/conversion/compare_hf_and_megatron/compare.py cos sim vs HF for both BF16 and MXFP4 imports under PP=8 and EP=8
  • examples/conversion/hf_megatron_roundtrip_multi_gpu.py with --megatron-load-path from both BF16 and MXFP4 imports compared against unsloth BF16 under PP=8 and EP=8
  • examples/conversion/hf_megatron_roundtrip_multi_gpu.py direct in-memory BF16 roundtrip
  • tests/functional_tests/test_groups/models/gpt_oss/test_gpt_oss_conversion.py — all 4 parametrizations green

Per-expert ``down_proj`` is square for GPT-OSS-20B/120B (hidden ==
intermediate), so the bridge cannot auto-detect orientation from shape
alone. BF16 checkpoints (e.g. unsloth/gpt-oss-20b-BF16, and what
transformers.GptOssForCausalLM produces at init) store it as
[E, intermediate, hidden]; MXFP4-dequantized weights come out as
[E, hidden, intermediate]. Megatron's TE RowParallelGroupedLinear
expects per-expert (hidden, intermediate), so the BF16 path needs a
transpose on import while the MXFP4 path is already aligned.

Without the import transpose, BF16 imports silently store down_proj
in the wrong orientation: roundtrip vs the same BF16 source still
matches (import and export are symmetrically broken), but inference
is broken — forward-pass cosine similarity vs HF drops to ~0.54 for
gpt-oss-20b on a saved/reloaded BF16-imported Megatron checkpoint.

Fix the import side in ``maybe_modify_loaded_hf_weight``, and add a
coordinated per-expert transpose in
``GPTOSSMLPDownProjMapping.megatron_to_hf`` so the grouped-export stack
returns to HF's [E, intermediate, hidden] layout.

Verification on origin/main with TP=1 PP=8 EP=1:
- BF16 import → forward cos sim vs HF: 0.540 → 0.999973
- MXFP4 import → forward cos sim vs HF: 0.999973 → 0.999973
- BF16 import → reload → roundtrip vs BF16 HF: 411/411 ✅
- MXFP4 import → reload → roundtrip vs BF16 HF: 24 ❌ → 411/411 ✅

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

claude Bot commented May 7, 2026

Review

The fix logic looks correct -- the three-way import-side branch (non-square BF16, already-aligned, square) and the coordinated export-side transpose properly handle both BF16 and MXFP4 paths.

The main concern is that the primary fix path (BF16 import transpose) is not exercised by any existing test. The functional test fixture in test_gpt_oss_conversion.py pre-transposes down_proj to (E, hidden, intermediate) (line 66), so the new detection code always takes the no-op branch. The export-side .t() in GPTOSSMLPDownProjMapping.megatron_to_hf is also not unit tested.

The PR description acknowledges this and calls it intentional follow-up work, but given this is a correctness fix for inference, adding at least a unit test for maybe_modify_loaded_hf_weight with a BF16-layout tensor would significantly increase confidence. See inline comments for specifics.

Suggested test cases: No perf tests impacted.
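
For reference, a hedged sketch of roughly the unit test being suggested; the bridge method's real signature is not reproduced here, so a hypothetical stand-in alignment function is used in its place:

```python
import torch

def align_bf16_down_proj(w, hidden, intermediate):
    # Hypothetical stand-in for the bridge's import-side alignment.
    if hidden == intermediate or tuple(w.shape[-2:]) == (intermediate, hidden):
        return w.transpose(-2, -1).contiguous()
    return w

def test_square_bf16_down_proj_is_transposed_on_import():
    E, hidden, intermediate = 2, 4, 4   # square, like gpt-oss-20b/120b
    bf16 = torch.randn(E, intermediate, hidden)
    aligned = align_bf16_down_proj(bf16, hidden, intermediate)
    assert torch.equal(aligned, bf16.transpose(-2, -1))
```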

@cuichenx cuichenx marked this pull request as draft May 7, 2026 23:53
Builds on the BF16 import-side transpose by extending the GPT-OSS
``down_proj`` export to handle the EP-aggregated path, and rewrites the
toy conversion test to faithfully model both real checkpoint layouts.

Bridge change (``gpt_oss_bridge.py``)
- ``GPTOSSMLPDownProjMapping.megatron_to_hf`` now transposes the last two
  dims of any ndim>=2 weight tensor, not only 2-D ones. Under EP the
  parent ``gather_from_ep_ranks`` may concatenate the per-rank experts
  before the per-expert export hook runs, producing a 3-D
  ``(ep_size, hidden, intermediate)`` tensor that the previous 2-D-only
  guard skipped. Bias mappings (``hf_param`` ending in ``_bias``) are
  passed through unchanged so per-expert biases that arrive 2-D under EP
  are not flipped.

Toy test rewrite (``test_gpt_oss_conversion.py``)
- New fixture builds two toys from the same underlying weights:
  * BF16 toy: faithful unsloth-style layout
    (``gate_up_proj`` ``[E, hidden, 2*intermediate]``, ``down_proj``
    ``[E, intermediate, hidden]``).
  * MXFP4 toy: ``*_blocks``/``*_scales`` whose ``_dequantize_mxfp4``
    output equals the BF16 toy transposed per expert, matching the
    ``openai/gpt-oss-20b`` shipping layout.
- Test parametrizes over ``source ∈ {bf16, mxfp4}`` × ``{PP=2, EP=2}``.
  BF16 runs the existing one-shot roundtrip; MXFP4 runs as a two-step
  ``convert_checkpoints_multi_gpu.py import`` then
  ``hf_megatron_roundtrip_multi_gpu.py --megatron-load-path`` against
  the BF16 toy as the reference, since the verification table cannot
  resolve ``down_proj``/``gate_up_proj`` keys in a quantized state
  dict.
- ``hidden_size`` and ``intermediate_size`` are intentionally unequal so
  that any wrong-direction transpose surfaces as a shape mismatch
  (square real-model shapes silently mask layout bugs as wrong values).

Verification on this branch
- All 4 toy parametrizations pass:
  ``bf16-PP``, ``bf16-EP``, ``mxfp4-PP``, ``mxfp4-EP``.
- Real model (``unsloth/gpt-oss-20b-BF16`` HF reference, TP=1):
  * BF16 import → forward cos sim vs HF: PP=8 0.999973, EP=8 0.999975.
  * MXFP4 import → forward cos sim vs HF: PP=8 0.999973, EP=8 0.999975.
  * Reload-roundtrip vs BF16 HF: 411/411 ✅ for all four
    (BF16/MXFP4) × (PP=8/EP=8) combinations.

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review May 8, 2026 00:44
Comment on lines 23 to 36
Contributor

The entire motivation for this PR is that down_proj is square for GPT-OSS-20B/120B (hidden_size == intermediate_size), yet the toy config uses hidden_size=512, intermediate_size=1536 (non-square). This means the elif intermediate == hidden branch in maybe_modify_loaded_hf_weight (the most critical one) is never exercised by these tests.

Consider adding a second toy config with hidden_size == intermediate_size (e.g. hidden_size=512, intermediate_size=512) and parametrizing the test over both configs. That would cover the shape-ambiguous path that actually broke production.

Contributor

claude Bot commented May 8, 2026

Light Code Review -- The fix is well-motivated and the coordinated import+export transpose approach is sound. The PR description is excellent. One concern: the toy config uses hidden_size=512 and intermediate_size=1536 (non-square), so the elif intermediate == hidden branch in maybe_modify_loaded_hf_weight -- the most critical one for GPT-OSS-20B/120B -- is never exercised. A second toy config with hidden_size == intermediate_size would close this gap. No perf tests impacted.

Contributor Author

cuichenx commented May 8, 2026

/claude strict-review

Contributor Author

cuichenx commented May 8, 2026

/ok to test 1b26072

Contributor Author

cuichenx commented May 8, 2026

RL verification also passed

@cuichenx cuichenx added the r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label May 8, 2026
@cuichenx cuichenx added needs-more-tests Requires additional L0 and L1 test coverage before merge and removed r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. labels May 8, 2026
Contributor Author

cuichenx commented May 8, 2026

/ok to test a6b3b63

@cuichenx cuichenx merged commit d35bae9 into main May 8, 2026
129 checks passed
@cuichenx cuichenx deleted the chcui/gpt-oss-down-proj-import-fix branch May 8, 2026 21:48