[feat] Add Qwen3 MoE true-on-policy parity#30

Open
maocheng23 wants to merge 22 commits into miles-main from feat/true_on_policy_qwen_moe

maocheng23 commented May 1, 2026

Summary

Adds the Megatron-LM side of the Qwen3-30B-A3B MoE true-on-policy stack.

This PR is one of three coupled PRs that must land together because they share the qwen3_moe_true_on_policy_v1 contract and were validated as one end-to-end stack.

Companion PRs:

Target

Bit-identical rollout/train logprob parity for Qwen3-30B-A3B MoE under deterministic decode, while keeping the Megatron backward pass differentiable.

SGLang remains the numerical source of truth. During training, Megatron reproduces SGLang's MoE numerics, including the deterministic routing/top-k selection and expert-parallel (EP) reduction behavior that the contract requires.
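As a rough illustration of what deterministic routing/top-k and EP reduction can mean, here is a minimal pure-Python sketch. It assumes softmax gating, top-k selection with ties broken by the lower expert index, renormalized top-k weights, and a combine step that always accumulates expert outputs in ascending expert-index order. All function names are hypothetical; this is not the SGLang or Megatron code.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    # Subtract the max for numerical stability; sum() adds left to right,
    # so the normalizer is computed in a fixed order.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_topk(probs: List[float], k: int) -> List[Tuple[int, float]]:
    # Sort by descending probability and break ties by ascending expert index,
    # so equal scores always select the same experts.
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return [(i, probs[i]) for i in order[:k]]


def route_token(logits: List[float], k: int, renormalize: bool = True) -> List[Tuple[int, float]]:
    topk = deterministic_topk(softmax(logits), k)
    if renormalize:
        # Assumption: the selected top-k weights are renormalized to sum to 1.
        denom = sum(p for _, p in topk)
        topk = [(i, p / denom) for i, p in topk]
    return topk


def combine(expert_outputs: Dict[int, float], weights: List[Tuple[int, float]]) -> float:
    # Accumulate in ascending expert-index order, regardless of the order in
    # which EP ranks delivered their partial results.
    acc = 0.0
    for idx, w in sorted(weights):
        acc += w * expert_outputs[idx]
    return acc


# Floating-point addition is not associative:
# (0.1 + 0.2) + 0.3 == 0.6000000000000001, while 0.1 + (0.2 + 0.3) == 0.6.
print(route_token([1.0, 3.0, 3.0, 0.5], k=2))  # [(1, 0.5), (2, 0.5)]
```

Fixing the accumulation order matters because two ranks that sum the same expert outputs in different orders can disagree in the last bits, which is enough to break bit-identical logprobs.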

Main Changes

  • Adds the Qwen3-MoE true-on-policy extension path at the MoE layer boundary.
  • Reuses SGLang's fused MoE forward for the direct parity path, with local autograd support for training.
  • Updates the SGLang fused MoE imports to match the rebased moe_runner.triton_utils module layout.
  • Keeps the MoE-specific parity path narrow: generic Megatron router/dispatcher fallback edits and debug-only helpers were removed.
  • Preserves dense-stack behavior outside the explicit qwen3_moe_true_on_policy_v1 contract.
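A minimal sketch of the narrow gating described above, assuming the parity path is selected by an exact match on the contract id. Only the contract string comes from this PR; stock_moe_forward, parity_moe_forward and select_moe_forward are hypothetical placeholders.

```python
from typing import Callable, Optional

# Contract id from the PR description; everything else here is hypothetical.
QWEN3_MOE_CONTRACT = "qwen3_moe_true_on_policy_v1"


def stock_moe_forward(x):
    # Placeholder for the unchanged Megatron MoE path.
    return ("stock", x)


def parity_moe_forward(x):
    # Placeholder for the SGLang-matching parity path.
    return ("parity", x)


def select_moe_forward(contract: Optional[str]) -> Callable:
    # Only an exact match on the contract id enables the parity path;
    # any other value (including None) keeps the existing behavior.
    if contract == QWEN3_MOE_CONTRACT:
        return parity_moe_forward
    return stock_moe_forward
```

An exact-match gate means a typo or a future contract version falls back to existing behavior instead of silently enabling a partially matching path.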

Validation

Latest rebased EP4 end-to-end (E2E) run on ion7 (Qwen3-30B-A3B, 8x H200), run name qwen3-moe-top-ep4-rebased-e2e-260522-ion7-r4:

  • Ray job: raysubmit_yc4WK8aCKzP4tx1Q
  • Train: TP=1, EP=4, ETP=1, CP=2
  • Rollout: rollout_num_gpus=8, rollout_num_gpus_per_engine=4, SGLang TP=4, SGLang EP=4
  • Runtime confirmed: recompute_logprobs_via_prefill=False, use_rollout_logprobs=False (train logprobs come from Megatron's own forward pass rather than being reused from rollout)
  • rollout logprob: -2.8848648071289062e-05
  • ref logprob: -2.8848648071289062e-05
  • train logprob: -2.8848648071289062e-05
  • train/train_rollout_logprob_abs_diff = 0.0
  • train/train_rollout_kl = 0.0
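The two parity metrics above can be approximated as follows. This sketch assumes an unmasked per-token mean and the simple k1 KL estimator (mean of rollout logprob minus train logprob); the actual miles metrics may mask or aggregate differently. It also includes a bitwise comparison, which is what "bit-identical" literally means; unlike ==, it distinguishes 0.0 from -0.0.

```python
import struct
from typing import Sequence


def logprob_abs_diff(train: Sequence[float], rollout: Sequence[float]) -> float:
    # Unmasked mean of |train - rollout| over tokens (assumed aggregation).
    assert len(train) == len(rollout) > 0
    return sum(abs(t - r) for t, r in zip(train, rollout)) / len(train)


def approx_kl(train: Sequence[float], rollout: Sequence[float]) -> float:
    # k1 estimator of KL(rollout || train) on tokens sampled from the rollout
    # policy: mean of (rollout logprob - train logprob). Assumed, not confirmed.
    assert len(train) == len(rollout) > 0
    return sum(r - t for t, r in zip(train, rollout)) / len(train)


def bit_identical(a: float, b: float) -> bool:
    # Compare IEEE-754 double bit patterns; unlike ==, this tells 0.0 from -0.0.
    return struct.pack("<d", a) == struct.pack("<d", b)


lp = [-2.8848648071289062e-05, -0.5, -1.25]
print(logprob_abs_diff(lp, lp), approx_kl(lp, lp))  # 0.0 0.0
```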

Local record:
recovery/qwen3_moe_clean/journal/2026-05-22-qwen3-moe-clean-e2e.md

Test Plan

  • Rebased onto current radixark/Megatron-LM:miles-main stack.
  • Local syntax/diff sanity checks for touched Python paths.
  • Remote import/syntax checks in the ion7 container.
  • EP4 E2E exact-zero logprob validation on ion7.
  • CI replay for the three-PR stack.
  • Longer on-policy vs. off-policy throughput comparison after reviewer sign-off on the cleaned deterministic path.
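For the local syntax sanity check, a sketch along these lines compiles each touched Python file in memory without executing it or writing .pyc files. The base ref origin/miles-main and the helper names are assumptions, not part of the PR's tooling.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List, Tuple


def touched_python_files(base: str = "origin/miles-main") -> List[str]:
    # Files changed on this branch since its merge base with `base` (assumed ref).
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [p for p in out.splitlines() if p.endswith(".py") and Path(p).exists()]


def syntax_check(paths: Iterable[str]) -> List[Tuple[str, str]]:
    # Compile each .py file to bytecode in memory; collect (path, message)
    # for every SyntaxError. Non-Python paths are skipped.
    failures = []
    for path in paths:
        if not path.endswith(".py"):
            continue
        source = Path(path).read_text(encoding="utf-8")
        try:
            compile(source, path, "exec")
        except SyntaxError as exc:
            failures.append((path, f"line {exc.lineno}: {exc.msg}"))
    return failures
```

Usage: syntax_check(touched_python_files()) returns an empty list when every touched file parses. This only catches syntax errors; import-time failures still need the remote import checks listed above.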

maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch from c63c77f to 9c390a1 May 5, 2026 01:51
maocheng23 force-pushed the feat/true_on_policy_qwen_dense branch 2 times, most recently from 9546575 to 57258c8 May 18, 2026 06:15
maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch 3 times, most recently from ab103f8 to 9c6c2a9 May 19, 2026 18:44
maocheng23 changed the title from "[feat] Init true on policy with qwen_moe" to "[feat] Add Qwen3 MoE true-on-policy parity" May 21, 2026
maocheng23 and others added 20 commits May 22, 2026 19:02
Co-authored-by: zju-stu-lizheng <lizheng.cs@zju.edu.cn>
Co-authored-by: zyxiyy02 <282300612+zyxiyy02@users.noreply.github.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- forward.py: single canonical sglang_moe_forward (102 lines)
- autograd.py: pure backward wrapper calling shared forward (182 lines)
- moe_experts.py: weight-layout adapter only, no forward logic (36 lines)
- moe_layer_ext.py: one linear orchestration path (301 lines)
- Remove verbose RuntimeError guards, replace with asserts
- Remove weight caching (premature optimization)
- Consolidate two parallel forward paths into one

Net: -574 lines. With a single shared forward, the forward paths can no longer diverge structurally.
Co-authored-by: Cursor <cursoragent@cursor.com>
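The refactor above (one canonical forward in forward.py, a backward-only wrapper in autograd.py) follows a common pattern: the training wrapper's forward simply delegates to the shared forward, so it cannot produce different numbers than the parity path. The dependency-free sketch below mirrors the forward(ctx, ...)/backward(ctx, grad) shape of torch.autograd.Function using toy scalar experts; moe_forward, Ctx and MoEAutogradWrapper are hypothetical names, not the PR's code.

```python
from dataclasses import dataclass
from typing import List, Tuple


def moe_forward(x: float, gates: List[float], weights: List[float]) -> float:
    # Single canonical forward: y = sum_i gate_i * (weight_i * x),
    # accumulated in a fixed expert order.
    acc = 0.0
    for g, w in zip(gates, weights):
        acc += g * (w * x)
    return acc


@dataclass
class Ctx:
    # Stand-in for the autograd context that stores tensors for backward.
    saved: Tuple = ()


class MoEAutogradWrapper:
    """Backward-only wrapper: forward delegates to moe_forward unchanged."""

    @staticmethod
    def forward(ctx: Ctx, x: float, gates: List[float], weights: List[float]) -> float:
        ctx.saved = (x, list(gates), list(weights))
        return moe_forward(x, gates, weights)

    @staticmethod
    def backward(ctx: Ctx, grad_out: float):
        # Analytic gradients of y = sum_i g_i * w_i * x.
        x, gates, weights = ctx.saved
        grad_x = grad_out * sum(g * w for g, w in zip(gates, weights))
        grad_gates = [grad_out * w * x for w in weights]
        grad_weights = [grad_out * g * x for g in gates]
        return grad_x, grad_gates, grad_weights


ctx = Ctx()
y = MoEAutogradWrapper.forward(ctx, 2.0, [0.25, 0.75], [4.0, 8.0])
assert y == moe_forward(2.0, [0.25, 0.75], [4.0, 8.0])  # identical by construction
print(MoEAutogradWrapper.backward(ctx, 1.0))  # (7.0, [8.0, 16.0], [0.5, 1.5])
```

Because only backward is hand-written, any divergence between training and rollout numerics would have to come from the shared forward itself, which is the property the commit message describes.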
maocheng23 force-pushed the feat/true_on_policy_qwen_moe branch from 7ac2a80 to b0679e2 May 23, 2026 02:47
maocheng23 marked this pull request as ready for review May 23, 2026 21:22
maocheng23 changed the base branch from feat/true_on_policy_qwen_dense to miles-main May 23, 2026 22:20