[feat] Add Qwen3 MoE true-on-policy parity #1059
Code Review
This pull request introduces support for Qwen3 MoE true-on-policy training, adding specialized parallel layouts, kernel policies, and model profiles. Key features include weight and gradient auditing for debugging, optimized weight synchronization for expert parallelism, and improved prefill logprob recomputation using full-sequence scoring and batching. Feedback is provided regarding a logic error in the weight auditing sampling function where small tensors result in tripled statistics due to overlapping slices during concatenation.
sample_size = min(4096, numel)
midpoint = max(0, (numel - sample_size) // 2)
sample = torch.cat(
    [
        flat[:sample_size],
        flat[midpoint : midpoint + sample_size],
        flat[-sample_size:],
    ]
)
The current sampling logic for weight auditing can lead to incorrect statistics for small tensors. When numel <= 4096, sample_size becomes numel, and the torch.cat operation results in sample containing three copies of the original flat tensor. This will cause sample_sum to be three times the actual sum, skewing the audit results. To fix this, handle small tensors as a special case and retrieve the sample size from the model configuration instead of hardcoding it.
sample_size = config.audit_sample_size
if numel <= sample_size:
    sample = flat
else:
    midpoint = max(0, (numel - sample_size) // 2)
    sample = torch.cat(
        [
            flat[:sample_size],
            flat[midpoint : midpoint + sample_size],
            flat[-sample_size:],
        ]
    )
References
- Model parameters, such as index_topk, should be retrieved from the model configuration rather than being hardcoded.
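The tripling described in the review comment can be reproduced without a GPU. The sketch below is illustrative only: it uses plain Python lists in place of a flattened tensor (1-D slicing behaves the same way for these cases), and the function names `sample_three_windows` and `sample_guarded` are hypothetical, not names from this PR.

```python
def sample_three_windows(flat, sample_size=4096):
    """Head/middle/tail sampling as in the original snippet (no small-tensor guard)."""
    numel = len(flat)
    sample_size = min(sample_size, numel)
    midpoint = max(0, (numel - sample_size) // 2)
    return (
        flat[:sample_size]
        + flat[midpoint : midpoint + sample_size]
        + flat[-sample_size:]
    )


def sample_guarded(flat, sample_size=4096):
    """Same sampling, but an input that fits in one window is used as-is."""
    numel = len(flat)
    if numel <= sample_size:
        return list(flat)
    midpoint = (numel - sample_size) // 2
    return (
        flat[:sample_size]
        + flat[midpoint : midpoint + sample_size]
        + flat[-sample_size:]
    )


small = [float(i) for i in range(1, 11)]  # 10 elements, true sum is 55.0
buggy_sum = sum(sample_three_windows(small))
fixed_sum = sum(sample_guarded(small))
print(buggy_sum, fixed_sum)  # prints "165.0 55.0"
```

Note that the guard only covers numel <= sample_size. For sizes between sample_size and 3 * sample_size the three windows still overlap, so some elements are counted more than once; whether that matters depends on how the audit statistics are compared.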
Co-authored-by: zju-stu-lizheng <lizheng.cs@zju.edu.cn> Co-authored-by: zyxiyy02 <282300612+zyxiyy02@users.noreply.github.com> Co-authored-by: Yi Zhang <1109276519@qq.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds the Miles side of the Qwen3-30B-A3B MoE true-on-policy stack on top of radixark/miles:main.
This PR is one of three coupled PRs that must land together because they share the qwen3_moe_true_on_policy_v1 contract and were validated as one end-to-end stack.
Companion PRs:
Target
Bit-identical rollout/train logprob parity for Qwen3-30B-A3B MoE under deterministic decode.
SGLang remains the numerical source of truth. Miles owns the launch-plan contract, topology validation, and environment/argument plumbing that keeps SGLang rollout and Megatron training on the same policy.
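As a rough illustration of what bit-identical parity means here, the sketch below compares two lists of per-token logprobs. It is not the Miles metric implementation: the function name `parity_report`, the returned keys, and the simple mean-difference KL estimate are all assumptions made for this example.

```python
import struct


def float_bits(x):
    """Raw IEEE-754 double bytes, so equality means bit-for-bit identical."""
    return struct.pack("<d", x)


def parity_report(rollout_logprobs, train_logprobs):
    """Compare per-token logprobs from rollout and training (hypothetical helper)."""
    if len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("logprob sequences must have the same length")
    pairs = list(zip(rollout_logprobs, train_logprobs))
    n = len(pairs)
    abs_diffs = [abs(r - t) for r, t in pairs]
    return {
        "bit_identical": all(float_bits(r) == float_bits(t) for r, t in pairs),
        "max_abs_diff": max(abs_diffs, default=0.0),
        "mean_abs_diff": sum(abs_diffs) / n if n else 0.0,
        # Assumed estimator: mean of (rollout - train) over sampled tokens.
        "kl_estimate": sum(r - t for r, t in pairs) / n if n else 0.0,
    }


same = parity_report([-0.25, -1.5, -3.0], [-0.25, -1.5, -3.0])
drift = parity_report([-0.25, -1.5], [-0.25, -1.5000001])
print(same["bit_identical"], same["max_abs_diff"], same["kl_estimate"])
print(drift["bit_identical"], drift["max_abs_diff"] > 0.0)
```

Under this reading, the target corresponds to `bit_identical` being true for every token, which also forces the absolute-difference and KL-style metrics to exactly 0.0.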
Main Changes
- qwen3_moe_true_on_policy_v1 contract and Qwen3-MoE launch profile.
- ParallelState.cp.rank/size shape.
Validation
Latest rebased EP4 E2E on ion7, Qwen3-30B-A3B, 8x H200, run qwen3-moe-top-ep4-rebased-e2e-260522-ion7-r4:
- raysubmit_yc4WK8aCKzP4tx1Q
- TP=1, EP=4, ETP=1, CP=2
- rollout_num_gpus=8, rollout_num_gpus_per_engine=4, SGLang TP=4, SGLang EP=4
- recompute_logprobs_via_prefill=False, use_rollout_logprobs=False
- -2.8848648071289062e-05
- -2.8848648071289062e-05
- -2.8848648071289062e-05
- train/train_rollout_logprob_abs_diff = 0.0
- train/train_rollout_kl = 0.0
Local record: recovery/qwen3_moe_clean/journal/2026-05-22-qwen3-moe-clean-e2e.md
Test Plan