[training] feat: port mcore PR #4687 MoE memory estimator fix #3744
Draft
cuichenx wants to merge 1 commit into
Conversation
Replaces the simple TP-only formula in `compute_weight_and_optimizer_memory` with the upstream MoE-aware version from NVIDIA/Megatron-LM PR #4687, mapped to Bridge's `ConfigContainer` interface.

The previous formula divided routed-expert parameters by `tensor_model_parallel_size` only, which under-counts memory whenever `expert_tensor_parallel_size * expert_model_parallel_size != tensor_model_parallel_size`. The new logic splits parameters into three buckets:

- TP-sharded (attention + dense MLP + shared experts) — divided by TP
- Replicated (layernorms + router + shared expert gate) — counted once per rank
- Routed experts — divided by ETP * EP

Distributed-optimizer state for routed experts is sized by the expert data-parallel domain (world_size / (ETP * EP * PP)) instead of the regular DP.

Adds support for MoE layer patterns (int + list moe_layer_freq), shared experts with optional gate, multi-latent attention (DeepSeek-style), and Multi-Token Prediction (MTP) blocks — all of which the prior Bridge formula collapsed into the dense-layer count.

This is a draft tracking the upstream PR; land after #4687 merges so the port matches the final upstream form.

Refs: NVIDIA/Megatron-LM#4687, NVIDIA/Megatron-LM#4050

Signed-off-by: Chen Cui <chcui@nvidia.com>
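To make the under-count concrete, here is a back-of-the-envelope comparison; the sizes and parallelism settings are hypothetical, not taken from the PR:

```python
# Hypothetical routed-expert size and parallelism settings, chosen only to
# illustrate the direction of the error described above.
routed_expert_params = 8e9   # total parameters across all routed experts
tp, etp, ep = 8, 1, 2        # tensor / expert-tensor / expert parallel sizes

old_per_rank = routed_expert_params / tp          # old Bridge formula: divide by TP only
new_per_rank = routed_expert_params / (etp * ep)  # ported formula: divide by ETP * EP

print(old_per_rank / 1e9, new_per_rank / 1e9)     # 1.0 vs 4.0 -> 4x under-count per rank
```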
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
cuichenx
commented
May 8, 2026
```diff
@@ -1,4 +1,4 @@
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
Summary
Ports the MoE-aware theoretical memory estimator from upstream NVIDIA/Megatron-LM#4687 (which fixes NVIDIA/Megatron-LM#4050) into Bridge's local `theoretical_memory_utils.py`. Bridge maintains its own copy of this estimator (it takes a `ConfigContainer`, not Megatron's argparse `args`), so the fix doesn't propagate automatically when the submodule is bumped — it has to be downstreamed by hand.

Why
The previous Bridge formula divided all transformer-layer parameters — including routed experts — by `tensor_model_parallel_size`. Routed experts are actually sharded by `expert_tensor_parallel_size * expert_model_parallel_size`, with distributed-optimizer state sized by the expert data-parallel domain. The old formula silently under-reports MoE memory whenever `ETP * EP != TP`, which is the common case for Mixtral / DeepSeek / Qwen3-MoE / GPT-OSS recipes.

The new formula splits parameters into three buckets and applies the correct divisor to each (see the sketch below):

- TP-sharded (attention + dense MLP + shared experts) — divided by TP
- Replicated (layernorms + router + shared expert gate) — counted once per rank
- Routed experts — divided by ETP * EP
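A minimal sketch of the bucketed arithmetic, with illustrative names rather than the actual Bridge implementation:

```python
def per_rank_weight_and_optimizer_bytes(
    tp_sharded_params: float,   # attention + dense MLP + shared experts
    replicated_params: float,   # layernorms + router + shared-expert gate
    routed_expert_params: float,
    tp: int,
    etp: int,
    ep: int,
    bytes_per_param: float = 18.0,  # assumed mixed-precision figure; the real
                                    # estimator derives this from optimizer settings
) -> float:
    """Illustrative per-rank estimate: each bucket gets its own divisor."""
    per_rank_params = (
        tp_sharded_params / tp               # sharded by tensor parallelism
        + replicated_params                  # counted once on every rank
        + routed_expert_params / (etp * ep)  # sharded by expert TP x EP
    )
    return per_rank_params * bytes_per_param
```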
Distributed-optimizer state for routed experts uses the expert DP domain (`world_size / (ETP × EP × PP)`).
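A sketch of how the routed-expert optimizer bytes could be sized per parameter, assuming the same 6 + 12/DP byte split the upstream estimator uses for dense parameters under the distributed optimizer (illustrative, not the exact Bridge code):

```python
def routed_expert_bytes_per_param(world_size: int, etp: int, ep: int, pp: int) -> float:
    # Expert data-parallel size: the domain over which routed-expert optimizer
    # state is sharded, as described above.
    expert_dp = world_size // (etp * ep * pp)
    # Assumption: 6 bytes stay resident on every rank (bf16 weight + grad), while
    # the 12 bytes of fp32 main param + Adam moments are sharded across expert DP.
    return 6 + 12 / expert_dp
```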
Also picks up several upstream features the prior Bridge version was missing:

- MoE layer patterns (`int` and `list` form of `moe_layer_freq`; expansion sketched below)
- Shared experts with an optional gate (`moe_shared_expert_gate`)
- Multi-latent attention (`q_lora_rank` / `kv_lora_rank` / split `qk_head_dim` + RoPE term)
- Multi-Token Prediction (MTP) blocks

The estimator is informational only — it does not affect training correctness or actual memory allocation. Users sizing jobs from the printed memory estimate will get accurate numbers for MoE recipes after this lands.
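For the `moe_layer_freq` item above, a sketch of the assumed expansion into a per-layer MoE mask (the int form follows mcore's every-Nth-layer convention; treat the details as illustrative):

```python
def expand_moe_layer_freq(moe_layer_freq, num_layers: int) -> list[int]:
    """Return a 0/1 mask marking which layers are MoE layers."""
    if isinstance(moe_layer_freq, int):
        # int form: every moe_layer_freq-th layer is an MoE layer
        return [1 if i % moe_layer_freq == 0 else 0 for i in range(num_layers)]
    if isinstance(moe_layer_freq, list):
        # list form: explicit per-layer pattern, e.g. [0, 1, 1, 1, ...]
        assert len(moe_layer_freq) == num_layers
        return list(moe_layer_freq)
    raise TypeError(f"moe_layer_freq must be int or list, got {type(moe_layer_freq)}")
```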
Why draft
Upstream PR #4687 is open, not merged. The implementation can still change in review. This PR holds the port until #4687 merges, then we'll rebase to match the final upstream form before un-drafting.
Test plan
- New unit tests in `tests/unit_tests/training/utils/test_theoretical_memory_utils.py`:
  - `moe_layer_freq` is honored
  - an invalid `moe_layer_freq` type raises `TypeError`
- `cicd-unit-tests-core` picks up the new module
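Rough shape of the new tests; the import path, `moe_config` fixture, and config field access are assumptions, not the exact contents of the test file:

```python
import pytest

# Assumed module path for Bridge's local copy of the estimator.
from megatron.bridge.training.utils.theoretical_memory_utils import (
    compute_weight_and_optimizer_memory,
)


def test_moe_layer_freq_is_honored(moe_config):  # moe_config: hypothetical fixture
    n = moe_config.model.num_layers
    moe_config.model.moe_layer_freq = 1                   # every layer is MoE
    all_moe = compute_weight_and_optimizer_memory(moe_config)
    moe_config.model.moe_layer_freq = [1, 0] * (n // 2)   # only half the layers are MoE
    half_moe = compute_weight_and_optimizer_memory(moe_config)
    assert all_moe > half_moe  # the layer pattern must change the estimate


def test_invalid_moe_layer_freq_type_raises(moe_config):
    moe_config.model.moe_layer_freq = "every_other"  # neither int nor list
    with pytest.raises(TypeError):
        compute_weight_and_optimizer_memory(moe_config)
```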
Checklist

- Public entry point unchanged: `compute_weight_and_optimizer_memory(config: ConfigContainer, verbose: bool = False) -> float`
- `ConfigContainer` already exposes every needed field (verified against mcore `TransformerConfig` + `ModelParallelConfig`)

Refs

- NVIDIA/Megatron-LM#4687
- NVIDIA/Megatron-LM#4050
🤖 Generated with Claude Code