MoE Support for llama3 Using TransformerEngine #1443
kjaniknvidia wants to merge 3 commits into NVIDIA:main from
Conversation
- Provide a TE-based MoE feed-forward path with top-k routing, capacity control, and aux-loss integration, and expose MoE knobs in the recipe configs (illustrative sketch after this list).
- Track aux loss and routing-balance stats in the model outputs and perf logger so MoE health is visible in logs and W&B.
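As a rough illustration of the kind of knobs such a recipe config might expose (the parameter names below are assumptions for the sketch, not the actual fields added by this PR):

```python
# Hypothetical MoE settings a recipe config could surface; the names are
# illustrative only and not taken from this PR.
moe_config = {
    "num_moe_experts": 8,         # experts per MoE layer
    "moe_router_topk": 2,         # experts activated per token
    "moe_capacity_factor": 1.25,  # per-expert capacity ~ factor * tokens * top_k / num_experts
    "moe_aux_loss_coeff": 1e-2,   # weight on the load-balancing auxiliary loss
}
```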
📝 Walkthrough

These changes implement Mixture-of-Experts (MoE) support in the Llama 3 model using Transformer Engine. New MoE routing layers, auxiliary loss tracking, and configuration parameters are introduced. The training pipeline is updated to capture and log MoE-specific metrics, including load entropy, capacity utilization, and dropped-token statistics.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Input as Input Tokens
    participant TE as MoE Transformer<br/>Layer
    participant Router as Expert Router
    participant Experts as Expert<br/>Feed-Forwards
    participant AuxLoss as Auxiliary<br/>Loss Tracker
    participant Output as Layer Output

    Input->>TE: hidden_states
    TE->>Router: Route tokens to experts
    Router->>Router: Calculate capacity<br/>& select top-k experts
    Router->>Experts: Dispatch tokens
    Experts->>Experts: Process via selected<br/>experts
    Experts->>AuxLoss: Load balancing metrics
    AuxLoss->>AuxLoss: Track load_entropy,<br/>load_max, dropped_tokens
    AuxLoss->>Output: Aggregate auxiliary loss
    Experts->>Output: Updated hidden_states
    Output->>Output: Combine expert outputs
    Output-->>Input: (hidden_states, aux_loss,<br/>load_metrics)
```
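The sketch below mirrors the diagram's routing path in plain PyTorch: top-k selection, a per-expert capacity limit, the load statistics (load_entropy, load_max, dropped_tokens), and a Switch-Transformer-style auxiliary loss. It illustrates the general technique under those assumptions rather than the PR's actual code; all function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts, top_k=2, capacity_factor=1.25):
    """Illustrative top-k routing with capacity control and load statistics."""
    tokens = hidden.reshape(-1, hidden.shape[-1])            # [T, d]
    logits = tokens @ router_weight.t()                      # [T, E] router scores
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)         # [T, k]

    # Per-expert capacity: how many token slots each expert may accept.
    num_tokens = tokens.shape[0]
    capacity = int(capacity_factor * num_tokens * top_k / num_experts)

    # Expert load: fraction of routed assignments landing on each expert.
    assignments = F.one_hot(topk_idx, num_experts).sum(dim=1).float()  # [T, E]
    load = assignments.sum(dim=0) / assignments.sum()

    # Assignments beyond an expert's capacity are counted as dropped tokens.
    position_in_expert = assignments.cumsum(dim=0) * assignments
    dropped_tokens = (position_in_expert > capacity).sum()

    # Health metrics the training loop can log.
    load_entropy = -(load * (load + 1e-9).log()).sum()
    load_max = load.max()

    # Switch-Transformer-style balancing loss: mean router prob * load per expert.
    aux_loss = num_experts * (probs.mean(dim=0) * load).sum()

    stats = {"load_entropy": load_entropy, "load_max": load_max,
             "dropped_tokens": dropped_tokens}
    return topk_idx, topk_probs, aux_loss, stats
```

In a trainer, aux_loss would typically be scaled by a coefficient and added to the LM loss, while stats is what a perf logger or W&B run would pick up.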
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
- Added NVLlamaMoETransformerLayer, which can replace the stock TransformerLayer.
- It keeps TE.Attention but replaces TE.MLP with NVLlamaMoEFeedForward.
- NVLlamaMoEFeedForward uses one generic nn.Linear for the router and TE.GroupedLinear for the expert FC1/FC2 layers (see the sketch below).
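A minimal sketch of that shape, assuming TE's GroupedLinear forward accepts the packed token matrix plus a list of per-expert row counts (m_splits); apart from the nn.Linear and te.GroupedLinear modules, all names are hypothetical and this is not the PR's NVLlamaMoEFeedForward:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformer_engine.pytorch as te

class MoEFeedForwardSketch(nn.Module):
    """Illustrative MoE FFN: nn.Linear router + te.GroupedLinear expert GEMMs."""

    def __init__(self, hidden_size, ffn_hidden_size, num_experts, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # One grouped GEMM covers all experts' FC1 (and FC2) weights.
        self.fc1 = te.GroupedLinear(num_experts, hidden_size, ffn_hidden_size, bias=False)
        self.fc2 = te.GroupedLinear(num_experts, ffn_hidden_size, hidden_size, bias=False)

    def forward(self, hidden):
        tokens = hidden.reshape(-1, hidden.shape[-1])             # [T, d]
        probs = F.softmax(self.router(tokens), dim=-1)            # [T, E]
        gate, expert_idx = probs.topk(self.top_k, dim=-1)         # [T, k]

        # Flatten (token, expert) pairs and sort by expert so each expert's
        # tokens are contiguous, as a grouped GEMM expects.
        flat_expert = expert_idx.reshape(-1)                      # [T*k]
        order = flat_expert.argsort()
        dispatched = tokens.repeat_interleave(self.top_k, dim=0)[order]
        m_splits = torch.bincount(flat_expert, minlength=self.num_experts).tolist()

        # Per-expert FC1 -> activation -> FC2 via grouped GEMMs.
        out = self.fc2(F.silu(self.fc1(dispatched, m_splits)), m_splits)

        # Un-sort and combine the top-k expert outputs with their gate weights.
        combined = torch.empty_like(out)
        combined[order] = out
        combined = (combined.reshape(-1, self.top_k, tokens.shape[-1])
                    * gate.unsqueeze(-1)).sum(dim=1)
        return combined.reshape_as(hidden)
```

Sorting the dispatched tokens by expert keeps each expert's rows contiguous so the grouped GEMM can consume them in a single call; the gate probabilities are applied when the top-k expert outputs are recombined.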
Summary by CodeRabbit
New Features
Improvements