
MoE Support for llama3 Using TransformerEngine #1443

Open

kjaniknvidia wants to merge 3 commits into NVIDIA:main from kjaniknvidia:feat/llama3-te-moe

Conversation

@kjaniknvidia (Collaborator) commented Feb 2, 2026

Added NVLlamaMoETransformerLayer, which can be used as a drop-in replacement for the TE TransformerLayer.

It keeps the TE attention block and replaces the TE MLP with NVLlamaMoEFeedForward.

NVLlamaMoEFeedForward uses a generic nn.Linear for the router and TE.GroupedLinear for the expert FC1/FC2 layers (a structural sketch follows below).
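For readers skimming the diff, here is a minimal structural sketch of that wiring: a plain `nn.Linear` router feeding `transformer_engine.pytorch.GroupedLinear` expert FC1/FC2 layers. The class name `MoEFeedForwardSketch`, the simplified activation, and the dispatch/combine details are illustrative assumptions, not the actual `NVLlamaMoEFeedForward` implementation; it assumes a `GroupedLinear(num_gemms, in_features, out_features)` constructor whose forward takes packed tokens plus per-expert row counts.

```python
# Hedged sketch of the described structure: nn.Linear router + TE GroupedLinear experts.
# Not the PR's actual code; capacity limits, token dropping, and aux loss are omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformer_engine.pytorch as te


class MoEFeedForwardSketch(nn.Module):  # hypothetical name; the PR's class is NVLlamaMoEFeedForward
    def __init__(self, hidden_size, ffn_hidden_size, num_experts, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)  # generic router
        self.fc1 = te.GroupedLinear(num_experts, hidden_size, ffn_hidden_size)
        self.fc2 = te.GroupedLinear(num_experts, ffn_hidden_size, hidden_size)

    def forward(self, hidden_states):
        # hidden_states: [num_tokens, hidden], already flattened over batch and sequence.
        probs = F.softmax(self.router(hidden_states), dim=-1)
        gate, expert_idx = probs.topk(self.top_k, dim=-1)              # [tokens, top_k]
        flat_idx = expert_idx.flatten()                                # one row per (token, choice)
        order = torch.argsort(flat_idx)                                # group rows by expert id
        tokens = hidden_states.repeat_interleave(self.top_k, dim=0)[order]
        splits = torch.bincount(flat_idx, minlength=self.num_experts).tolist()
        out = self.fc2(F.silu(self.fc1(tokens, splits)), splits)       # grouped per-expert GEMMs
        out = out[torch.argsort(order)] * gate.flatten().unsqueeze(-1) # undo sort, apply gate weights
        return out.view(-1, self.top_k, out.size(-1)).sum(dim=1)      # combine top-k expert outputs
```

The real layer additionally handles capacity limits, token dropping, and the auxiliary loss described below.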

Summary by CodeRabbit

  • New Features

    • Introduced Mixture-of-Experts (MoE) architecture to the Llama 3 model with configurable expert routing and capacity controls.
  • Improvements

    • Added MoE metrics tracking during training, including auxiliary loss and load distribution monitoring.
    • Extended training configuration to support MoE-specific parameters.

Provide a TE-based MoE feed-forward path with top-k routing, capacity control,
and aux loss integration, plus expose MoE knobs in recipe configs.
Track aux loss and routing balance stats in the model outputs and perf logger
so MoE health is visible in logs and W&B.
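The description does not spell out the capacity or aux-loss formulas, so the snippet below is a hedged sketch of one common formulation (Switch-Transformer-style load balancing, with per-expert capacity derived from `moe_capacity_factor` and `moe_min_capacity`); the PR's actual math may differ.

```python
# Hedged sketch of routing-side bookkeeping: per-expert capacity and a standard
# load-balancing auxiliary loss. Config names appear only in comments; the exact
# formulas used by the PR are assumptions here.
import math
import torch
import torch.nn.functional as F


def expert_capacity(num_tokens: int, num_experts: int, top_k: int,
                    capacity_factor: float, min_capacity: int) -> int:
    # Each expert accepts at most this many tokens per step; overflow tokens may be
    # dropped when moe_drop_tokens is enabled.
    return max(min_capacity, math.ceil(num_tokens * top_k / num_experts * capacity_factor))


def load_balancing_aux_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    # router_logits: [tokens, num_experts]
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    top_idx = probs.topk(top_k, dim=-1).indices
    mask = F.one_hot(top_idx, num_experts).float().sum(dim=1)   # [tokens, experts] 0/1 assignment
    tokens_per_expert = mask.mean(dim=0)                        # fraction of tokens per expert
    router_prob_per_expert = probs.mean(dim=0)                  # mean router probability per expert
    # Encourages a uniform distribution over experts; scaled by moe_aux_loss_coef
    # before being added to the language-model loss.
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)
```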
@copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented Feb 2, 2026

📝 Walkthrough

These changes implement Mixture-of-Experts (MoE) support in the Llama 3 model using Transformer Engine. New MoE routing layers, auxiliary loss tracking, and configuration parameters are introduced. The training pipeline is updated to capture and log MoE-specific metrics including load entropy, capacity utilization, and dropped token statistics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **MoE Architecture**<br>`bionemo-recipes/models/llama3/modeling_llama_te.py` | Introduces `NVLlamaMoEFeedForward` and `NVLlamaMoETransformerLayer` classes for expert routing, token capacity management, and auxiliary loss computation. Extends `NVLlamaConfig` with MoE hyperparameters (`use_moe`, `moe_num_experts`, `moe_top_k`, `moe_capacity_factor`, `moe_min_capacity`, `moe_drop_tokens`, `moe_aux_loss_coef`). Updates `NVLlamaModel` to conditionally instantiate MoE layers. Propagates MoE auxiliary metrics through forward outputs and into `NVLlamaForCausalLM` loss computation. Adds a `_te_device()` helper for device routing. Updates `_no_split_modules` to include the new MoE layer class. |
| **Configuration**<br>`bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml`, `bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` | Exposes MoE configuration options in Hydra defaults, gated on the `use_te` and `use_moe` flags. Initializes MoE attributes on `NVLlamaConfig` from command-line arguments during training setup. |
| **Metrics & Logging**<br>`bionemo-recipes/recipes/llama3_native_te/perf_logger.py` | Adds MoE metrics collection (`moe_aux_loss`, `moe_load_entropy`, `moe_load_max`, `moe_dropped_tokens`, `moe_capacity`) to the logging workflow when `use_moe` is enabled. Implements running counters for MoE statistics aggregation and per-step emission alongside existing metrics (a hedged sketch of these stats follows the table). |
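As referenced in the Metrics & Logging row above, here is a hedged sketch of how those per-step stats could be derived from per-expert token counts. The metric keys match the PR, but the function name and computations are illustrative assumptions rather than the actual `perf_logger.py` code.

```python
# Hedged sketch: derive the logged MoE stats from the router's per-expert token counts.
import torch


def moe_routing_stats(tokens_per_expert: torch.Tensor, capacity: int) -> dict:
    # tokens_per_expert: [num_experts] count of tokens routed to each expert this step.
    total = tokens_per_expert.sum().clamp(min=1)
    load = tokens_per_expert.float() / total                      # load fraction per expert
    entropy = -(load * torch.log(load.clamp(min=1e-9))).sum()     # high entropy = balanced routing
    dropped = (tokens_per_expert - capacity).clamp(min=0).sum()   # tokens over capacity
    return {
        "moe_load_entropy": entropy.item(),
        "moe_load_max": load.max().item(),        # worst-case expert share
        "moe_dropped_tokens": int(dropped.item()),
        "moe_capacity": capacity,
    }


# Example: 8 experts, uneven routing, capacity of 40 tokens per expert.
stats = moe_routing_stats(torch.tensor([55, 30, 28, 35, 41, 20, 25, 22]), capacity=40)
```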

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Tokens
    participant TE as MoE Transformer<br/>Layer
    participant Router as Expert Router
    participant Experts as Expert<br/>Feed-Forwards
    participant AuxLoss as Auxiliary<br/>Loss Tracker
    participant Output as Layer Output

    Input->>TE: hidden_states
    TE->>Router: Route tokens to experts
    Router->>Router: Calculate capacity<br/>& select top-k experts
    Router->>Experts: Dispatch tokens
    Experts->>Experts: Process via selected<br/>experts
    Experts->>AuxLoss: Load balancing metrics
    AuxLoss->>AuxLoss: Track load_entropy,<br/>load_max, dropped_tokens
    AuxLoss->>Output: Aggregate auxiliary loss
    Experts->>Output: Updated hidden_states
    Output->>Output: Combine expert outputs
    Output-->>Input: (hidden_states, aux_loss,<br/>load_metrics)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰✨ Behold, dear coders brave and true,
MoE routing paths now split in two!
With experts wise and tokens load-balanced,
Our Llama hops toward distributed prance!
🎯 Auxiliary loss guides the way,
As rabbit engineers dance and sway! 🥕

🚥 Pre-merge checks (✅ 1 passed | ❌ 2 warnings)

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description lacks required template sections such as Description details, Usage example, Type of changes, and the pre-submit checklist items expected by the repository template. | Complete the PR description using the provided template: add a detailed Description section, a Usage code example, mark the Type of changes, and confirm the pre-submit checklist items. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.63%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title "MoE Support for llama3 Using TransformerEngine" accurately describes the main change: adding mixture-of-experts support to the llama3 model via TransformerEngine integration. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@pstjohn (Collaborator) commented Feb 2, 2026

@coderabbitai review

@coderabbitai bot (Contributor) commented Feb 2, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
