
MoE Support for llama3 Using TransformerEngine #1443

Open

kjaniknvidia wants to merge 3 commits into NVIDIA:main from kjaniknvidia:feat/llama3-te-moe

Conversation

@kjaniknvidia (Collaborator) commented Feb 2, 2026

Added NVLlamaMoETransformerLayer, which can be used as a drop-in replacement for the TE TransformerLayer.

It keeps the TE attention block and replaces the TE MLP with NVLlamaMoEFeedForward.

NVLlamaMoEFeedForward uses a generic nn.Linear for the router and TE.GroupedLinear for the expert FC1/FC2 layers (a structural sketch follows below).
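For readers skimming the diff, here is a minimal structural sketch of that wiring: a plain `nn.Linear` router feeding `transformer_engine.pytorch.GroupedLinear` expert FC1/FC2 layers. The class name `MoEFeedForwardSketch`, the simplified activation, and the dispatch/combine details are illustrative assumptions, not the actual `NVLlamaMoEFeedForward` implementation; it assumes a `GroupedLinear(num_gemms, in_features, out_features)` constructor whose forward takes packed tokens plus per-expert row counts.

```python
# Hedged sketch of the described structure: nn.Linear router + TE GroupedLinear experts.
# Not the PR's actual code; capacity limits, token dropping, and aux loss are omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformer_engine.pytorch as te


class MoEFeedForwardSketch(nn.Module):  # hypothetical name; the PR's class is NVLlamaMoEFeedForward
    def __init__(self, hidden_size, ffn_hidden_size, num_experts, top_k=2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)  # generic router
        self.fc1 = te.GroupedLinear(num_experts, hidden_size, ffn_hidden_size)
        self.fc2 = te.GroupedLinear(num_experts, ffn_hidden_size, hidden_size)

    def forward(self, hidden_states):
        # hidden_states: [num_tokens, hidden], already flattened over batch and sequence.
        probs = F.softmax(self.router(hidden_states), dim=-1)
        gate, expert_idx = probs.topk(self.top_k, dim=-1)              # [tokens, top_k]
        flat_idx = expert_idx.flatten()                                # one row per (token, choice)
        order = torch.argsort(flat_idx)                                # group rows by expert id
        tokens = hidden_states.repeat_interleave(self.top_k, dim=0)[order]
        splits = torch.bincount(flat_idx, minlength=self.num_experts).tolist()
        out = self.fc2(F.silu(self.fc1(tokens, splits)), splits)       # grouped per-expert GEMMs
        out = out[torch.argsort(order)] * gate.flatten().unsqueeze(-1) # undo sort, apply gate weights
        return out.view(-1, self.top_k, out.size(-1)).sum(dim=1)      # combine top-k expert outputs
```

The real layer additionally handles capacity limits, token dropping, and the auxiliary loss described below.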

Summary by CodeRabbit

  • New Features

    • Introduced Mixture-of-Experts (MoE) architecture to the Llama 3 model with configurable expert routing and capacity controls.
  • Improvements

    • Added MoE metrics tracking during training, including auxiliary loss and load distribution monitoring.
    • Extended training configuration to support MoE-specific parameters.

Provide a TE-based MoE feed-forward path with top-k routing, capacity control,
and aux loss integration, plus expose MoE knobs in recipe configs.
Track aux loss and routing balance stats in the model outputs and perf logger
so MoE health is visible in logs and W&B.
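The description does not spell out the capacity or aux-loss formulas, so the snippet below is a hedged sketch of one common formulation (Switch-Transformer-style load balancing, with per-expert capacity derived from `moe_capacity_factor` and `moe_min_capacity`); the PR's actual math may differ.

```python
# Hedged sketch of routing-side bookkeeping: per-expert capacity and a standard
# load-balancing auxiliary loss. Config names appear only in comments; the exact
# formulas used by the PR are assumptions here.
import math
import torch
import torch.nn.functional as F


def expert_capacity(num_tokens: int, num_experts: int, top_k: int,
                    capacity_factor: float, min_capacity: int) -> int:
    # Each expert accepts at most this many tokens per step; overflow tokens may be
    # dropped when moe_drop_tokens is enabled.
    return max(min_capacity, math.ceil(num_tokens * top_k / num_experts * capacity_factor))


def load_balancing_aux_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    # router_logits: [tokens, num_experts]
    num_experts = router_logits.size(-1)
    probs = F.softmax(router_logits, dim=-1)
    top_idx = probs.topk(top_k, dim=-1).indices
    mask = F.one_hot(top_idx, num_experts).float().sum(dim=1)   # [tokens, experts] 0/1 assignment
    tokens_per_expert = mask.mean(dim=0)                        # fraction of tokens per expert
    router_prob_per_expert = probs.mean(dim=0)                  # mean router probability per expert
    # Encourages a uniform distribution over experts; scaled by moe_aux_loss_coef
    # before being added to the language-model loss.
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)
```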
@copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai bot (Contributor) commented Feb 2, 2026

📝 Walkthrough

These changes implement Mixture-of-Experts (MoE) support in the Llama 3 model using Transformer Engine. New MoE routing layers, auxiliary loss tracking, and configuration parameters are introduced. The training pipeline is updated to capture and log MoE-specific metrics including load entropy, capacity utilization, and dropped token statistics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **MoE Architecture**<br>`bionemo-recipes/models/llama3/modeling_llama_te.py` | Introduces `NVLlamaMoEFeedForward` and `NVLlamaMoETransformerLayer` classes for expert routing, token capacity management, and auxiliary loss computation. Extends `NVLlamaConfig` with MoE hyperparameters (`use_moe`, `moe_num_experts`, `moe_top_k`, `moe_capacity_factor`, `moe_min_capacity`, `moe_drop_tokens`, `moe_aux_loss_coef`). Updates `NVLlamaModel` to conditionally instantiate MoE layers. Propagates MoE auxiliary metrics through forward outputs and into `NVLlamaForCausalLM` loss computation. Adds a `_te_device()` helper for device routing. Updates `_no_split_modules` to include the new MoE layer class. |
| **Configuration**<br>`bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml`, `bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py` | Exposes MoE configuration options in Hydra defaults, gated on the `use_te` and `use_moe` flags. Initializes MoE attributes on `NVLlamaConfig` from command-line arguments during training setup. |
| **Metrics & Logging**<br>`bionemo-recipes/recipes/llama3_native_te/perf_logger.py` | Adds MoE metrics collection (`moe_aux_loss`, `moe_load_entropy`, `moe_load_max`, `moe_dropped_tokens`, `moe_capacity`) to the logging workflow when `use_moe` is enabled. Implements running counters for MoE statistics aggregation and per-step emission alongside existing metrics (a hedged sketch of these stats follows the table). |
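As referenced in the Metrics & Logging row above, here is a hedged sketch of how those per-step stats could be derived from per-expert token counts. The metric keys match the PR, but the function name and computations are illustrative assumptions rather than the actual `perf_logger.py` code.

```python
# Hedged sketch: derive the logged MoE stats from the router's per-expert token counts.
import torch


def moe_routing_stats(tokens_per_expert: torch.Tensor, capacity: int) -> dict:
    # tokens_per_expert: [num_experts] count of tokens routed to each expert this step.
    total = tokens_per_expert.sum().clamp(min=1)
    load = tokens_per_expert.float() / total                      # load fraction per expert
    entropy = -(load * torch.log(load.clamp(min=1e-9))).sum()     # high entropy = balanced routing
    dropped = (tokens_per_expert - capacity).clamp(min=0).sum()   # tokens over capacity
    return {
        "moe_load_entropy": entropy.item(),
        "moe_load_max": load.max().item(),        # worst-case expert share
        "moe_dropped_tokens": int(dropped.item()),
        "moe_capacity": capacity,
    }


# Example: 8 experts, uneven routing, capacity of 40 tokens per expert.
stats = moe_routing_stats(torch.tensor([55, 30, 28, 35, 41, 20, 25, 22]), capacity=40)
```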

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Tokens
    participant TE as MoE Transformer<br/>Layer
    participant Router as Expert Router
    participant Experts as Expert<br/>Feed-Forwards
    participant AuxLoss as Auxiliary<br/>Loss Tracker
    participant Output as Layer Output

    Input->>TE: hidden_states
    TE->>Router: Route tokens to experts
    Router->>Router: Calculate capacity<br/>& select top-k experts
    Router->>Experts: Dispatch tokens
    Experts->>Experts: Process via selected<br/>experts
    Experts->>AuxLoss: Load balancing metrics
    AuxLoss->>AuxLoss: Track load_entropy,<br/>load_max, dropped_tokens
    AuxLoss->>Output: Aggregate auxiliary loss
    Experts->>Output: Updated hidden_states
    Output->>Output: Combine expert outputs
    Output-->>Input: (hidden_states, aux_loss,<br/>load_metrics)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰✨ Behold, dear coders brave and true,
MoE routing paths now split in two!
With experts wise and tokens load-balanced,
Our Llama hops toward distributed prance!
🎯 Auxiliary loss guides the way,
As rabbit engineers dance and sway! 🥕

🚥 Pre-merge checks (✅ 1 passed | ❌ 2 warnings)

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description lacks required template sections such as Description details, Usage example, Type of changes, and the pre-submit checklist items expected by the repository template. | Complete the PR description using the provided template: add a detailed Description section, a Usage code example, mark the Type of changes, and confirm the pre-submit checklist items. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 52.63%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title "MoE Support for llama3 Using TransformerEngine" accurately describes the main change: adding mixture-of-experts support to the llama3 model via TransformerEngine integration. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@pstjohn (Collaborator) commented Feb 2, 2026

@coderabbitai review

@coderabbitai bot (Contributor) commented Feb 2, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
