Muon CPU offload #4691

**Is your feature request related to a problem? Please describe.**
Training GLM5 on 6 nodes with the Muon optimizer and LoRA runs out of memory (OOM).

Tag the [@mcore-oncall](https://github.com/orgs/NVIDIA/teams/mcore-oncall) 
to get oncall's attention to this issue.

**Describe the solution you'd like**
Offloading Muon's optimizer states to the CPU would help avoid the OOM.
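
To get a rough idea of how much memory the offload would move to the host, here is a back-of-the-envelope estimate. It is only a sketch: the function name and parameters are hypothetical, and it assumes Muon keeps a single fp32 momentum buffer per trainable parameter (Adam, for comparison, keeps two moment buffers). The actual Megatron-LM optimizer may hold additional state, so treat the numbers as a lower bound.

```python
def optimizer_state_gib(num_trainable_params: int,
                        buffers_per_param: int = 1,
                        bytes_per_element: int = 4,
                        num_shards: int = 1) -> float:
    """Estimate optimizer-state memory per shard, in GiB.

    buffers_per_param: 1 is assumed for Muon (momentum only), 2 for Adam
    (exp_avg and exp_avg_sq). bytes_per_element=4 assumes fp32 state.
    num_shards models state that is split evenly across ranks.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    total_bytes = num_trainable_params * buffers_per_param * bytes_per_element
    return total_bytes / num_shards / 2**30


# Hypothetical example: 2**30 trainable parameters, state not sharded.
muon_gib = optimizer_state_gib(2**30, buffers_per_param=1)  # one momentum buffer
adam_gib = optimizer_state_gib(2**30, buffers_per_param=2)  # two moment buffers
print(f"Muon: {muon_gib:.1f} GiB, Adam: {adam_gib:.1f} GiB")
# prints "Muon: 4.0 GiB, Adam: 8.0 GiB"
```

Whatever this estimate comes to for the real run is the amount of GPU memory per rank that CPU offload could free, at the cost of host-device transfers around each optimizer step.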

**Describe alternatives you've considered**
Increasing the number of GPUs can help, but offloading Muon optimizer states to CPU, as is already done for Adam, would also be helpful.

**Additional context**
A tentative PR for CPU offload is available at https://github.com/NVIDIA/Megatron-LM/pull/4475; feel free to review!

