Is your feature request related to a problem? Please describe.
Training GLM5 on 6 nodes with the Muon optimizer and LoRA runs out of memory (OOM).
Tag the @mcore-oncall to get oncall's attention to this issue.
Describe the solution you'd like
Offloading the Muon optimizer states to CPU would help with this issue.
Describe alternatives you've considered
Increasing the number of GPUs can help, but offloading the Muon optimizer states to CPU, as is already done for Adam, would be more helpful.
Additional context
A tentative PR for CPU offload has been created: #4475. Feel free to review!