Stiefel-LoRA: Orthogonalized and Rank-Efficient LoRA#1333
Draft
Koratahiu wants to merge 8 commits into Nerogar:master from
This PR introduces Stiefel-LoRA, a novel optimizer designed to address redundant rank collapse in standard LoRA training. Originally proposed in:
"Riemannian Optimization for LoRA on the Stiefel Manifold"
The Problem
Standard optimizers (such as AdamW and Muon) often fail to maintain the expressivity of the chosen rank. For example, a LoRA layer with rank 64 may see its effective rank reduced by ~25% or more during training, leading to stalled convergence and sub-optimal performance.
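To make "effective rank" concrete, here is a minimal NumPy sketch that counts how many singular values of the LoRA update B @ A remain significant. The names `B`, `A`, and `effective_rank` are illustrative, not OneTrainer's internals:

```python
import numpy as np

def effective_rank(B, A, tol=1e-3):
    """Count singular values of the LoRA update B @ A above tol * largest."""
    s = np.linalg.svd(B @ A, compute_uv=False)  # descending order
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
r, n = 64, 512
A = rng.standard_normal((r, n))
# Decaying column scales mimic directions collapsing during training.
B = rng.standard_normal((n, r)) * np.geomspace(1.0, 1e-6, r)
print(effective_rank(B, A))  # noticeably below the nominal rank of 64
```

A collapsed adapter still stores rank-64 factors, but most of those directions contribute nothing to the learned update.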
The Solution
Stiefel-LoRA preserves the geometry of the weight matrices through two distinct strategies:
❗ Note: Even DoRA suffers from rank collapse. This method is fully compatible with DoRA and significantly improves its training stability.
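The core idea behind optimizing on the Stiefel manifold (the set of matrices with orthonormal columns) can be sketched in a few lines. This is a minimal illustration of the general technique, assuming a tangent-space projection followed by a QR retraction; it is not the PR's actual implementation:

```python
import numpy as np

def stiefel_step(X, G, lr):
    """One Riemannian gradient step on the Stiefel manifold {X : X^T X = I}.

    Project the Euclidean gradient G onto the tangent space at X, take the
    step, then retract back onto the manifold with a QR decomposition.
    """
    XtG = X.T @ G
    riem_grad = G - X @ (XtG + XtG.T) / 2  # tangent-space projection
    Y = X - lr * riem_grad
    Q, R = np.linalg.qr(Y)                 # retraction to the manifold
    return Q * np.sign(np.diag(R))         # fix QR's column-sign ambiguity

rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.standard_normal((512, 64)))
G = rng.standard_normal((512, 64))
X_new = stiefel_step(X, G, lr=0.1)
# Columns remain exactly orthonormal after the update, so no direction
# can collapse onto another: the effective rank is preserved by construction.
print(np.allclose(X_new.T @ X_new, np.eye(64)))
```

Because every iterate stays on the manifold, the adapter's directions cannot become linearly dependent, which is exactly the failure mode described above.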
Integrated Rank-Invariant Scaling
I have implemented rank-invariant scaling, allowing you to tune your Learning Rate and Weight Decay once and transfer those hyperparameters across any rank.
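The arithmetic behind this requirement is simple: standard LoRA scales the update by alpha / rank, so fixing alpha = rank makes that factor 1 at every rank. A tiny illustration (the `lora_scale` helper is hypothetical, shown only to make the scaling explicit):

```python
# Standard LoRA applies delta_W scaled by alpha / rank.
def lora_scale(alpha, rank):
    return alpha / rank

# With alpha = rank, the scale is 1 regardless of rank, so a tuned
# LR / weight-decay pair transfers unchanged across ranks.
for rank in (16, 32, 64):
    assert lora_scale(alpha=rank, rank=rank) == 1.0

# With a fixed alpha (e.g. 16), the scale shrinks as rank grows,
# which is why hyperparameters normally do not transfer:
print([lora_scale(16, r) for r in (16, 32, 64)])  # [1.0, 0.5, 0.25]
```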
❗ Important: You MUST set alpha = rank (effectively disabling the standard LoRA heuristic scaling) for rank invariance to work correctly.
Optimizer Architecture
Stiefel-LoRA is a hybrid optimizer:
Usage
adv_optm==2.3.dev3
Use Stiefel LoRA