
Stiefel-LoRA: Orthogonalized and Rank-Efficient LoRA #1333

Draft

Koratahiu wants to merge 8 commits into Nerogar:master from Koratahiu:Stiefel_LoRA

Conversation

Contributor

Koratahiu commented Feb 19, 2026

This PR introduces Stiefel-LoRA, a novel optimizer designed to address redundant rank collapse in standard LoRA training. It was originally proposed in:

"Riemannian Optimization for LoRA on the Stiefel Manifold"

The Problem

Standard optimizers (AdamW, Muon, etc.) often fail to maintain the expressivity of the chosen rank. For example, a LoRA layer with rank 64 may see its effective rank reduced by ~25% or more during training, leading to stalled convergence and suboptimal performance.
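To make "effective rank" concrete: a common proxy counts singular values above a small fraction of the largest one. The probe below is my own illustration (the threshold, shapes, and the artificial collapse are not from the PR):

```python
import numpy as np

def effective_rank(W, rel_tol=1e-2):
    # Count singular values above rel_tol * sigma_max: a simple proxy
    # for how many of the nominal rank directions actually carry signal.
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# A nominally rank-64 LoRA delta B @ A whose last 16 B-directions
# have (artificially) collapsed, mimicking the ~25% loss described above.
rng = np.random.default_rng(0)
B = rng.standard_normal((512, 64))
B[:, 48:] *= 1e-6
A = rng.standard_normal((64, 512))
print(effective_rank(B @ A))  # → 48: only 48 of 64 directions survive
```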


The Solution

Stiefel-LoRA preserves the geometry of the weight matrices through two distinct strategies:

  1. B-Factor (Orthogonal): Initialized as an orthogonal matrix with a consistent norm of 1. A standard optimizer update is applied, and the result is projected back onto the Stiefel manifold, followed by a retraction step (QR decomposition). This keeps the matrix orthogonal, so the update changes only the directions/angles.
  2. A-Factor (Zero-Initialized): Treated as a set of vectors and trained normally using a standard optimizer, which is both theoretically sound and empirically effective.
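The B-factor update can be sketched as a project-then-retract step. This is a hedged NumPy illustration of the general recipe (tangent-space projection of the gradient, then QR retraction), not the PR's actual PyTorch implementation; the function name and learning rate are mine:

```python
import numpy as np

def stiefel_update_B(B, grad, lr=1e-4):
    # One hypothetical update step for the orthogonal B-factor.
    # B: (d, r) matrix with orthonormal columns (d >= r).
    # Project the Euclidean gradient onto the tangent space of the
    # Stiefel manifold at B: G - B * sym(B^T G).
    BtG = B.T @ grad
    riem_grad = grad - B @ ((BtG + BtG.T) / 2)
    # Take a plain gradient step, then retract back onto the manifold
    # via QR decomposition so the columns stay orthonormal.
    Q, R = np.linalg.qr(B - lr * riem_grad)
    # Fix the sign ambiguity of QR so the retraction is deterministic.
    Q = Q * np.sign(np.diag(R))
    return Q

# Usage: start from a random orthogonal B; orthogonality is preserved.
rng = np.random.default_rng(0)
B0, _ = np.linalg.qr(rng.standard_normal((16, 4)))
B1 = stiefel_update_B(B0, rng.standard_normal((16, 4)))
```

Because the retraction re-orthonormalizes after every step, the B-factor can never lose directions, which is exactly what blocks rank collapse on that side.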

Note: Even DoRA suffers from rank collapse. This method is fully compatible with DoRA and significantly improves its training stability.


Integrated Rank-Invariant Scaling

I have implemented rank-invariant scaling, allowing you to tune your Learning Rate and Weight Decay once and transfer those hyperparameters across any rank.

Important: You MUST set alpha = rank (effectively disabling the standard LoRA heuristic scaling) for rank invariance to work correctly.

  1. Learning Rate (LR):
  • The Stiefel projection of the B-factor and the sign update of the A-factor are normalized to an RMS of 0.2.
  • This ensures both factors learn at the same rate and maintains the update size typical of Adam, so you don't need to search for new LR ranges.
  2. Weight Decay (WD):
  • B-Factor: WD is ignored, as the B-factor is already constrained by orthogonality; applying decay would destroy its geometric properties.
  • A-Factor: WD is scaled by 1/rank. This ensures the regularization strength is identical for different ranks, e.g. rank 1 and rank 128.
  • Decoupled: WD is decoupled from the LR to prevent bias and eliminate the need for re-tuning.
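The two scaling rules can be sketched in NumPy. The function names, the epsilon guard, and the assumption that A has shape (rank, in_features) are mine; the 0.2 RMS target and the 1/rank decay come from the description above:

```python
import numpy as np

def rms_normalize(update, target_rms=0.2):
    # Scale an update so its root-mean-square entry equals target_rms,
    # mirroring the "RMS of 0.2" normalization described above.
    rms = np.sqrt(np.mean(update ** 2))
    return update * (target_rms / max(rms, 1e-12))

def decoupled_wd_A(A, wd=0.1, rank=None):
    # Weight decay for the A-factor, scaled by 1/rank and decoupled
    # from the LR (multiplicative shrink, not lr * wd).
    # Assumes A has shape (rank, in_features), the usual LoRA layout.
    rank = rank if rank is not None else A.shape[0]
    return A * (1.0 - wd / rank)
```

With both rules in place, changing the rank changes neither the per-step update magnitude nor the effective regularization strength, which is what makes the tuned LR/WD transferable.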

Optimizer Architecture

Stiefel-LoRA is a hybrid optimizer:

  • For the A-Factor: It utilizes SignSGD. Treating the A-factor as a set of vectors is the most robust approach; SignSGD acts as a "simplified Adam," providing a consistent update magnitude (RMS of 1 before scaling) that is both powerful and memory-efficient.
  • For the B-Factor: It utilizes Riemannian Optimization (Stiefel manifold projection and QR retraction) to ensure the basis remains orthogonal throughout the entire training process.
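The A-factor half of the hybrid is simple enough to show directly. A hedged sketch (function name and defaults are mine): sign(grad) has entries in {-1, 0, +1}, so a dense gradient gives RMS 1, which is then rescaled to the 0.2 target shared with the B-factor:

```python
import numpy as np

def signsgd_update_A(A, grad, lr=1e-4, target_rms=0.2):
    # SignSGD step for the A-factor: constant-magnitude update
    # ("simplified Adam"), rescaled to match the B-factor's RMS target.
    return A - lr * target_rms * np.sign(grad)
```

Like Adam, this makes the step size independent of the raw gradient scale, but it needs no second-moment state, which is where the memory savings come from.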

Usage

  • Update to adv_optm==2.3.dev3
  • Enable Use Stiefel LoRA
  • Tune your optimizer and train!

I found a Cautious WD of 0.1 to work very well, with a base LR of 1e-4 or 5e-5 (scaled by both BS and AS using the learning rate scalar) for Flux2 Klein.

Remember: alpha = rank.

