Refactor: Cleanup and remove redundant adv optimizer parameters#1363
dxqb merged 3 commits into Nerogar:merge from
Conversation
| 'orthogonal_gradient': {'title': 'OrthoGrad', 'tooltip': 'Reduces overfitting by removing the gradient component parallel to the weight, thus improving generalization.', 'type': 'bool'}, | ||
| 'use_atan2': {'title': 'Atan2 Scaling', 'tooltip': 'A robust replacement for eps, which also incorporates gradient clipping, bounding and stabilizing the optimizer updates.', 'type': 'bool'}, | ||
| 'cautious_mask': {'title': 'Cautious Variant', 'tooltip': 'Applies a mask to dampen or zero-out momentum components that disagree with the current gradients direction.', 'type': 'bool'}, | ||
| 'grams_moment': {'title': 'GRAMS Variant', 'tooltip': 'Aligns the momentum direction with the current gradient direction while preserving its accumulated magnitude.', 'type': 'bool'}, |
Those are just a waste of time to tune and enable, and they're very unstable; they were originally proposed for large batch sizes.
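For readers unfamiliar with these flags, here is a minimal, hedged sketch of what two of the removed toggles roughly do, written as standalone tensor ops with illustrative names (not the project's actual implementation):

```python
import torch

def orthogonal_gradient(param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # OrthoGrad: drop the gradient component parallel to the weight vector,
    # so the update cannot simply grow the weight norm.
    w = param.flatten()
    g = grad.flatten()
    proj = torch.dot(w, g) / (torch.dot(w, w) + 1e-30)
    return (g - proj * w).view_as(grad)

def cautious_mask(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Cautious variant: zero out update components whose sign disagrees with the
    # current gradient, then rescale to roughly preserve the update magnitude.
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / (mask.sum() + 1))
```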
| 'Simplified_AdEMAMix': {'title': 'Simplified AdEMAMix', 'tooltip': "Enables a simplified, single-EMA variant of AdEMAMix. Instead of blending two moving averages (fast and slow momentum), this version combines the raw current gradient (controlled by 'Grad α') directly with a single theory-based momentum. This makes the optimizer highly responsive to recent gradient information, which can accelerate training in all batch size scenarios when tuned correctly.", 'type': 'bool'}, | ||
| 'alpha_grad': {'title': 'Grad α', 'tooltip': 'Controls the mixing coefficient between raw gradients and momentum gradients in Simplified AdEMAMix. Higher values (e.g., 10-100) emphasize recent gradients, suitable for small batch sizes to reduce noise. Lower values (e.g., 0-1) emphasize historical gradients, suitable for large batch sizes for stability. Setting to 0 uses only momentum gradients without raw gradient contribution.', 'type': 'float'}, | ||
| 'kourkoutas_beta': {'title': 'Kourkoutas Beta', 'tooltip': 'Enables a layer-wise dynamic β₂ adaptation. This feature makes the optimizer more responsive to "spiky" gradients by lowering β₂ during periods of high variance, and more stable during calm periods by raising β₂ towards its maximum. It can significantly improve training stability and final loss.', 'type': 'bool'}, | ||
| 'k_warmup_steps': {'title': 'K-β Warmup Steps ', 'tooltip': 'When using Kourkoutas Beta, the number of initial training steps during which the dynamic β₂ logic is held off. In this period, β₂ is set to its fixed value to allow for initial training stability before the adaptive mechanism activates.', 'type': 'int'}, |
Now dynamically calculated based on LR warm-up
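For context on what 'Grad α' controls, a hedged sketch of the Simplified AdEMAMix update (illustrative names, no bias correction; the Kourkoutas warm-up logic, now tied to the LR warm-up, would wrap the β₂ used here):

```python
import torch

def simplified_ademamix_step(param, grad, exp_avg, exp_avg_sq,
                             lr, beta1, beta2, alpha_grad, eps=1e-8):
    # Single slow momentum EMA (instead of AdEMAMix's fast + slow pair).
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # 'Grad α' (alpha_grad) mixes the raw gradient back into the numerator:
    # large values favour recent gradients (small batches), 0 gives pure momentum.
    update = (exp_avg + alpha_grad * grad) / (exp_avg_sq.sqrt() + eps)
    param.add_(update, alpha=-lr)
```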
| 'rms_rescaling': {'title': 'RMS Rescaling', 'tooltip': 'Muon already scales its updates to approximate and use the same learning rate (LR) as Adam. This option integrates a more accurate method to match the Adam LR, but it is slower.', 'type': 'bool'}, | ||
| 'normuon_variant': {'title': 'NorMuon Variant', 'tooltip': 'Enables the NorMuon optimizer variant, which combines Muon orthogonalization with per-neuron adaptive learning rates for better convergence and balanced parameter updates. Costs only one scalar state buffer per parameter group, size few KBs, maintaining high memory efficiency.', 'type': 'bool'}, | ||
| 'beta2_normuon': {'title': 'NorMuon Beta2', 'tooltip': 'Exponential decay rate for the neuron-wise second-moment estimator in NorMuon (analogous to Adams beta2). Controls how past squared updates influence current normalization.', 'type': 'float'}, | ||
| 'normuon_eps': {'title': 'NorMuon EPS', 'tooltip': 'Epsilon for NorMuon normalization stability.', 'type': 'float'}, |
This has little to no effect and is harmless to remove.
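To make the point about the epsilon concrete, a rough sketch of where the NorMuon per-neuron second moment and its eps sit (illustrative only, assuming a 2D weight after Muon's orthogonalization; not the actual implementation):

```python
import torch

def normuon_rescale(ortho_update: torch.Tensor, row_second_moment: torch.Tensor,
                    beta2: float = 0.95, eps: float = 1e-8) -> torch.Tensor:
    # Per-neuron (row-wise) adaptive scaling applied after orthogonalization.
    row_norm_sq = ortho_update.pow(2).mean(dim=1)              # one scalar per output neuron
    row_second_moment.mul_(beta2).add_(row_norm_sq, alpha=1 - beta2)
    # eps only guards this division, which is why dropping the setting is harmless.
    return ortho_update / (row_second_moment.sqrt() + eps).unsqueeze(1)
```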
| 'accelerated_ns': {'title': 'Accelerated Newton-Schulz', 'tooltip': 'Applies an enhanced Newton-Schulz variant that replaces heuristic coefficients with optimal coefficients derived at each step. This improves performance and convergence by reducing the number of required operations.', 'type': 'bool'}, | ||
| 'cautious_wd': {'title': 'Cautious Weight Decay', 'tooltip': 'Applies weight decay only to parameter coordinates whose signs align with the optimizer update direction. This preserves the original optimization objective while still benefiting from regularization effects, leading to improved convergence and better final performance.', 'type': 'bool'}, | ||
| 'approx_mars': {'title': 'Approx MARS-M', 'tooltip': 'Enables Approximated MARS-M, a variance reduction technique. It uses the previous step\'s gradient to correct the current update, leading to lower losses and improved convergence stability. This requires additional state to store the previous gradient.', 'type': 'bool'}, | ||
| 'kappa_p': {'title': 'Lion-K P-value', 'tooltip': 'Controls the Lp-norm geometry for the Lion update. 1.0 = Standard Lion (Sign update, coordinate-wise), best for Transformers. 2.0 = Spherical Lion (Normalized update, rotational invariant), best for Conv2d layers (in unet models). Values between 1.0 and 2.0 interpolate behavior between the two.', 'type': 'float'}, |
"auto_kappa_p is enough to tune; I don't think kappa_p values between 1 and 2 have any practical use cases
| AdEMAMix = 'AdEMAMix' | ||
| AdEMAMix_8BIT = "AdEMAMix_8BIT" | ||
| SIMPLIFIED_AdEMAMix = "SIMPLIFIED_AdEMAMix" |
Removed; these can be integrated as two new options for AdamW_adv instead (same as in Adopt_adv). WIP
Remove now, or when AdamW_ADV has it?
| PRODIGY = 'PRODIGY' | ||
| PRODIGY_PLUS_SCHEDULE_FREE = 'PRODIGY_PLUS_SCHEDULE_FREE' | ||
| PRODIGY_ADV = 'PRODIGY_ADV' | ||
| LION_PRODIGY_ADV = 'LION_PRODIGY_ADV' |
Removed, heuristic and unstable.
What about any of the original Yogi/Lion?
Lion is famous enough not to be removed (and it works well if you're training small datasets over short periods, which covers most OT use cases).
As for Yogi, I don't know how it works tbh, but I read somewhere that it’s a poor adjustment that causes Adam to diverge from its natural gradient mechanism.
This PR is for the adv optimizers, but I can open another to remove outdated and redundant optimizers.
That would be appreciated; anything that's been superseded (except AdamW) should be removed. Our optimizer list is the paradox of choice right now.
| optimizer_config.beta2 if optimizer_config.beta2 is not None else 0.99), | ||
| eps=optimizer_config.eps if optimizer_config.eps is not None else 1e-8, | ||
| weight_decay=optimizer_config.weight_decay if optimizer_config.weight_decay is not None else 0.0, | ||
| use_bias_correction=optimizer_config.use_bias_correction if optimizer_config.use_bias_correction is not None else True, |
It should always be ON. I don't see a use case for disabling it, so this just frees up some room.
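For anyone wondering what the dropped flag gated, a minimal sketch of the standard Adam-style bias-corrected step size that is now simply always applied (illustrative helper, not the project's code):

```python
import math

def bias_corrected_step_size(lr: float, beta1: float, beta2: float, step: int) -> float:
    # Compensates for the zero-initialized first and second moment EMAs
    # during early steps; with the flag removed this is applied unconditionally.
    bias_correction1 = 1.0 - beta1 ** step
    bias_correction2 = 1.0 - beta2 ** step
    return lr * math.sqrt(bias_correction2) / bias_correction1
```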
| "compile": False, | ||
| "fused_back_pass": False, | ||
| "use_atan2": False, | ||
| "use_atan2": True, |
Unrelated, but it's a lot better than ADOPT's heuristic of a high epsilon plus clipping.
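For context, a rough sketch of the atan2 scaling being enabled by default, following the Adam-atan2 idea (illustrative constants; not necessarily the exact values used here):

```python
import torch

def atan2_update(exp_avg: torch.Tensor, exp_avg_sq: torch.Tensor,
                 a: float = 1.2732395, b: float = 1.0) -> torch.Tensor:
    # Replaces exp_avg / (exp_avg_sq.sqrt() + eps). The result is bounded in
    # (-a*pi/2, a*pi/2), giving implicit clipping with no tunable epsilon.
    return a * torch.atan2(exp_avg, b * exp_avg_sq.sqrt())
```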
dxqb left a comment
looks good to me, but not my area of expertise