
Refactor: Cleanup and remove redundant adv optimizer parameters#1363

Merged
dxqb merged 3 commits into Nerogar:merge from Koratahiu:reduce_params
Mar 13, 2026

Conversation

@Koratahiu (Contributor)
No description provided.

'orthogonal_gradient': {'title': 'OrthoGrad', 'tooltip': 'Reduces overfitting by removing the gradient component parallel to the weight, thus improving generalization.', 'type': 'bool'},
'use_atan2': {'title': 'Atan2 Scaling', 'tooltip': 'A robust replacement for eps, which also incorporates gradient clipping, bounding and stabilizing the optimizer updates.', 'type': 'bool'},
'cautious_mask': {'title': 'Cautious Variant', 'tooltip': 'Applies a mask to dampen or zero-out momentum components that disagree with the current gradient\'s direction.', 'type': 'bool'},
'grams_moment': {'title': 'GRAMS Variant', 'tooltip': 'Aligns the momentum direction with the current gradient direction while preserving its accumulated magnitude.', 'type': 'bool'},
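The OrthoGrad option above can be pictured with a small sketch (my own illustration, not the project's implementation): the gradient component parallel to the weight vector is projected out, so only the part that changes the weight's direction survives.

```python
import numpy as np

def orthograd(weight: np.ndarray, grad: np.ndarray, eps: float = 1e-30) -> np.ndarray:
    """Remove the gradient component parallel to the weight vector.

    Sketch of the OrthoGrad idea: keep only the part of the update that
    rotates the weight, not the part that grows/shrinks its norm.
    """
    w = weight.ravel()
    g = grad.ravel()
    # Scalar projection coefficient of g onto w.
    proj = np.dot(w, g) / (np.dot(w, w) + eps)
    return (g - proj * w).reshape(grad.shape)
```

By construction the returned gradient is orthogonal to the weight, which is the "removing the gradient component parallel to the weight" the tooltip describes.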
@Koratahiu (Contributor, Author)

These are just a waste of time to tune and enable, and they're very unstable; they were originally proposed for large batch sizes.

'Simplified_AdEMAMix': {'title': 'Simplified AdEMAMix', 'tooltip': "Enables a simplified, single-EMA variant of AdEMAMix. Instead of blending two moving averages (fast and slow momentum), this version combines the raw current gradient (controlled by 'Grad α') directly with a single theory-based momentum. This makes the optimizer highly responsive to recent gradient information, which can accelerate training in all batch size scenarios when tuned correctly.", 'type': 'bool'},
'alpha_grad': {'title': 'Grad α', 'tooltip': 'Controls the mixing coefficient between raw gradients and momentum gradients in Simplified AdEMAMix. Higher values (e.g., 10-100) emphasize recent gradients, suitable for small batch sizes to reduce noise. Lower values (e.g., 0-1) emphasize historical gradients, suitable for large batch sizes for stability. Setting to 0 uses only momentum gradients without raw gradient contribution.', 'type': 'float'},
'kourkoutas_beta': {'title': 'Kourkoutas Beta', 'tooltip': 'Enables a layer-wise dynamic β₂ adaptation. This feature makes the optimizer more responsive to "spiky" gradients by lowering β₂ during periods of high variance, and more stable during calm periods by raising β₂ towards its maximum. It can significantly improve training stability and final loss.', 'type': 'bool'},
'k_warmup_steps': {'title': 'K-β Warmup Steps ', 'tooltip': 'When using Kourkoutas Beta, the number of initial training steps during which the dynamic β₂ logic is held off. In this period, β₂ is set to its fixed value to allow for initial training stability before the adaptive mechanism activates.', 'type': 'int'},
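A minimal sketch of the single-EMA blend that "Simplified AdEMAMix" and "Grad α" describe (names and shapes are my assumptions; the real optimizer presumably also applies second-moment normalization and bias correction):

```python
import numpy as np

def simplified_ademamix_step(param, grad, m, lr=1e-3, beta=0.999, alpha=10.0):
    """One sketch step of a simplified, single-EMA AdEMAMix variant.

    update = m_t + alpha * g_t, where alpha is 'Grad α':
    alpha = 0 uses only the slow momentum; large alpha emphasizes
    the raw current gradient.
    """
    m = beta * m + (1.0 - beta) * grad   # single slow EMA of gradients
    update = m + alpha * grad            # raw-gradient contribution scaled by alpha
    param = param - lr * update
    return param, m
```

With alpha = 0 this degenerates to plain EMA momentum, matching the tooltip's "only momentum gradients without raw gradient contribution".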
@Koratahiu (Contributor, Author)

Now dynamically calculated based on LR warm-up

'rms_rescaling': {'title': 'RMS Rescaling', 'tooltip': 'Muon already scales its updates to approximate and use the same learning rate (LR) as Adam. This option integrates a more accurate method to match the Adam LR, but it is slower.', 'type': 'bool'},
'normuon_variant': {'title': 'NorMuon Variant', 'tooltip': 'Enables the NorMuon optimizer variant, which combines Muon orthogonalization with per-neuron adaptive learning rates for better convergence and balanced parameter updates. Costs only one scalar state buffer per parameter group, size few KBs, maintaining high memory efficiency.', 'type': 'bool'},
'beta2_normuon': {'title': 'NorMuon Beta2', 'tooltip': 'Exponential decay rate for the neuron-wise second-moment estimator in NorMuon (analogous to Adam\'s beta2). Controls how past squared updates influence current normalization.', 'type': 'float'},
'normuon_eps': {'title': 'NorMuon EPS', 'tooltip': 'Epsilon for NorMuon normalization stability.', 'type': 'float'},
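The per-neuron adaptive normalization that the NorMuon tooltips describe can be sketched row-wise for a 2-D weight update (my own illustration under assumed conventions, not the PR's code): one EMA scalar per output neuron, which is why the extra state is only a few KB.

```python
import numpy as np

def normuon_normalize(update, v, beta2=0.95, eps=1e-8):
    """Per-neuron (row-wise) RMS normalization sketch for a 2-D update.

    v holds one scalar per output neuron (row), so the extra optimizer
    state is tiny compared to a full second-moment buffer.
    """
    row_ms = np.mean(update ** 2, axis=1)    # mean squared update per row
    v = beta2 * v + (1.0 - beta2) * row_ms   # EMA of per-row second moment
    normed = update / (np.sqrt(v)[:, None] + eps)
    return normed, v
```

After normalization each neuron's update has roughly unit RMS, balancing update magnitudes across neurons.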
@Koratahiu (Contributor, Author)

This has little to no effect and is harmless to remove.

'accelerated_ns': {'title': 'Accelerated Newton-Schulz', 'tooltip': 'Applies an enhanced Newton-Schulz variant that replaces heuristic coefficients with optimal coefficients derived at each step. This improves performance and convergence by reducing the number of required operations.', 'type': 'bool'},
'cautious_wd': {'title': 'Cautious Weight Decay', 'tooltip': 'Applies weight decay only to parameter coordinates whose signs align with the optimizer update direction. This preserves the original optimization objective while still benefiting from regularization effects, leading to improved convergence and better final performance.', 'type': 'bool'},
'approx_mars': {'title': 'Approx MARS-M', 'tooltip': 'Enables Approximated MARS-M, a variance reduction technique. It uses the previous step\'s gradient to correct the current update, leading to lower losses and improved convergence stability. This requires additional state to store the previous gradient.', 'type': 'bool'},
'kappa_p': {'title': 'Lion-K P-value', 'tooltip': 'Controls the Lp-norm geometry for the Lion update. 1.0 = Standard Lion (Sign update, coordinate-wise), best for Transformers. 2.0 = Spherical Lion (Normalized update, rotational invariant), best for Conv2d layers (in unet models). Values between 1.0 and 2.0 interpolate behavior between the two.', 'type': 'float'},
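A hedged sketch of the "Cautious Weight Decay" masking described above (the sign convention and names are my assumptions): decay is applied only at coordinates where the parameter's sign matches the update direction.

```python
import numpy as np

def cautious_weight_decay(param, update, lr=1e-3, wd=0.01):
    """Decoupled weight decay applied only where the parameter's sign
    aligns with the update direction (cautious-WD sketch)."""
    # 1.0 where signs agree, 0.0 elsewhere.
    mask = (np.sign(param) == np.sign(update)).astype(param.dtype)
    # Standard decoupled step, but the decay term is masked.
    return param - lr * update - lr * wd * mask * param
```

Masked coordinates take a plain gradient step with no decay, which is how the original optimization objective is preserved at those coordinates.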
@Koratahiu (Contributor, Author)

auto_kappa_p is enough to tune; I don't think kappa_p values between 1 and 2 have any practical use cases.


AdEMAMix = 'AdEMAMix'
AdEMAMix_8BIT = "AdEMAMix_8BIT"
SIMPLIFIED_AdEMAMix = "SIMPLIFIED_AdEMAMix"
@Koratahiu (Contributor, Author), Mar 7, 2026

Removed; these can be integrated as two new options for AdamW_adv instead (same as in Adopt_adv). WIP.

Collaborator

remove now or when AdamW_ADV has it?

@Koratahiu (Contributor, Author)

Remove now

PRODIGY = 'PRODIGY'
PRODIGY_PLUS_SCHEDULE_FREE = 'PRODIGY_PLUS_SCHEDULE_FREE'
PRODIGY_ADV = 'PRODIGY_ADV'
LION_PRODIGY_ADV = 'LION_PRODIGY_ADV'
@Koratahiu (Contributor, Author)

Removed, heuristic and unstable.

Collaborator

What about any of the original yogi/lion?

@Koratahiu (Contributor, Author), Mar 9, 2026

Lion is famous enough not to be removed (and it works well if you're training small datasets over short periods, which covers most OT use cases).
As for Yogi, I don't know how it works tbh, but I read somewhere that it's a poor adjustment that causes Adam to diverge from its natural gradient mechanism.

This PR is for adv optimizers, but I can open another to remove outdated and redundant optimizers.

Collaborator

That would be appreciated. Anything that's been superseded (except AdamW) should be removed; our optimizer list is the paradox of choice right now.

Comment thread: modules/util/create.py
optimizer_config.beta2 if optimizer_config.beta2 is not None else 0.99),
eps=optimizer_config.eps if optimizer_config.eps is not None else 1e-8,
weight_decay=optimizer_config.weight_decay if optimizer_config.weight_decay is not None else 0.0,
use_bias_correction=optimizer_config.use_bias_correction if optimizer_config.use_bias_correction is not None else True,
@Koratahiu (Contributor, Author)

It should always be ON. I don't see a use case for disabling it, so just freeing up some room.

"compile": False,
"fused_back_pass": False,
"use_atan2": False,
"use_atan2": True,
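The use_atan2 default change refers to the atan2-scaling idea described in the tooltip above: replace m / (sqrt(v) + eps) with atan2(m, sqrt(v)), which is bounded and removes the eps hyperparameter entirely. A sketch (any scaling constants the actual implementation uses are omitted):

```python
import numpy as np

def atan2_update(m, v):
    """Atan2 scaling sketch: bounded replacement for m / (sqrt(v) + eps).

    np.arctan2 is bounded in (-pi/2, pi/2), so huge ratios are implicitly
    clipped, and v = 0 yields a finite update instead of a blow-up.
    """
    return np.arctan2(m, np.sqrt(v))
```

For small m / sqrt(v) this is close to the standard Adam-style update, while extreme ratios saturate instead of exploding, which is the "clipping, bounding and stabilizing" behavior the tooltip mentions.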
@Koratahiu (Contributor, Author)

Unrelated, but it's a lot better than ADOPT's heuristic high epsilon + clipping.

@Koratahiu Koratahiu marked this pull request as ready for review March 7, 2026 22:18
@Koratahiu Koratahiu changed the title Refactor: Cleanup and remove redundant optimizer parameters Refactor: Cleanup and remove redundant adv optimizer parameters Mar 9, 2026
@dxqb (Collaborator) left a comment

looks good to me, but not my area of expertise

@dxqb dxqb added the merging last steps before merge label Mar 12, 2026
@dxqb dxqb changed the base branch from master to merge March 13, 2026 18:51
@dxqb dxqb merged commit c24f5d2 into Nerogar:merge Mar 13, 2026
1 check passed

3 participants