Refactor: Cleanup and remove redundant adv optimizer parameters#1363
dxqb merged 3 commits into Nerogar:merge from
Conversation
| 'orthogonal_gradient': {'title': 'OrthoGrad', 'tooltip': 'Reduces overfitting by removing the gradient component parallel to the weight, thus improving generalization.', 'type': 'bool'}, | ||
| 'use_atan2': {'title': 'Atan2 Scaling', 'tooltip': 'A robust replacement for eps, which also incorporates gradient clipping, bounding and stabilizing the optimizer updates.', 'type': 'bool'}, | ||
| 'cautious_mask': {'title': 'Cautious Variant', 'tooltip': 'Applies a mask to dampen or zero-out momentum components that disagree with the current gradients direction.', 'type': 'bool'}, | ||
| 'grams_moment': {'title': 'GRAMS Variant', 'tooltip': 'Aligns the momentum direction with the current gradient direction while preserving its accumulated magnitude.', 'type': 'bool'}, |
Those are just a waste of time to tune and enable, and they're very unstable; they were originally proposed for large batch sizes.
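For readers unfamiliar with these flags, here is a minimal, hedged sketch of what two of the removed toggles roughly do, written as standalone tensor ops with illustrative names (not the project's actual implementation):

```python
import torch

def orthogonal_gradient(param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # OrthoGrad: drop the gradient component parallel to the weight vector,
    # so the update cannot simply grow the weight norm.
    w = param.flatten()
    g = grad.flatten()
    proj = torch.dot(w, g) / (torch.dot(w, w) + 1e-30)
    return (g - proj * w).view_as(grad)

def cautious_mask(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Cautious variant: zero out update components whose sign disagrees with the
    # current gradient, then rescale to roughly preserve the update magnitude.
    mask = (update * grad > 0).to(update.dtype)
    return update * mask * (mask.numel() / (mask.sum() + 1))
```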
| 'Simplified_AdEMAMix': {'title': 'Simplified AdEMAMix', 'tooltip': "Enables a simplified, single-EMA variant of AdEMAMix. Instead of blending two moving averages (fast and slow momentum), this version combines the raw current gradient (controlled by 'Grad α') directly with a single theory-based momentum. This makes the optimizer highly responsive to recent gradient information, which can accelerate training in all batch size scenarios when tuned correctly.", 'type': 'bool'}, | ||
| 'alpha_grad': {'title': 'Grad α', 'tooltip': 'Controls the mixing coefficient between raw gradients and momentum gradients in Simplified AdEMAMix. Higher values (e.g., 10-100) emphasize recent gradients, suitable for small batch sizes to reduce noise. Lower values (e.g., 0-1) emphasize historical gradients, suitable for large batch sizes for stability. Setting to 0 uses only momentum gradients without raw gradient contribution.', 'type': 'float'}, | ||
| 'kourkoutas_beta': {'title': 'Kourkoutas Beta', 'tooltip': 'Enables a layer-wise dynamic β₂ adaptation. This feature makes the optimizer more responsive to "spiky" gradients by lowering β₂ during periods of high variance, and more stable during calm periods by raising β₂ towards its maximum. It can significantly improve training stability and final loss.', 'type': 'bool'}, | ||
| 'k_warmup_steps': {'title': 'K-β Warmup Steps ', 'tooltip': 'When using Kourkoutas Beta, the number of initial training steps during which the dynamic β₂ logic is held off. In this period, β₂ is set to its fixed value to allow for initial training stability before the adaptive mechanism activates.', 'type': 'int'}, |
Now dynamically calculated based on LR warm-up
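For context on what 'Grad α' controls, a hedged sketch of the Simplified AdEMAMix update (illustrative names, no bias correction; the Kourkoutas warm-up logic, now tied to the LR warm-up, would wrap the β₂ used here):

```python
import torch

def simplified_ademamix_step(param, grad, exp_avg, exp_avg_sq,
                             lr, beta1, beta2, alpha_grad, eps=1e-8):
    # Single slow momentum EMA (instead of AdEMAMix's fast + slow pair).
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # 'Grad α' (alpha_grad) mixes the raw gradient back into the numerator:
    # large values favour recent gradients (small batches), 0 gives pure momentum.
    update = (exp_avg + alpha_grad * grad) / (exp_avg_sq.sqrt() + eps)
    param.add_(update, alpha=-lr)
```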
| 'rms_rescaling': {'title': 'RMS Rescaling', 'tooltip': 'Muon already scales its updates to approximate and use the same learning rate (LR) as Adam. This option integrates a more accurate method to match the Adam LR, but it is slower.', 'type': 'bool'}, | ||
| 'normuon_variant': {'title': 'NorMuon Variant', 'tooltip': 'Enables the NorMuon optimizer variant, which combines Muon orthogonalization with per-neuron adaptive learning rates for better convergence and balanced parameter updates. Costs only one scalar state buffer per parameter group, size few KBs, maintaining high memory efficiency.', 'type': 'bool'}, | ||
| 'beta2_normuon': {'title': 'NorMuon Beta2', 'tooltip': 'Exponential decay rate for the neuron-wise second-moment estimator in NorMuon (analogous to Adams beta2). Controls how past squared updates influence current normalization.', 'type': 'float'}, | ||
| 'normuon_eps': {'title': 'NorMuon EPS', 'tooltip': 'Epsilon for NorMuon normalization stability.', 'type': 'float'}, |
This has little to no effect and is harmless to remove.
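To make the point about the epsilon concrete, a rough sketch of where the NorMuon per-neuron second moment and its eps sit (illustrative only, assuming a 2D weight after Muon's orthogonalization; not the actual implementation):

```python
import torch

def normuon_rescale(ortho_update: torch.Tensor, row_second_moment: torch.Tensor,
                    beta2: float = 0.95, eps: float = 1e-8) -> torch.Tensor:
    # Per-neuron (row-wise) adaptive scaling applied after orthogonalization.
    row_norm_sq = ortho_update.pow(2).mean(dim=1)              # one scalar per output neuron
    row_second_moment.mul_(beta2).add_(row_norm_sq, alpha=1 - beta2)
    # eps only guards this division, which is why dropping the setting is harmless.
    return ortho_update / (row_second_moment.sqrt() + eps).unsqueeze(1)
```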
| 'accelerated_ns': {'title': 'Accelerated Newton-Schulz', 'tooltip': 'Applies an enhanced Newton-Schulz variant that replaces heuristic coefficients with optimal coefficients derived at each step. This improves performance and convergence by reducing the number of required operations.', 'type': 'bool'}, | ||
| 'cautious_wd': {'title': 'Cautious Weight Decay', 'tooltip': 'Applies weight decay only to parameter coordinates whose signs align with the optimizer update direction. This preserves the original optimization objective while still benefiting from regularization effects, leading to improved convergence and better final performance.', 'type': 'bool'}, | ||
| 'approx_mars': {'title': 'Approx MARS-M', 'tooltip': 'Enables Approximated MARS-M, a variance reduction technique. It uses the previous step\'s gradient to correct the current update, leading to lower losses and improved convergence stability. This requires additional state to store the previous gradient.', 'type': 'bool'}, | ||
| 'kappa_p': {'title': 'Lion-K P-value', 'tooltip': 'Controls the Lp-norm geometry for the Lion update. 1.0 = Standard Lion (Sign update, coordinate-wise), best for Transformers. 2.0 = Spherical Lion (Normalized update, rotational invariant), best for Conv2d layers (in unet models). Values between 1.0 and 2.0 interpolate behavior between the two.', 'type': 'float'}, |
"auto_kappa_p is enough to tune; I don't think kappa_p values between 1 and 2 have any practical use cases
| AdEMAMix = 'AdEMAMix' | ||
| AdEMAMix_8BIT = "AdEMAMix_8BIT" | ||
| SIMPLIFIED_AdEMAMix = "SIMPLIFIED_AdEMAMix" |
Removed; these can be integrated as two new options for AdamW_adv instead (same as in Adopt_adv). WIP
Remove now, or when AdamW_ADV has it?
| PRODIGY = 'PRODIGY' | ||
| PRODIGY_PLUS_SCHEDULE_FREE = 'PRODIGY_PLUS_SCHEDULE_FREE' | ||
| PRODIGY_ADV = 'PRODIGY_ADV' | ||
| LION_PRODIGY_ADV = 'LION_PRODIGY_ADV' |
Removed, heuristic and unstable.
What about any of the original Yogi/Lion?
Lion is famous enough not to be removed (and it works well if you're training small datasets over short periods, which covers most OT use cases).
As for Yogi, I don't know how it works tbh, but I read somewhere that it’s a poor adjustment that causes Adam to diverge from its natural gradient mechanism.
This PR is for the adv optimizers, but I can open another to remove outdated and redundant optimizers.
That would be appreciated; anything that's been superseded (except AdamW) should be removed. Our optimizer list is the paradox of choice right now.
| optimizer_config.beta2 if optimizer_config.beta2 is not None else 0.99), | ||
| eps=optimizer_config.eps if optimizer_config.eps is not None else 1e-8, | ||
| weight_decay=optimizer_config.weight_decay if optimizer_config.weight_decay is not None else 0.0, | ||
| use_bias_correction=optimizer_config.use_bias_correction if optimizer_config.use_bias_correction is not None else True, |
It should always be ON. I don't see a use case for disabling it, so this just frees up some room.
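For anyone wondering what the dropped flag gated, a minimal sketch of the standard Adam-style bias-corrected step size that is now simply always applied (illustrative helper, not the project's code):

```python
import math

def bias_corrected_step_size(lr: float, beta1: float, beta2: float, step: int) -> float:
    # Compensates for the zero-initialized first and second moment EMAs
    # during early steps; with the flag removed this is applied unconditionally.
    bias_correction1 = 1.0 - beta1 ** step
    bias_correction2 = 1.0 - beta2 ** step
    return lr * math.sqrt(bias_correction2) / bias_correction1
```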
| "compile": False, | ||
| "fused_back_pass": False, | ||
| "use_atan2": False, | ||
| "use_atan2": True, |
Unrelated, but it's a lot better than ADOPT's heuristic of a high epsilon plus clipping.
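For context, a rough sketch of the atan2 scaling being enabled by default, following the Adam-atan2 idea (illustrative constants; not necessarily the exact values used here):

```python
import torch

def atan2_update(exp_avg: torch.Tensor, exp_avg_sq: torch.Tensor,
                 a: float = 1.2732395, b: float = 1.0) -> torch.Tensor:
    # Replaces exp_avg / (exp_avg_sq.sqrt() + eps). The result is bounded in
    # (-a*pi/2, a*pi/2), giving implicit clipping with no tunable epsilon.
    return a * torch.atan2(exp_avg, b * exp_avg_sq.sqrt())
```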
dxqb left a comment
looks good to me, but not my area of expertise