Record: 12L Gradient-Guided Quant + Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1320)#332
Open
saml212 wants to merge 2 commits into openai:main
Conversation
saml212 added a commit to saml212/parameter-golf that referenced this pull request (Mar 21, 2026)
Force-pushed from 9b2aec3 to 4b062e0
RyanLisse added a commit to RyanLisse/parameter-golf that referenced this pull request (Mar 21, 2026)
New CUDA presets:
- pr332_12l_xsa: 12L/2xMLP, seq2048, momentum 0.99 (from PR openai#332)
- pr338_11l_ttt: 11L/2xMLP, seq2048, momentum 0.99 (from PR openai#338)
- bft_ensemble: 9L/3xMLP Byzantine fault tolerant checkpoint config
- difficulty_adjusted: 10L/2xMLP adaptive search with tight LR
- partial_rope_headtemp: baseline arch with novel attention params

Expanded search: NUM_LAYERS includes 11, TRAIN_SEQ_LEN includes 4096.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
saml212 added a commit to saml212/parameter-golf that referenced this pull request (Mar 23, 2026)
Updated CLAUDE.md and idea bank with:
- Current valid leaderboard (PR openai#414 at 1.1233 is the real leader)
- TTT legality analysis (full-val TTT ruled invalid, score-first legal)
- New techniques to adopt: GPTQ-lite, backout, U-Net skips, value residual, catalytic residuals, gated attention
- Phased experiment roadmap: parity -> zero-cost arch -> novel quant -> training
- Dead ends confirmed since openai#332: PPM-C, SwiGLU, depth recurrence
rarce added a commit to rarce/parameter-golf that referenced this pull request (Mar 23, 2026)
Replaces aggressive int6-all + 10% pruning with targeted approach:
- MLP_HIDDEN=1408 (vs 1536): saves ~1.44M params (~1MB compressed), following PR openai#332 which uses 1408 for its 12-layer model
- Int6 on layers 1-9, keep layer 0 and 10 at int8 (input/output quality)
- No magnitude pruning (preserves model quality)

Expected artifact: ~15.5 MB (down from 18 MB). MLP_HIDDEN env var overrides mlp_mult*dim when > 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 4b062e0 to 51f433d
saml212 (Contributor, Author):
3-seed validated, clean diff
Record: 12L Gradient-Guided Quant + Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1320)
val_bpb: 1.1320 (sliding window, stride=64) | 15.7 MB | 8xH100 SXM, 600s
Progress from prior submissions
What's new
Gradient-Guided Adaptive Quantization. Standard int6 quantization treats all weight tensors equally, but not all tensors are equally sensitive to quantization noise. We accumulate per-tensor squared gradient magnitudes during the last 10% of warmdown (zero throughput cost, since these gradients are already computed), then rank tensors by sensitivity at quantization time.
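The PR's own code isn't reproduced here; a minimal sketch of the ranking step, with hypothetical tensor names and an assumed two-tier int8/int6 allocation, might look like:

```python
# Hypothetical sketch: rank tensors by accumulated squared-gradient
# sensitivity, then give the most sensitive ones higher precision.
def rank_by_sensitivity(grad_sq_sums, n_int8=2):
    """grad_sq_sums: dict name -> accumulated sum of squared gradients.
    Returns dict name -> bit-width (8 for the top-n sensitive, 6 otherwise)."""
    ranked = sorted(grad_sq_sums, key=grad_sq_sums.get, reverse=True)
    return {name: (8 if i < n_int8 else 6) for i, name in enumerate(ranked)}

# Illustrative sensitivities (made-up numbers, made-up tensor names)
sens = {"mlp.0": 4.0, "attn.0": 9.5, "mlp.1": 1.2, "attn.1": 7.1}
bits = rank_by_sensitivity(sens, n_int8=2)
# attn.0 and attn.1 rank highest -> int8; the rest fall back to int6
```

The real allocation presumably also weighs tensor size against the byte budget, not just rank.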
This adaptive allocation saves ~1 MB vs uniform int6, funding a 12th transformer layer while staying under 16 MB.
12 layers (up from 9). Extra depth funded by gradient-guided compression headroom. MLP narrowed to 1408 (from 1536 at 11L) — extra depth outweighs narrower width at this scale.
Batch=524K. Reducing batch size from 786K to 524K tokens gives ~15% more optimization steps (8,060 vs ~7,000) at lower per-step cost (74 ms vs ~84 ms). In a fixed-time budget, the extra gradient updates outweigh the quality benefit of a larger batch.
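The trade-off can be sanity-checked with back-of-envelope arithmetic (assuming a flat 600 s budget and the quoted per-step costs; the PR's exact step counts will also depend on warmup and logging overhead):

```python
# Back-of-envelope: steps achievable in a fixed wallclock budget.
def steps_in_budget(budget_s, step_ms):
    return int(budget_s * 1000 / step_ms)

steps_big_batch = steps_in_budget(600, 84)    # larger batch, slower steps
steps_small_batch = steps_in_budget(600, 74)  # smaller batch, faster steps
# the smaller batch buys roughly a thousand extra optimizer steps
```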
Partial RoPE (16 of 64 dims). Rotary embeddings applied to only 25% of head dimensions. Remaining dims use position-free attention, improving generalization. Zero new parameters.
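A minimal NumPy sketch of the idea, using the half-split (NeoX-style) pairing and illustrative shapes; the PR's actual pairing convention may differ:

```python
import numpy as np

def partial_rope(x, pos, rot=16):
    """Apply rotary embedding to the first `rot` of head_dim dims only.
    x: (seq, head_dim); remaining dims pass through position-free."""
    d = rot // 2
    freqs = 1.0 / (10000 ** (np.arange(d) / d))   # (d,)
    angles = pos[:, None] * freqs[None, :]        # (seq, d)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :d], x[:, d:rot]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot:]], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
out = partial_rope(x, np.arange(8))
# dims 16..63 are untouched; position 0 is unrotated by construction
```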
LN Scale. RMSNorm outputs scaled by 1/sqrt(layer_idx+1). Damps deeper layers' contributions, stabilizing training at 12 layers. Zero new parameters.
XSA (Exclusive Self Attention) on last 4 layers. Removes self-value bias from attention output via orthogonal projection. Forces attention to carry cross-token information only. Zero new parameters.
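One reading of "removes self-value bias via orthogonal projection" is to project each token's attention output orthogonal to that token's own value vector; a hedged sketch (function name illustrative, and the PR's exact projection may differ):

```python
import numpy as np

def remove_self_value(attn_out, v):
    """Subtract each row's component along its own value vector,
    leaving only the cross-token part of the attention output."""
    coef = (attn_out * v).sum(-1, keepdims=True) / (v * v).sum(-1, keepdims=True)
    return attn_out - coef * v

rng = np.random.default_rng(0)
attn_out = rng.standard_normal((6, 16))
v = rng.standard_normal((6, 16))
out = remove_self_value(attn_out, v)
# each output row is now orthogonal to its own value vector
```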
EMA (decay=0.997) replacing SWA. Exponential moving average every step instead of periodic checkpoint averaging. Smoother weight distribution, better generalization and compression.
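The per-step update is standard exponential averaging; a minimal sketch with the PR's decay:

```python
# Per-step EMA of model weights (decay = 0.997), replacing periodic SWA.
def ema_update(ema, weights, decay=0.997):
    return {k: decay * ema[k] + (1 - decay) * weights[k] for k in ema}

ema = {"w": 0.0}
for _ in range(3):                      # three steps toward a fixed target 1.0
    ema = ema_update(ema, {"w": 1.0})
# after n steps at a constant target, ema = 1 - decay**n
```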
Negative finding: Late QAT at 12 layers
We tested Late QAT (STE int6 fake-quantization in the last 4% of training). At 12 layers the per-step overhead (~7ms) forces a lower wallclock cap, costing ~770 training steps. The lost model quality exceeds the quantization improvement: 1.1361 (with Late QAT) vs 1.1321 (without). Late QAT's value depends on the step budget — at high layer counts where step time is already elevated, the throughput cost dominates.
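For reference, the forward half of STE int6 fake-quantization is cheap to sketch (helper name hypothetical; in an autograd framework the straight-through estimator is typically expressed as `w + stop_grad(quant(w) - w)` so gradients pass through unchanged):

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric per-tensor int6 fake quantization: quantize and
    immediately dequantize, modeling inference-time rounding noise."""
    qmax = 2 ** 5 - 1                    # signed int6 levels: [-32, 31]
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

w = np.array([0.5, -0.31, 0.02])
q = fake_quant_int6(w)
# per-element error is bounded by one quantization step (max|w| / 31)
```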
Results
Reproducibility (3 seeds)
Mean: 1.1320 | Std: 0.0002 | Submitted: seed 1337
Run command