Add record: Optimizer Tuning + Sliding Window Eval (val_bpb=1.1864)#321
Open
andreanjos wants to merge 76 commits into openai:main from
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ted activation is more expressive at same param count
…uses 16MB budget better with SwiGLU
…per SwiGLU model benefits from higher Muon LR
…sis: deeper SwiGLU model benefits from higher Muon LR" This reverts commit 0365016.
…er capping sharpens predictions for small vocab
…rtifact headroom for wider model
…s: use artifact headroom for wider model" This reverts commit 090f343.
…: longer LR decay improves final convergence
…better orthogonalization improves gradient quality
…: stabilizes deeper 11-layer model training
…oother second moment helps embedding/scalar optimization
…= 12 effective layers at dim=576 Shares weights between encoder and decoder passes. Frees param budget to increase width from 512→576. This is a structural change: fewer unique params, deeper effective network, wider representation.
… passes = 12 effective layers at dim=576" This reverts commit fdd589a.
…per initial attention helps learning
…is: sharper initial attention helps learning" This reverts commit b311be8.
…sis: more MLP capacity using artifact headroom
… hypothesis: more MLP capacity using artifact headroom" This reverts commit 5749817.
… faster embedding convergence improves BPB
…othesis: faster embedding convergence improves BPB" This reverts commit c7d9c05.
…thesis: more differentiated initial embeddings help early learning
…1 — hypothesis: more differentiated initial embeddings help early learning" This reverts commit 0de4236.
…ypothesis: full momentum from start helps early convergence
…eps) — hypothesis: full momentum from start helps early convergence" This reverts commit dbc921b.
… hypothesis: less outlier accommodation reduces quant error
…99.995 — hypothesis: less outlier accommodation reduces quant error" This reverts commit b1c8cdb.
…r positional decay helps with shorter sequences
…s: faster positional decay helps with shorter sequences" This reverts commit e6f0525.
… heads reduce params while maintaining quality with small vocab
…fewer KV heads reduce params while maintaining quality with small vocab" This reverts commit e65930d.
…ore aggressive clipping aids convergence
…hesis: more aggressive clipping aids convergence" This reverts commit 3283408.
…aw logits allow sharper predictions for compression
…hesis: raw logits allow sharper predictions for compression" This reverts commit d42ebd3.
… more NS iterations improve gradient conditioning further
…p16 instead of int8 quantization The tied embedding serves as both input and output head. Int8 quantization degrades it significantly (~0.007 BPB). FP16 passthrough costs ~500KB extra but nearly eliminates the quant error for this tensor (~0.0005 BPB). Proven by FP16Embed submission on the leaderboard (1.2197 BPB).
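The fp16-passthrough idea in this commit can be sketched as a small quantization helper: int8 with per-tensor scale for everything, except a whitelist of keys stored as fp16. This is illustrative only; the key names and artifact layout are assumptions, not the PR's actual serialization code.

```python
import numpy as np

def quantize_artifact(state_dict, fp16_keys=("embed.weight",)):
    """Int8-quantize every tensor except those in fp16_keys, which are
    stored as fp16. The tied embedding is both the input lookup and the
    output head, so int8 rounding error hits it twice; fp16 passthrough
    costs extra bytes but avoids most of that error. Key names here are
    hypothetical, not the PR's actual ones."""
    artifact = {}
    for name, w in state_dict.items():
        w = np.asarray(w, dtype=np.float32)
        if name in fp16_keys:
            artifact[name] = ("fp16", w.astype(np.float16))
        else:
            # Symmetric per-tensor int8: map max |w| to 127.
            scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            artifact[name] = ("int8", q, scale)
    return artifact
```

For a typical weight tensor, the fp16 round-trip error is far below half an int8 quantization step, which is the mechanism behind the ~0.007 vs ~0.0005 BPB gap the commit cites.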
…ch token with near-max context Instead of non-overlapping seq_len chunks (avg 512 tokens context), use overlapping windows where each token is scored with 960+ tokens of context. Each token scored exactly once. Proven by SOTA submission (1.1925 BPB) — pure eval improvement of ~0.032.
…othesis: longer context during training improves model quality Proven by LongContext submission (1.2058 BPB, -0.019 from baseline). Steps are ~18% slower but quality gain is worth it.
…ntext per scored token improves BPB further
… more context per scored token improves BPB further" This reverts commit 53d65d2.
…her Muon LR proven by FP16Embed submission
…sis: higher Muon LR proven by FP16Embed submission" This reverts commit 9473c45.
…rom Long Context submission, seq2048 needs lower embed LR
…hesis: from Long Context submission, seq2048 needs lower embed LR" This reverts commit 80b33f0.
…s: on H100 this starts warmdown at ~56% through training for smoother convergence
…rtifact headroom for deeper model With warmdown=10000 artifact compressed to 12.6MB. Adding 10th layer uses ~1.5MB extra at convergence. Previously 11 layers blew budget (18.8MB), but 10 should fit safely at ~14.1MB.
…room may accommodate 11th layer with warmdown=10000 Previously 11 layers blew budget at 18.8MB with warmdown=4800. With warmdown=10000, artifact compression is better — 10 layers was 13.4MB. 11th layer adds ~1.5MB → ~14.9MB estimated. Should fit.
… further, 1.3MB headroom left 12 layers = ~22.6M params. With warmdown=10000, 11 layers was 14.7MB. 12th layer adds ~1.5MB → ~16.2MB estimated. Tight but may fit.
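The size estimates in these three commits all follow the same linear rule: a measured compressed-artifact size plus ~1.5MB per additional layer. A trivial helper makes the arithmetic explicit (the function and its default are illustrative, taken from the numbers quoted above):

```python
def estimate_artifact_mb(measured_mb, extra_layers, mb_per_layer=1.5):
    """Linear extrapolation used in the commit messages: start from a
    measured compressed-artifact size and add ~1.5MB per extra layer.
    Purely illustrative of the arithmetic, not a real size model."""
    return measured_mb + extra_layers * mb_per_layer

print(estimate_artifact_mb(12.6, 1))  # 9 -> 10 layers: ~14.1MB
print(estimate_artifact_mb(13.4, 1))  # 10 -> 11 layers: ~14.9MB
print(estimate_artifact_mb(14.7, 1))  # 11 -> 12 layers: ~16.2MB, over budget
```

As the later commits show, the rule was optimistic near the 16MB limit: 11 and 12 layers both overshot at convergence, so the final run settled back to 9-10 layers.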
…is: more aggressive LR decay compresses 12-layer artifact, adds safety margin
…hypothesis: more aggressive LR decay compresses 12-layer artifact, adds safety margin" This reverts commit e85b313.
…mentation Bug: previous impl always scored last stride tokens, missing first-window-scores-all logic and variable-length window handling. Caused BPB degradation instead of improvement. New impl matches the SOTA submission's approach exactly.
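The window scheduling this commit describes can be sketched as a generator: the first window scores all of its tokens, each later window advances by `stride` and scores only its last `stride` tokens (fewer at the tail), so every token is scored exactly once with near-max left context. Function name and defaults are illustrative; the window/stride values match the numbers quoted in the PR.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Yield (start, end, n_score) triples: score the last n_score
    tokens of each [start, end) window. A sketch of the corrected
    sliding-window eval scheme, assuming window=1024 and stride=64."""
    if n_tokens <= window:
        # Short sequence: one window scores everything.
        yield 0, n_tokens, n_tokens
        return
    # First window scores all of its tokens (the missing logic the
    # commit mentions), not just the last stride.
    yield 0, window, window
    pos = window  # first not-yet-scored token
    while pos < n_tokens:
        n_score = min(stride, n_tokens - pos)  # variable-length tail
        end = pos + n_score
        yield end - window, end, n_score
        pos = end
```

With these defaults, every token scored after the first window sees at least 1024 - 64 = 960 tokens of context, matching the "960+ tokens" figure above.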
…is: slightly more decay helps 12-layer model
…hypothesis: slightly more decay helps 12-layer model" This reverts commit 7bc304c.
…hesis: smaller init for deeper 12-layer model
… — hypothesis: smaller init for deeper 12-layer model" This reverts commit 657ceda.
…BPB, 15.2MB artifact 12 layers (17.2MB) and 11 layers (16.3MB) both blew the 16MB budget at H100 convergence. 10 layers fits at 15.2MB with 790KB headroom. On 4xH100 in 10 min: 4663 steps, 1.2074 BPB. On 8xH100 expect ~9000+ steps and even better BPB.
…/3→11/20) 12 layers blew 16MB at convergence with hidden=682. Reduce to hidden=563 (factor 11/20). Estimated: 20.4M params → ~15.5MB artifact. Trades MLP width for 2 extra layers of depth. Ready to test on next H100 run.
…4 BPB on 8xH100 Final config: 9 layers, ReLU² MLP, 512dim, 2048 seq_len, warmdown=10000, muon_backend_steps=10, grad_clip=1.0, beta2=0.99, scalar_lr=0.02, sliding window eval stride=64. Artifact 15.86MB (140KB under 16MB limit). Beats current SOTA (1.1925) by 0.006 BPB. 11,520 steps in 10 min at 52ms/step.
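For reference, the final hyperparameters listed in this commit can be collected into one structure. The field names below are illustrative (the PR does not show its config object); the values mirror the reported settings.

```python
from dataclasses import dataclass

@dataclass
class FinalConfig:
    """Final run configuration from the summary commit. Field names
    are hypothetical; values are the ones reported in the PR."""
    n_layers: int = 9
    mlp_activation: str = "relu2"      # ReLU^2 MLP
    model_dim: int = 512
    seq_len: int = 2048
    warmdown_steps: int = 10_000
    muon_backend_steps: int = 10
    grad_clip: float = 1.0
    adam_beta2: float = 0.99
    scalar_lr: float = 0.02
    eval_window_stride: int = 64       # sliding-window eval stride
```

Collecting them this way also makes the depth decision visible: the winning run is the shallow 9-layer model, after the 11- and 12-layer variants exceeded the 16MB artifact budget.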
Summary
sliding window evaluation (stride=64)
Results (8xH100 SXM)
Mean val_loss: 2.00472 vs SOTA 2.01348 → improvement of 0.00876 nats
t=8.57, df=2, p < 0.01
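The reported statistic is consistent with a one-sample t-test on per-run val_loss deltas; df=2 implies three evaluation runs. The per-run numbers are not given in the PR, so the helper below only shows how such a statistic is computed, with made-up inputs in the test.

```python
import math

def paired_t(deltas):
    """One-sample t-test on per-run deltas (e.g. SOTA loss minus this
    PR's loss, one delta per seed). Returns (t, df) with df = n - 1;
    the reported df=2 implies three runs. Sketch only: the PR's actual
    per-run losses are not listed."""
    n = len(deltas)
    mean = sum(deltas) / n
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)  # mean over its standard error
    return t, n - 1
```

With three runs, t=8.57 corresponds to a one-sided p just under 0.01, matching the summary line.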