Add record: Optimizer Tuning + Sliding Window Eval (val_bpb=1.1864)#321
Open
andreanjos wants to merge 76 commits into openai:main from
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ted activation is more expressive at same param count
…uses 16MB budget better with SwiGLU
…per SwiGLU model benefits from higher Muon LR
…sis: deeper SwiGLU model benefits from higher Muon LR" This reverts commit 0365016.
…er capping sharpens predictions for small vocab
…rtifact headroom for wider model
…s: use artifact headroom for wider model" This reverts commit 090f343.
…: longer LR decay improves final convergence
…better orthogonalization improves gradient quality
…: stabilizes deeper 11-layer model training
…oother second moment helps embedding/scalar optimization
…= 12 effective layers at dim=576 Shares weights between encoder and decoder passes. Frees param budget to increase width from 512→576. This is a structural change: fewer unique params, deeper effective network, wider representation.
… passes = 12 effective layers at dim=576" This reverts commit fdd589a.
…per initial attention helps learning
…is: sharper initial attention helps learning" This reverts commit b311be8.
…sis: more MLP capacity using artifact headroom
… hypothesis: more MLP capacity using artifact headroom" This reverts commit 5749817.
… faster embedding convergence improves BPB
…othesis: faster embedding convergence improves BPB" This reverts commit c7d9c05.
…thesis: more differentiated initial embeddings help early learning
…1 — hypothesis: more differentiated initial embeddings help early learning" This reverts commit 0de4236.
…ypothesis: full momentum from start helps early convergence
…eps) — hypothesis: full momentum from start helps early convergence" This reverts commit dbc921b.
… hypothesis: less outlier accommodation reduces quant error
…99.995 — hypothesis: less outlier accommodation reduces quant error" This reverts commit b1c8cdb.
…r positional decay helps with shorter sequences
…s: faster positional decay helps with shorter sequences" This reverts commit e6f0525.
… heads reduce params while maintaining quality with small vocab
…fewer KV heads reduce params while maintaining quality with small vocab" This reverts commit e65930d.
…ore aggressive clipping aids convergence
…hesis: more aggressive clipping aids convergence" This reverts commit 3283408.
…aw logits allow sharper predictions for compression
…hesis: raw logits allow sharper predictions for compression" This reverts commit d42ebd3.
… more NS iterations improve gradient conditioning further
…p16 instead of int8 quantization The tied embedding serves as both input and output head. Int8 quantization degrades it significantly (~0.007 BPB). FP16 passthrough costs ~500KB extra but nearly eliminates the quant error for this tensor (~0.0005 BPB). Proven by FP16Embed submission on the leaderboard (1.2197 BPB).
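The fp16-passthrough idea in this commit can be sketched as a small quantization helper: int8 with per-tensor scale for everything, except a whitelist of keys stored as fp16. This is illustrative only; the key names and artifact layout are assumptions, not the PR's actual serialization code.

```python
import numpy as np

def quantize_artifact(state_dict, fp16_keys=("embed.weight",)):
    """Int8-quantize every tensor except those in fp16_keys, which are
    stored as fp16. The tied embedding is both the input lookup and the
    output head, so int8 rounding error hits it twice; fp16 passthrough
    costs extra bytes but avoids most of that error. Key names here are
    hypothetical, not the PR's actual ones."""
    artifact = {}
    for name, w in state_dict.items():
        w = np.asarray(w, dtype=np.float32)
        if name in fp16_keys:
            artifact[name] = ("fp16", w.astype(np.float16))
        else:
            # Symmetric per-tensor int8: map max |w| to 127.
            scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            artifact[name] = ("int8", q, scale)
    return artifact
```

For a typical weight tensor, the fp16 round-trip error is far below half an int8 quantization step, which is the mechanism behind the ~0.007 vs ~0.0005 BPB gap the commit cites.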
…ch token with near-max context Instead of non-overlapping seq_len chunks (avg 512 tokens context), use overlapping windows where each token is scored with 960+ tokens of context. Each token scored exactly once. Proven by SOTA submission (1.1925 BPB) — pure eval improvement of ~0.032.
…othesis: longer context during training improves model quality Proven by LongContext submission (1.2058 BPB, -0.019 from baseline). Steps are ~18% slower but quality gain is worth it.
…ntext per scored token improves BPB further
… more context per scored token improves BPB further" This reverts commit 53d65d2.
…her Muon LR proven by FP16Embed submission
…sis: higher Muon LR proven by FP16Embed submission" This reverts commit 9473c45.
…rom Long Context submission, seq2048 needs lower embed LR
…hesis: from Long Context submission, seq2048 needs lower embed LR" This reverts commit 80b33f0.
…s: on H100 this starts warmdown at ~56% through training for smoother convergence
…rtifact headroom for deeper model With warmdown=10000 artifact compressed to 12.6MB. Adding 10th layer uses ~1.5MB extra at convergence. Previously 11 layers blew budget (18.8MB), but 10 should fit safely at ~14.1MB.
…room may accommodate 11th layer with warmdown=10000 Previously 11 layers blew budget at 18.8MB with warmdown=4800. With warmdown=10000, artifact compression is better — 10 layers was 13.4MB. 11th layer adds ~1.5MB → ~14.9MB estimated. Should fit.
… further, 1.3MB headroom left 12 layers = ~22.6M params. With warmdown=10000, 11 layers was 14.7MB. 12th layer adds ~1.5MB → ~16.2MB estimated. Tight but may fit.
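The size estimates in these three commits all follow the same linear rule: a measured compressed-artifact size plus ~1.5MB per additional layer. A trivial helper makes the arithmetic explicit (the function and its default are illustrative, taken from the numbers quoted above):

```python
def estimate_artifact_mb(measured_mb, extra_layers, mb_per_layer=1.5):
    """Linear extrapolation used in the commit messages: start from a
    measured compressed-artifact size and add ~1.5MB per extra layer.
    Purely illustrative of the arithmetic, not a real size model."""
    return measured_mb + extra_layers * mb_per_layer

print(estimate_artifact_mb(12.6, 1))  # 9 -> 10 layers: ~14.1MB
print(estimate_artifact_mb(13.4, 1))  # 10 -> 11 layers: ~14.9MB
print(estimate_artifact_mb(14.7, 1))  # 11 -> 12 layers: ~16.2MB, over budget
```

As the later commits show, the rule was optimistic near the 16MB limit: 11 and 12 layers both overshot at convergence, so the final run settled back to 9-10 layers.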
…is: more aggressive LR decay compresses 12-layer artifact, adds safety margin
…hypothesis: more aggressive LR decay compresses 12-layer artifact, adds safety margin" This reverts commit e85b313.
…mentation Bug: previous impl always scored last stride tokens, missing first-window-scores-all logic and variable-length window handling. Caused BPB degradation instead of improvement. New impl matches the SOTA submission's approach exactly.
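The window scheduling this commit describes can be sketched as a generator: the first window scores all of its tokens, each later window advances by `stride` and scores only its last `stride` tokens (fewer at the tail), so every token is scored exactly once with near-max left context. Function name and defaults are illustrative; the window/stride values match the numbers quoted in the PR.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Yield (start, end, n_score) triples: score the last n_score
    tokens of each [start, end) window. A sketch of the corrected
    sliding-window eval scheme, assuming window=1024 and stride=64."""
    if n_tokens <= window:
        # Short sequence: one window scores everything.
        yield 0, n_tokens, n_tokens
        return
    # First window scores all of its tokens (the missing logic the
    # commit mentions), not just the last stride.
    yield 0, window, window
    pos = window  # first not-yet-scored token
    while pos < n_tokens:
        n_score = min(stride, n_tokens - pos)  # variable-length tail
        end = pos + n_score
        yield end - window, end, n_score
        pos = end
```

With these defaults, every token scored after the first window sees at least 1024 - 64 = 960 tokens of context, matching the "960+ tokens" figure above.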
…is: slightly more decay helps 12-layer model
…hypothesis: slightly more decay helps 12-layer model" This reverts commit 7bc304c.
…hesis: smaller init for deeper 12-layer model
… — hypothesis: smaller init for deeper 12-layer model" This reverts commit 657ceda.
…BPB, 15.2MB artifact 12 layers (17.2MB) and 11 layers (16.3MB) both blew the 16MB budget at H100 convergence. 10 layers fits at 15.2MB with 790KB headroom. On 4xH100 in 10 min: 4663 steps, 1.2074 BPB. On 8xH100 expect ~9000+ steps and even better BPB.
…/3→11/20) 12 layers blew 16MB at convergence with hidden=682. Reduce to hidden=563 (factor 11/20). Estimated: 20.4M params → ~15.5MB artifact. Trades MLP width for 2 extra layers of depth. Ready to test on next H100 run.
…4 BPB on 8xH100 Final config: 9 layers, ReLU² MLP, 512dim, 2048 seq_len, warmdown=10000, muon_backend_steps=10, grad_clip=1.0, beta2=0.99, scalar_lr=0.02, sliding window eval stride=64. Artifact 15.86MB (140KB under 16MB limit). Beats current SOTA (1.1925) by 0.006 BPB. 11,520 steps in 10 min at 52ms/step.
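For reference, the final hyperparameters listed in this commit can be collected into one structure. The field names below are illustrative (the PR does not show its config object); the values mirror the reported settings.

```python
from dataclasses import dataclass

@dataclass
class FinalConfig:
    """Final run configuration from the summary commit. Field names
    are hypothetical; values are the ones reported in the PR."""
    n_layers: int = 9
    mlp_activation: str = "relu2"      # ReLU^2 MLP
    model_dim: int = 512
    seq_len: int = 2048
    warmdown_steps: int = 10_000
    muon_backend_steps: int = 10
    grad_clip: float = 1.0
    adam_beta2: float = 0.99
    scalar_lr: float = 0.02
    eval_window_stride: int = 64       # sliding-window eval stride
```

Collecting them this way also makes the depth decision visible: the winning run is the shallow 9-layer model, after the 11- and 12-layer variants exceeded the 16MB artifact budget.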
Summary
sliding window evaluation (stride=64)
Results (8xH100 SXM)
Mean val_loss: 2.00472 vs SOTA 2.01348 → improvement of 0.00876 nats
t=8.57, df=2, p < 0.01
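The reported statistic is consistent with a one-sample t-test on per-run val_loss deltas; df=2 implies three evaluation runs. The per-run numbers are not given in the PR, so the helper below only shows how such a statistic is computed, with made-up inputs in the test.

```python
import math

def paired_t(deltas):
    """One-sample t-test on per-run deltas (e.g. SOTA loss minus this
    PR's loss, one delta per seed). Returns (t, df) with df = n - 1;
    the reported df=2 implies three runs. Sketch only: the PR's actual
    per-run losses are not listed."""
    n = len(deltas)
    mean = sum(deltas) / n
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)  # mean over its standard error
    return t, n - 1
```

With three runs, t=8.57 corresponds to a one-sided p just under 0.01, matching the summary line.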