
Add record: Optimizer Tuning + Sliding Window Eval (val_bpb=1.1864) #321

Open
andreanjos wants to merge 76 commits into openai:main from andreanjos:autoresearch/parameter-golf

Conversation

@andreanjos

Summary

  • Optimizer tuning (warmdown=10000, muon_backend_steps=10, grad_clip=1.0, beta2=0.99, scalar_lr=0.02) + seq2048 training +
    sliding window evaluation (stride=64)
  • Same 9-layer 512dim ReLU² architecture as baseline — no model architecture changes
  • Post-quant int8+zlib artifact under 16,000,000-byte cap
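
The tuned hyperparameters above can be gathered into a single config sketch (the key names here are hypothetical; the record's actual training script may use different identifiers):

```python
# Hypothetical key names; the record's training script may differ.
config = {
    # optimizer tuning
    "warmdown_steps": 10000,      # longer LR decay tail
    "muon_backend_steps": 10,     # Newton-Schulz iterations in Muon
    "grad_clip": 1.0,
    "adam_beta2": 0.99,           # smoother second moment for embedding/scalar params
    "scalar_lr": 0.02,
    # data / evaluation
    "seq_len": 2048,              # longer training context
    "eval_stride": 64,            # sliding-window evaluation stride
    # architecture (unchanged from baseline)
    "n_layers": 9,
    "dim": 512,
    "mlp_activation": "relu^2",
}
```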

Results (8xH100 SXM)

Seed   Steps    val_loss   val_bpb   Artifact (bytes)
1337   11,520   2.00321    1.18642   15,861,337
1338   11,520   2.00428    1.18705   15,859,751
1339   11,523   2.00667    1.18847   15,867,480

Mean val_loss: 2.00472 vs SOTA 2.01348 → improvement of 0.00876 nats
One-sample t-test on the per-seed improvements: t=8.57, df=2, p < 0.01
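
The statistics can be reproduced with a one-sample t-test on the per-seed improvements over the SOTA mean (a sketch; this assumes the reported t was computed this way):

```python
import math

sota = 2.01348
val_losses = [2.00321, 2.00428, 2.00667]  # seeds 1337, 1338, 1339

diffs = [sota - v for v in val_losses]               # per-seed improvement in nats
n = len(diffs)
mean = sum(diffs) / n                                # 0.00876
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance, df = n - 1 = 2
t = mean / math.sqrt(var / n)                        # one-sample t statistic
print(f"mean improvement = {mean:.5f} nats, t = {t:.2f}, df = {n - 1}")
```

On the rounded losses in the table this gives t ≈ 8.56; the reported 8.57 presumably comes from the unrounded values.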

autoresearch and others added 30 commits March 18, 2026 15:27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ted activation is more expressive at same param count

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uses 16MB budget better with SwiGLU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…per SwiGLU model benefits from higher Muon LR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sis: deeper SwiGLU model benefits from higher Muon LR"

This reverts commit 0365016.
…er capping sharpens predictions for small vocab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rtifact headroom for wider model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s: use artifact headroom for wider model"

This reverts commit 090f343.
…: longer LR decay improves final convergence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…better orthogonalization improves gradient quality

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…: stabilizes deeper 11-layer model training

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oother second moment helps embedding/scalar optimization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…= 12 effective layers at dim=576

Shares weights between encoder and decoder passes. Frees param budget to increase width from 512→576.
This is a structural change: fewer unique params, deeper effective network, wider representation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… passes = 12 effective layers at dim=576"

This reverts commit fdd589a.
…per initial attention helps learning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…is: sharper initial attention helps learning"

This reverts commit b311be8.
…sis: more MLP capacity using artifact headroom

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… hypothesis: more MLP capacity using artifact headroom"

This reverts commit 5749817.
… faster embedding convergence improves BPB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…othesis: faster embedding convergence improves BPB"

This reverts commit c7d9c05.
…thesis: more differentiated initial embeddings help early learning

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1 — hypothesis: more differentiated initial embeddings help early learning"

This reverts commit 0de4236.
…ypothesis: full momentum from start helps early convergence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eps) — hypothesis: full momentum from start helps early convergence"

This reverts commit dbc921b.
… hypothesis: less outlier accommodation reduces quant error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…99.995 — hypothesis: less outlier accommodation reduces quant error"

This reverts commit b1c8cdb.
…r positional decay helps with shorter sequences

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s: faster positional decay helps with shorter sequences"

This reverts commit e6f0525.
… heads reduce params while maintaining quality with small vocab

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fewer KV heads reduce params while maintaining quality with small vocab"

This reverts commit e65930d.
autoresearch and others added 29 commits March 19, 2026 10:03
…ore aggressive clipping aids convergence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hesis: more aggressive clipping aids convergence"

This reverts commit 3283408.
…aw logits allow sharper predictions for compression

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hesis: raw logits allow sharper predictions for compression"

This reverts commit d42ebd3.
… more NS iterations improve gradient conditioning further

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…p16 instead of int8 quantization

The tied embedding serves as both input and output head. Int8 quantization
degrades it significantly (~0.007 BPB). FP16 passthrough costs ~500KB extra
but nearly eliminates the quant error for this tensor (~0.0005 BPB).
Proven by FP16Embed submission on the leaderboard (1.2197 BPB).
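
The trade-off described here (int8 rounding error on the tied embedding vs. a small size cost for fp16) can be illustrated with a minimal symmetric per-tensor int8 round trip — a numpy sketch on toy data, not the record's actual quantizer:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.02, size=(256, 512)).astype(np.float32)  # toy tied embedding

# Symmetric per-tensor int8 quantization: a single scale for the whole tensor.
scale = np.abs(emb).max() / 127.0
q = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
deq = q.astype(np.float32) * scale

int8_err = np.abs(deq - emb).max()
fp16_err = np.abs(emb.astype(np.float16).astype(np.float32) - emb).max()
print(f"max round-trip error: int8 {int8_err:.2e} vs fp16 {fp16_err:.2e}")
```

fp16 keeps ~11 bits of mantissa, so its round-trip error on values of this magnitude sits well below the int8 step size — which is why passing the embedding through in fp16 nearly eliminates its share of the quantization error.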

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ch token with near-max context

Instead of non-overlapping seq_len chunks (avg 512 tokens context), use overlapping
windows where each token is scored with 960+ tokens of context. Each token scored
exactly once. Proven by SOTA submission (1.1925 BPB) — pure eval improvement of ~0.032.
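
The scoring scheme can be sketched as pure index bookkeeping (the function name and the window length of 1024 are assumptions; the point is that every token is scored exactly once, and all but the first window score only their trailing `stride` tokens):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (window_start, score_start, score_end): which tokens each
    overlapping window is responsible for scoring."""
    spans = []
    # The first window scores all of its tokens.
    end = min(window, n_tokens)
    spans.append((0, 0, end))
    pos = end
    while pos < n_tokens:
        # Later windows score only their last `stride` tokens, so each
        # scored token sees up to window - stride = 960 tokens of context.
        score_end = min(pos + stride, n_tokens)
        start = max(0, score_end - window)
        spans.append((start, pos, score_end))
        pos = score_end
    return spans

spans = sliding_window_spans(5000)
covered = [t for _, s, e in spans for t in range(s, e)]
assert covered == list(range(5000))  # every token scored exactly once
```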

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…othesis: longer context during training improves model quality

Proven by LongContext submission (1.2058 BPB, -0.019 from baseline).
Steps are ~18% slower but quality gain is worth it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntext per scored token improves BPB further

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… more context per scored token improves BPB further"

This reverts commit 53d65d2.
…her Muon LR proven by FP16Embed submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sis: higher Muon LR proven by FP16Embed submission"

This reverts commit 9473c45.
…rom Long Context submission, seq2048 needs lower embed LR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hesis: from Long Context submission, seq2048 needs lower embed LR"

This reverts commit 80b33f0.
…s: on H100 this starts warmdown at ~56% through training for smoother convergence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rtifact headroom for deeper model

With warmdown=10000 artifact compressed to 12.6MB. Adding 10th layer uses ~1.5MB extra at convergence.
Previously 11 layers blew budget (18.8MB), but 10 should fit safely at ~14.1MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…room may accommodate 11th layer with warmdown=10000

Previously 11 layers blew budget at 18.8MB with warmdown=4800.
With warmdown=10000, artifact compression is better — 10 layers was 13.4MB.
11th layer adds ~1.5MB → ~14.9MB estimated. Should fit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… further, 1.3MB headroom left

12 layers = ~22.6M params. With warmdown=10000, 11 layers was 14.7MB.
12th layer adds ~1.5MB → ~16.2MB estimated. Tight but may fit.
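
The back-of-envelope estimate in these messages is a simple linear extrapolation from measured compressed sizes (a sketch of the arithmetic only; the ~1.5 MB-per-layer figure is the commit's own empirical estimate):

```python
# Measured with warmdown=10000: 11 layers compressed to ~14.7 MB,
# and each extra layer adds ~1.5 MB post-int8+zlib (empirical estimate).
mb_at_11_layers = 14.7
mb_per_layer = 1.5
budget_mb = 16.0

est_12 = mb_at_11_layers + mb_per_layer  # 12-layer estimate
headroom = budget_mb - est_12
print(f"12-layer estimate {est_12:.1f} MB, headroom {headroom:+.1f} MB")
```

The extrapolation correctly flags 12 layers as over budget with essentially no margin, which the later measured run confirmed.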

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…is: more aggressive LR decay compresses 12-layer artifact, adds safety margin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hypothesis: more aggressive LR decay compresses 12-layer artifact, adds safety margin"

This reverts commit e85b313.
…mentation

Bug: the previous implementation always scored only the last `stride` tokens of every
window, missing both the first-window-scores-all logic and the variable-length
final-window handling. This caused BPB degradation instead of improvement.
The new implementation matches the SOTA submission's approach exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…is: slightly more decay helps 12-layer model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hypothesis: slightly more decay helps 12-layer model"

This reverts commit 7bc304c.
…hesis: smaller init for deeper 12-layer model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… — hypothesis: smaller init for deeper 12-layer model"

This reverts commit 657ceda.
…BPB, 15.2MB artifact

12 layers (17.2MB) and 11 layers (16.3MB) both blew the 16MB budget at H100 convergence.
10 layers fits at 15.2MB with 790KB headroom. On 4xH100 in 10 min: 4663 steps, 1.2074 BPB.
On 8xH100 expect ~9000+ steps and even better BPB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…/3→11/20)

12 layers blew 16MB at convergence with hidden=682. Reduce to hidden=563 (factor 11/20).
Estimated: 20.4M params → ~15.5MB artifact. Trades MLP width for 2 extra layers of depth.
Ready to test on next H100 run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…4 BPB on 8xH100

Final config: 9 layers, ReLU² MLP, 512dim, 2048 seq_len, warmdown=10000,
muon_backend_steps=10, grad_clip=1.0, beta2=0.99, scalar_lr=0.02,
sliding window eval stride=64. Artifact 15.86MB (140KB under 16MB limit).

Beats the current SOTA (1.1925) by ~0.006 bits per byte. 11,520 steps in 10 min at 52ms/step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>