Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254 (3-seed mean 1.1256) by alertcat · Pull Request #338 · openai/parameter-golf

alertcat · 2026-03-21T11:07:02Z

11L XSA + EMA + TTT + Int6 MLP3x

val_bpb = 1.1254 (sliding window stride=64, best seed 42) | 15.55 MB artifact | 8xH100 SXM, 600s

Key Innovation: TTT on XSA+EMA baseline

First submission combining XSA (Exclusive Self Attention) + EMA + Test-Time Training. After training and quantization, TTT performs 3 epochs of SGD fine-tuning on the validation token stream, adapting the model to the test distribution.

Results (3-seed, 8xH100 SXM)

Seed	Steps	Sliding BPB (s64)	Artifact
1337	7,070	1.1258	15.55 MB
42	7,068	1.1254	15.55 MB
2024	7,069	1.1256	15.55 MB

Mean: 1.1256 | Std: 0.0002
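The summary line can be rechecked from the three per-seed values in the table. The sketch below assumes the reported Std is the sample standard deviation (n - 1 denominator); the PR does not say which estimator was used.

```python
import statistics

# Sliding-window BPB per seed, copied from the results table above.
bpb_by_seed = {1337: 1.1258, 42: 1.1254, 2024: 1.1256}

values = list(bpb_by_seed.values())
mean_bpb = statistics.mean(values)
# Assumption: sample standard deviation (n - 1 in the denominator).
std_bpb = statistics.stdev(values)
# Lower BPB is better, so the best seed is the one with the minimum value.
best_seed = min(bpb_by_seed, key=bpb_by_seed.get)

print(f"mean={mean_bpb:.4f} std={std_bpb:.4f} best_seed={best_seed}")
```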

TTT Details

3 epochs SGD on validation tokens (lr=0.002, momentum=0.9)
First 2 transformer blocks frozen for stability
~47 seconds on 8xH100 (well under 600s eval limit)
Improves post-quant BPB by ~0.002
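The update rule described above can be sketched in plain Python. This is a toy illustration, not the PR's training code: the parameter names (`blocks.0` etc.), the 4-block model, and the stand-in quadratic loss are hypothetical, and the momentum form (velocity = momentum * velocity + grad) is an assumption matching the common PyTorch convention. Only lr=0.002, momentum=0.9, 3 epochs, and the first two blocks being frozen come from the list above.

```python
def sgd_momentum_step(params, grads, velocity, frozen, lr=0.002, momentum=0.9):
    """One in-place SGD-with-momentum update that skips frozen parameters."""
    for name, weights in params.items():
        if name in frozen:
            continue  # frozen blocks keep their post-quantization weights
        for i, g in enumerate(grads[name]):
            velocity[name][i] = momentum * velocity[name][i] + g
            weights[i] -= lr * velocity[name][i]

# Hypothetical 4-block model with two weights per block.
params = {f"blocks.{i}": [1.0, -1.0] for i in range(4)}
velocity = {name: [0.0] * len(w) for name, w in params.items()}
frozen = {"blocks.0", "blocks.1"}  # first 2 transformer blocks frozen

for epoch in range(3):  # 3 epochs, as in the TTT details
    # Stand-in gradient: for L = 0.5 * w^2, dL/dw is w itself.
    grads = {name: list(w) for name, w in params.items()}
    sgd_momentum_step(params, grads, velocity, frozen)

print(params["blocks.0"], params["blocks.3"])
```

Freezing is expressed here by skipping the update entirely, so frozen blocks also accumulate no momentum state.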

Architecture (from PR #315)

11L, 512d, 8H/4KV, MLP 3x, relu-squared
XSA on last 4 layers, EMA (decay=0.997)
SmearGate + BigramHash(2048) + OrthoInit
Int6 QAT + Late QAT + zstd-22
FlashAttention 3, Muon WD=0.04
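To make the "Int6" entry concrete, here is a minimal sketch of 6-bit quantization. It assumes symmetric per-row scaling into the signed range [-32, 31]; the PR does not describe its exact scheme, so the function names and the scaling rule are illustrative only.

```python
def quantize_int6(row):
    """Symmetric per-row quantization to integers in [-32, 31] (assumed scheme)."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-32, min(31, round(x / scale))) for x in row]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]

row = [0.31, -0.12, 0.05, -0.31, 0.0]
q, scale = quantize_int6(row)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))

print(q, scale, max_err)
```

With this rule the round-trip error per weight is bounded by half the scale, which is the kind of loss that quantization-aware training is meant to make the model tolerate.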

Eval Timing

Training: 600s | TTT: 47s | Sliding eval: 73s | Total eval: ~120s
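The "sliding eval" entry refers to the stride-64 sliding-window scoring quoted in the headline number. A rough sketch of how such windows can be laid out follows; the function name and the small `seq_len` are hypothetical, and only the stride of 64 comes from the PR. Each window after the first scores only its final tokens, so every token is scored exactly once with preceding context.

```python
def sliding_windows(num_tokens, seq_len, stride):
    """Return (start, end, score_from) triples; tokens in [score_from, end) are scored."""
    spans = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        # Tokens before score_from serve only as context for this window.
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans

# Illustrative sizes: 300 tokens, window of 128, stride of 64.
spans = sliding_windows(300, 128, 64)
for start, end, score_from in spans:
    print(f"window [{start}, {end}) scores [{score_from}, {end})")
```

A smaller stride gives each scored token more context at the cost of more forward passes, which is why this step has its own line in the timing budget.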

Reproduction

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: zstandard in e:\anaconda\lib\site-packages (0.23.0)

Built on PR #315 (XSA, EMA, SmearGate, BigramHash, OrthoInit, sliding window eval).

Innovation over PR openai#198 (SOTA 1.1318):
- 12 transformer layers (was 11): +2.2M params, better representation
- Int5 quantization for MLP weights [-16,15]: 3 zero high bits
- zstd compression 1.88x vs int6 1.51x, saves ~1.8MB
- Funds the 12th layer within 16MB budget
- Int6 kept for attention weights (precision-sensitive)
- FA3 fallback for older PyTorch
- LR=0.025 (validated as optimal in A/B testing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
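The claim that narrower integers compress better when stored one per byte can be illustrated with a small experiment. This sketch swaps in `zlib` from the standard library instead of zstd and uses uniformly random synthetic values, so the ratios will not match the 1.88x and 1.51x quoted above (real weights are not uniform); it only shows the direction of the effect. The helper name `pack_as_int8` is hypothetical.

```python
import random
import zlib

def pack_as_int8(values):
    """Store each signed value in one byte (two's complement)."""
    return bytes(v & 0xFF for v in values)

rng = random.Random(0)
n = 50_000
# Synthetic stand-ins for quantized weights; real weights are not uniform.
int5 = [rng.randint(-16, 15) for _ in range(n)]
int6 = [rng.randint(-32, 31) for _ in range(n)]

# Compression ratio = raw bytes / compressed bytes (zlib used in place of zstd).
ratio5 = n / len(zlib.compress(pack_as_int8(int5), 9))
ratio6 = n / len(zlib.compress(pack_as_int8(int6), 9))

print(f"int5 ratio={ratio5:.2f} int6 ratio={ratio6:.2f}")
```

In an int8 container, a 5-bit value leaves three high bits that carry no extra information, and an entropy coder can recover most of that redundancy.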

New CUDA presets:
- pr332_12l_xsa: 12L/2xMLP, seq2048, momentum 0.99 (from PR openai#332)
- pr338_11l_ttt: 11L/2xMLP, seq2048, momentum 0.99 (from PR openai#338)
- bft_ensemble: 9L/3xMLP Byzantine fault tolerant checkpoint config
- difficulty_adjusted: 10L/2xMLP adaptive search with tight LR
- partial_rope_headtemp: baseline arch with novel attention params

Expanded search: NUM_LAYERS includes 11, TRAIN_SEQ_LEN includes 4096.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Move EMA shadow weights to GPU (CPU transfers cost ~32% throughput)
- Increase train seq_len from 1024 to 2048 (matches record PR openai#338)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

3-seed mean: 1.1371 (seeds 42, 7, 2024)
Dynamic evaluation (Krause et al., ICML 2018) applied during sliding window scoring. 2.0% consistent bpb improvement at zero artifact cost.
Built on PR openai#315 (jfprincz) and PR openai#338 (alertcat).

v21: 11L + no-QAT + SWA + TTT + SmearGate + OrthoInit (1.1393 BPB)
v24: PR openai#338 SOTA stack (partial RoPE, LN scale, late QAT, XSA4, EMA)
run_modal.py: Modal cloud runner for 8xH100

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442) flagged as potentially invalid for adapting on eval tokens BEFORE scoring them. Added correct score-then-adapt protocol with implementation guide. https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y
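The difference between the two orderings can be shown with a toy adaptive model. The sketch below is not the referenced implementation guide: it uses a hypothetical Laplace-smoothed byte unigram model in place of a transformer, and "adapting" means incrementing counts instead of taking gradient steps. It only illustrates why adapting on a chunk before scoring it yields an optimistic number.

```python
import math

def chunk_bits(counts, chunk):
    """Total code length in bits of a chunk under the current unigram counts."""
    total = sum(counts)
    return sum(-math.log2(counts[b] / total) for b in chunk)

def evaluate(data, chunk_size, adapt_first):
    counts = [1] * 256  # Laplace-smoothed unigram model over bytes (toy model)
    bits = 0.0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        if adapt_first:
            # Leaky: the model sees the chunk before it is scored.
            for b in chunk:
                counts[b] += 1
            bits += chunk_bits(counts, chunk)
        else:
            # Score-then-adapt: each chunk is scored with past data only.
            bits += chunk_bits(counts, chunk)
            for b in chunk:
                counts[b] += 1
    return bits / len(data)  # bits per byte

data = b"the quick brown fox jumps over the lazy dog " * 50
valid_bpb = evaluate(data, 64, adapt_first=False)
leaky_bpb = evaluate(data, 64, adapt_first=True)

print(f"score-then-adapt: {valid_bpb:.3f} bpb, adapt-then-score: {leaky_bpb:.3f} bpb")
```

For this count-based model the adapt-first score can never be worse than the score-first one on the same chunk, which is exactly the leakage the flagged submissions are suspected of.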

New addition: EMA (decay=0.9999) shadow model, eval uses EMA weights. EMA coexists with SWA. Zero artifact cost. Consistent with PR openai#338 (best open PR, 1.1254 bpb) which also uses EMA.
11th layer ruled out: needs ~0.91MB, only ~0.36MB budget available.

Full stack on thwu1 base (1.1428):
- TrigramHash(20480, dim=32): trigram embeddings, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers (PR openai#287)
- EMA: decay=0.9999, shadow model used at final eval
- TTT: 3-epoch SGD on val tokens, all ranks, ~47s budget

Artifact: ~15.64MB. H100 validation pending.
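An EMA shadow model of the kind mentioned here is usually maintained with a one-line update per step. The sketch below assumes the standard rule shadow = decay * shadow + (1 - decay) * weights; the function names and the single-weight toy "model" are hypothetical. It also compares the two decay values that appear on this page (0.997 in PR #338, 0.9999 in this commit) via the usual 1 / (1 - decay) rule of thumb for the averaging horizon.

```python
def ema_update(shadow, weights, decay):
    """In-place update: shadow <- decay * shadow + (1 - decay) * weights."""
    for i, w in enumerate(weights):
        shadow[i] = decay * shadow[i] + (1.0 - decay) * w

def effective_horizon(decay):
    """Rule-of-thumb number of recent steps that dominate the average."""
    return 1.0 / (1.0 - decay)

weights = [0.0]
shadow = list(weights)
for step in range(1, 1001):
    weights[0] = float(step)  # stand-in for an optimizer step
    ema_update(shadow, weights, decay=0.997)

print(f"live={weights[0]:.1f} shadow={shadow[0]:.1f}")
print(f"horizon(0.997)={effective_horizon(0.997):.0f} steps")
print(f"horizon(0.9999)={effective_horizon(0.9999):.0f} steps")
```

Because only the averaged weights are exported, the shadow copy costs memory during training but adds nothing to the artifact size.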

T4 ablation (1000 steps, 4 variants):

Variant	Config	Loss	Note
V2	bigram=10240, no trigram	5.4379	WINNER
V4	bigram=8192 + trigram=8192	5.6956
V3	bigram=4096 + trigram=20480	5.7924	was our submission
V1	bigram=4096, no trigram	5.8414

TrigramHash adds noise, bigram reduction actively hurts. Restored bigram=10240.
Stack is now: XSA + EMA + TTT on thwu1 base. These are proven techniques (XSA from PR openai#287, EMA+TTT from PR openai#338 lineage) applied cleanly on the openai#1 submission.
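One plausible reading of why a smaller bigram table hurts is that more distinct token pairs end up sharing a bucket. The sketch below illustrates that effect only; the hash function is a hypothetical multiplicative hash, not the one used by BigramHash in these PRs, and the token pairs are synthetic. The ablation itself does not establish collisions as the cause.

```python
def bigram_bucket(prev_token, cur_token, num_buckets):
    """Map a token pair to a bucket index (hypothetical hash, not the PR's)."""
    return (prev_token * 1_000_003 + cur_token) % num_buckets

def collision_rate(pairs, num_buckets):
    """Fraction of distinct pairs that do not get a bucket of their own."""
    distinct = set(pairs)
    buckets = {bigram_bucket(p, c, num_buckets) for p, c in distinct}
    return 1.0 - len(buckets) / len(distinct)

# 20,000 synthetic token pairs; real bigram statistics are far more skewed.
pairs = [(p, c) for p in range(200) for c in range(100)]
rates = {size: collision_rate(pairs, size) for size in (4096, 10240)}

for size, rate in rates.items():
    print(f"buckets={size}: collision rate {rate:.3f}")
```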

alertcat and others added 9 commits March 20, 2026 21:22

PR198 SOTA + FA3 fallback + LR0.025 + run script

3157704

Fix: enable QAT (was 0, should be 1) - reduces quant loss 3x

bedcff8

8xH100 3-seed results: sliding BPB 1.1539-1.1543

7511dff

Add non-record submission: 12L Int5-MLP, sliding BPB 1.1541

d21e7fb

Fix: TTT code on main, BigramHash=2048, FA3 install script

0c9924a

Fix: add zstandard install (critical for <16MB), update run script

a02847c

3-seed PR315+TTT: sliding BPB 1.1254-1.1258, artifact 15.55MB

934b4a6

Record: 11L XSA+EMA+TTT, sliding BPB 1.1254

5d7082e

notapplica mentioned this pull request Mar 21, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

sheeki03 mentioned this pull request Mar 21, 2026

Record: 11L Backout + Int6 + SWA (val_bpb: 1.1364) #339

Open

shivnarainms22 mentioned this pull request Mar 21, 2026

Non-record: 10L Int5-MLP + TTT + Backout Connection (val_bpb=1.1574 on 8xH100 SXM) #366

Open

translatingthename mentioned this pull request Mar 22, 2026

Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364) #397

Open

leloykun mentioned this pull request Mar 22, 2026

Invalid submissions due to information leakage during TTT #402

Open

alertcat wants to merge 9 commits into openai:main from alertcat:submission-pr315-ttt
