Submission/qat bigram12k stride32 by fbedev · Pull Request #348 · openai/parameter-golf

fbedev · 2026-03-21T15:22:55Z

QAT + BigramHash(12288) + Stride 32 — 1.1444 bpb

Body:

Summary

QAT with STE (int5 MLP / int6 attn) reduces post-quantization degradation
BigramHash increased from 10240 to 12288
Eval stride reduced from 64 to 32
Magnitude pruning 5%, SWA every 25 steps
Artifact: 15.90MB

Results

seed=2024: val_bpb=1.14443
8xH100 SXM, 6549 steps in 600s

Base

Built on records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50

Based on SOTA (10L_Int5MLP_MuonWD04_SWA50) with improvements: - QAT with STE for int5/int6 quantization-aware training - BigramHash increased from 10240 to 12288 - Eval stride reduced from 64 to 32 for better context - Magnitude pruning increased from 3% to 5% - SWA every 25 steps instead of 50 - Artifact size: ~15.89MB (under 16MB limit)

Restore original train_gpt.py baseline. Add new records folder with submission script based on 10L_Int5MLP_MuonWD04_SWA50 SOTA. Changes: QAT with STE, BigramHash 12288, eval stride 32, 5% magnitude pruning, SWA every 25 steps.

Port LoRA TTT from records/2026-03-17_LoRA_TTT into our submission. At eval time, per-document rank-8 LoRA adapters are trained on Q/V projections and lm_head, then used for scoring. Expected -0.003 to -0.005 bpb improvement on top of sliding window eval.

val_bpb=1.14443 (seed=2024), artifact=15.90MB

Copilot

Pull request overview

Adds a new /records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32 submission artifact folder capturing a training run and the exact code/config used for a QAT + BigramHash(12K) + stride-32 sliding-window evaluation entry.

Changes:

Adds a full train_gpt.py snapshot implementing QAT (STE fake-quant), BigramHash embeddings, mixed int5/int6 quantization, pruning, SWA, and sliding-window eval.
Adds a train_seed2024.log run log and a short README.md describing the approach/results.
Adds submission.json metadata (reported val_loss/bytes_total/date/author).

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

File	Description
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py	Training + quantization + export + eval script snapshot for this record submission.
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_seed2024.log	Captured training/eval log for seed 2024 and reported final metrics.
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/submission.json	Leaderboard metadata for the submission (name, loss, size, blurb).
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/README.md	Human-readable summary, config highlights, and run command.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-21T15:26:39Z

records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/README.md

+- **BigramHash 12288:** Increased from 10240 to 12288 buckets for better bigram coverage.
+- **Eval stride 32:** Reduced from 64 to 32 for more overlapping context windows during evaluation.
+- **Magnitude pruning 5%:** Increased from 3% to improve compression ratio.
+- **SWA every 25 steps:** More frequent checkpoint averaging during warmdown.


README claims “SWA every 25 steps”, but the actual default in train_gpt.py is swa_every = ... 50 (and the included log shows SWA starting at step 5400, consistent with 50-step cadence). Please either update the README to match the code, or change the default/command/env vars so the run truly uses SWA every 25 steps and regenerate the log/metrics accordingly.

Suggested change

- **SWA every 25 steps:** More frequent checkpoint averaging during warmdown.

- **SWA every 50 steps:** Checkpoint averaging during warmdown.

Copilot · 2026-03-21T15:26:39Z

records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/submission.json

+  "name": "QAT + BigramHash(12288) + Stride 32",
+  "val_loss": 1.14443,
+  "bytes_total": 15902583,
+  "blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",


submission.json blurb states “SWA every 25 steps”, but train_gpt.py defaults to SWA_EVERY=50. For reproducibility, please align the blurb with the actual run configuration (or adjust the code/run to match the blurb and update the reported metrics if they change).

Suggested change

"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",

"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",

Copilot · 2026-03-21T15:26:39Z

records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py

+except ImportError:
+    _COMPRESSOR = "zlib"


The script silently falls back to zlib when zstandard isn’t installed, but the record metadata/README call out “zstd-22” and the reported bytes_total depends on the compressor. For reproducibility, consider failing fast when zstandard is missing (or at least reflecting the fallback clearly in README/submission metadata and the logged size label).

Suggested change

except ImportError:

_COMPRESSOR = "zlib"

except ImportError as exc:

raise RuntimeError(

"The `zstandard` package is required for this script to run reproducibly "

"with the documented 'zstd-22' compression. Please install it with "

"`pip install zstandard` and try again."

) from exc

records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py

…/train_gpt.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…le, EMA, Late QAT, TTT Major rewrite targeting top-5 leaderboard: - 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB - XSA (Exclusive Self-Attention) on last 4 layers - Partial RoPE: 16/64 head dims get position encoding - LN Scale: 1/sqrt(layer+1) dampening on deeper layers - EMA (decay=0.997) replaces SWA - Late QAT: STE int6 enabled only in final 4% of training - TTT: 25-epoch SGD on val data post-quantization - FA3 auto-detection with SDPA fallback - Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)

…y 10 steps - Disable FA3 (SDPA faster for GQA on PyTorch 2.9) - BigramHash 10240 -> 8192 to fit 11L under 16MB - EMA update every 10 steps with adjusted decay to reduce CPU overhead - Simplify attention forward (remove FA3 code path)

Previous run: 16.94MB with BigramHash 8192 + 5% pruning. BigramHash 2048 saves ~0.5MB, 10% pruning improves compression further.

v3 was 16.38MB with BigramHash 2048 + 10% pruning. Removing BigramHash saves ~0.15MB, 15% pruning improves zstd compression.

Fork of unnir's openai#374 (1.1246 BPB) with TTT added: - 11L, XSA4, Partial RoPE 16/64, LN Scale, Tight SWA - Shared VE128, SmearGate, BigramHash 2048 - TTT: 25 epochs SGD on val data post-quantization - Trimmed to 1476 lines (under 1500 limit)

Previous TTT took 7+ min per epoch (uncompiled, single GPU). Now: torch.compile + DDP across 8 GPUs + 3 epochs + batch 64. Should finish in ~2-3 min total.

flash_attn_interface (FA3 Hopper) not available on RunPod. Falls back to flash_attn, then SDPA with GQA support.

fbedev added 6 commits March 21, 2026 20:35

Add submission: QAT + BigramHash 12K + Stride 32

bccc688

Restore original train_gpt.py baseline. Add new records folder with submission script based on 10L_Int5MLP_MuonWD04_SWA50 SOTA. Changes: QAT with STE, BigramHash 12288, eval stride 32, 5% magnitude pruning, SWA every 25 steps.

Remove TTT, bump BigramHash to 13312

db5c5dd

Revert BigramHash to 12288 (13312 over 16MB)

65f54ac

Add training log and update submission with 8xH100 results

32790dd

val_bpb=1.14443 (seed=2024), artifact=15.90MB

Copilot AI review requested due to automatic review settings March 21, 2026 15:22

Copilot started reviewing on behalf of fbedev March 21, 2026 15:23 View session

Copilot AI reviewed Mar 21, 2026

View reviewed changes

fbedev and others added 2 commits March 21, 2026 23:27

Fix SWA description: 50 steps not 25

4bd048c

Update records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32…

a3b1212

…/train_gpt.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

notapplica mentioned this pull request Mar 21, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

Open

fbedev added 7 commits March 22, 2026 10:13

Reduce BigramHash to 2048, increase pruning to 10% to fit under 16MB

943597d

Previous run: 16.94MB with BigramHash 8192 + 5% pruning. BigramHash 2048 saves ~0.5MB, 10% pruning improves compression further.

Remove BigramHash, increase pruning to 15% — must fit under 16MB

308ed62

v3 was 16.38MB with BigramHash 2048 + 10% pruning. Removing BigramHash saves ~0.15MB, 15% pruning improves zstd compression.

Fix TTT: compile + DDP + 3 epochs + batch 64 for speed

4c37972

Previous TTT took 7+ min per epoch (uncompiled, single GPU). Now: torch.compile + DDP across 8 GPUs + 3 epochs + batch 64. Should finish in ~2-3 min total.

Fix FA3 import: add fallback to flash_attn and SDPA

e83a277

flash_attn_interface (FA3 Hopper) not available on RunPod. Falls back to flash_attn, then SDPA with GQA support.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Submission/qat bigram12k stride32#348

Submission/qat bigram12k stride32#348
fbedev wants to merge 15 commits intoopenai:mainfrom
fbedev:submission/qat-bigram12k-stride32

fbedev commented Mar 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Copilot AI Mar 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	- SWA every 25 steps: More frequent checkpoint averaging during warmdown.
	- SWA every 50 steps: Checkpoint averaging during warmdown.

	"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
	"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",

-except ImportError:
-    _COMPRESSOR = "zlib"
+except ImportError as exc:
+    raise RuntimeError(
+        "The `zstandard` package is required for this script to run reproducibly "
+        "with the documented 'zstd-22' compression. Please install it with "
+        "`pip install zstandard` and try again."
+    ) from exc

Conversation

fbedev commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results

Base

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fbedev commented Mar 21, 2026 •

edited

Loading