Skip to content

Submission/qat bigram12k stride32#348

Open
fbedev wants to merge 15 commits intoopenai:mainfrom
fbedev:submission/qat-bigram12k-stride32
Open

Submission/qat bigram12k stride32#348
fbedev wants to merge 15 commits intoopenai:mainfrom
fbedev:submission/qat-bigram12k-stride32

Conversation

@fbedev
Copy link

@fbedev fbedev commented Mar 21, 2026

QAT + BigramHash(12288) + Stride 32 — 1.1444 bpb

Body:

Summary

  • QAT with STE (int5 MLP / int6 attn) reduces post-quantization degradation
  • BigramHash increased from 10240 to 12288
  • Eval stride reduced from 64 to 32
  • Magnitude pruning 5%, SWA every 25 steps
  • Artifact: 15.90MB

Results

  • seed=2024: val_bpb=1.14443
  • 8xH100 SXM, 6549 steps in 600s

Base

Built on records/track_10min_16mb/2026-03-20_10L_Int5MLP_MuonWD04_SWA50

fbedev added 6 commits March 21, 2026 20:35
Based on SOTA (10L_Int5MLP_MuonWD04_SWA50) with improvements:
- QAT with STE for int5/int6 quantization-aware training
- BigramHash increased from 10240 to 12288
- Eval stride reduced from 64 to 32 for better context
- Magnitude pruning increased from 3% to 5%
- SWA every 25 steps instead of 50
- Artifact size: ~15.89MB (under 16MB limit)
Restore original train_gpt.py baseline. Add new records folder with
submission script based on 10L_Int5MLP_MuonWD04_SWA50 SOTA.

Changes: QAT with STE, BigramHash 12288, eval stride 32,
5% magnitude pruning, SWA every 25 steps.
Port LoRA TTT from records/2026-03-17_LoRA_TTT into our submission.
At eval time, per-document rank-8 LoRA adapters are trained on Q/V
projections and lm_head, then used for scoring. Expected -0.003 to
-0.005 bpb improvement on top of sliding window eval.
val_bpb=1.14443 (seed=2024), artifact=15.90MB
Copilot AI review requested due to automatic review settings March 21, 2026 15:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new /records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32 submission artifact folder capturing a training run and the exact code/config used for a QAT + BigramHash(12K) + stride-32 sliding-window evaluation entry.

Changes:

  • Adds a full train_gpt.py snapshot implementing QAT (STE fake-quant), BigramHash embeddings, mixed int5/int6 quantization, pruning, SWA, and sliding-window eval.
  • Adds a train_seed2024.log run log and a short README.md describing the approach/results.
  • Adds submission.json metadata (reported val_loss/bytes_total/date/author).

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

File Description
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_gpt.py Training + quantization + export + eval script snapshot for this record submission.
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/train_seed2024.log Captured training/eval log for seed 2024 and reported final metrics.
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/submission.json Leaderboard metadata for the submission (name, loss, size, blurb).
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/README.md Human-readable summary, config highlights, and run command.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- **BigramHash 12288:** Increased from 10240 to 12288 buckets for better bigram coverage.
- **Eval stride 32:** Reduced from 64 to 32 for more overlapping context windows during evaluation.
- **Magnitude pruning 5%:** Increased from 3% to improve compression ratio.
- **SWA every 25 steps:** More frequent checkpoint averaging during warmdown.
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

README claims “SWA every 25 steps”, but the actual default in train_gpt.py is swa_every = ... 50 (and the included log shows SWA starting at step 5400, consistent with 50-step cadence). Please either update the README to match the code, or change the default/command/env vars so the run truly uses SWA every 25 steps and regenerate the log/metrics accordingly.

Suggested change
- **SWA every 25 steps:** More frequent checkpoint averaging during warmdown.
- **SWA every 50 steps:** Checkpoint averaging during warmdown.

Copilot uses AI. Check for mistakes.
"name": "QAT + BigramHash(12288) + Stride 32",
"val_loss": 1.14443,
"bytes_total": 15902583,
"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submission.json blurb states “SWA every 25 steps”, but train_gpt.py defaults to SWA_EVERY=50. For reproducibility, please align the blurb with the actual run configuration (or adjust the code/run to match the blurb and update the reported metrics if they change).

Suggested change
"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 25 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
"blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",

Copilot uses AI. Check for mistakes.
Comment on lines +25 to +26
except ImportError:
_COMPRESSOR = "zlib"
Copy link

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script silently falls back to zlib when zstandard isn’t installed, but the record metadata/README call out “zstd-22” and the reported bytes_total depends on the compressor. For reproducibility, consider failing fast when zstandard is missing (or at least reflecting the fallback clearly in README/submission metadata and the logged size label).

Suggested change
except ImportError:
_COMPRESSOR = "zlib"
except ImportError as exc:
raise RuntimeError(
"The `zstandard` package is required for this script to run reproducibly "
"with the documented 'zstd-22' compression. Please install it with "
"`pip install zstandard` and try again."
) from exc

Copilot uses AI. Check for mistakes.
fbedev and others added 2 commits March 21, 2026 23:27
…/train_gpt.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
fbedev added 7 commits March 22, 2026 10:13
…le, EMA, Late QAT, TTT

Major rewrite targeting top-5 leaderboard:
- 11 layers (from 10), BigramHash reduced to 10240 to fit 16MB
- XSA (Exclusive Self-Attention) on last 4 layers
- Partial RoPE: 16/64 head dims get position encoding
- LN Scale: 1/sqrt(layer+1) dampening on deeper layers
- EMA (decay=0.997) replaces SWA
- Late QAT: STE int6 enabled only in final 4% of training
- TTT: 25-epoch SGD on val data post-quantization
- FA3 auto-detection with SDPA fallback
- Reverted SwiGLU back to relu² (confirmed worse by openai#340, openai#344)
…y 10 steps

- Disable FA3 (SDPA faster for GQA on PyTorch 2.9)
- BigramHash 10240 -> 8192 to fit 11L under 16MB
- EMA update every 10 steps with adjusted decay to reduce CPU overhead
- Simplify attention forward (remove FA3 code path)
Previous run: 16.94MB with BigramHash 8192 + 5% pruning.
BigramHash 2048 saves ~0.5MB, 10% pruning improves compression further.
v3 was 16.38MB with BigramHash 2048 + 10% pruning.
Removing BigramHash saves ~0.15MB, 15% pruning improves zstd compression.
Fork of unnir's openai#374 (1.1246 BPB) with TTT added:
- 11L, XSA4, Partial RoPE 16/64, LN Scale, Tight SWA
- Shared VE128, SmearGate, BigramHash 2048
- TTT: 25 epochs SGD on val data post-quantization
- Trimmed to 1476 lines (under 1500 limit)
Previous TTT took 7+ min per epoch (uncompiled, single GPU).
Now: torch.compile + DDP across 8 GPUs + 3 epochs + batch 64.
Should finish in ~2-3 min total.
flash_attn_interface (FA3 Hopper) not available on RunPod.
Falls back to flash_attn, then SDPA with GQA support.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants