LongContext 4096 + Full SOTA Stack & QAT Int4 → 16 Layers #347
Draft
FlashyFlash3011 wants to merge 1 commit into openai:main from
Conversation
Two new submissions targeting sub-1.1698 BPB:

1. 2026-03-21_LongContext4096_FullStack
   - 4096-token training context + full modern SOTA stack
   - Sliding window eval stride=256 (3840 context tokens per position)
   - Same eval cost as SOTA: 64×4096 = 256×1024 tokens per batch
   - NTK-aware RoPE base=40000, re-tuned LRs/momentum for 4096 context

2. 2026-03-21_QAT_Int4_16L
   - Int4 nibble-packing enables 16 transformer layers in 16MB budget
   - QAT with straight-through estimator activates at 15% of training
   - All SOTA techniques carried forward (Muon WD, FP16 embed, Overtone init)
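The sliding-window eval arithmetic above can be sketched in a few lines (illustrative only; the identifiers are mine, not the submission's):

```python
# Sliding-window evaluation: the 4096-token window advances by `stride`
# tokens and only the last `stride` positions of each window are scored,
# so every scored token conditions on at least SEQ_LEN - STRIDE
# preceding tokens.
SEQ_LEN, STRIDE = 4096, 256
min_context = SEQ_LEN - STRIDE
assert min_context == 3840  # "3840 context tokens per position"

# Eval cost is unchanged relative to the 1024-context SOTA config:
assert 64 * SEQ_LEN == 256 * 1024  # tokens per eval batch
```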
Submissions
Two experiments targeting sub-1.1698 BPB. Scripts are complete and smoke-tested. Full 3-seed H100 runs are in progress (compute grant pending); results and train logs will be added before final review.
1. LongContext 4096 + Full SOTA Stack
Folder: records/track_10min_16mb/2026-03-21_LongContext4096_FullStack/

The 4096-seq training record (1.2014 BPB) was submitted before sliding window eval, FP16 embeddings, Muon WD, or Overtone init existed. This combines all of those with long-context training:

- seq_len=4096, eval with sliding window stride=256
- 64 seqs × 4096 = 256 seqs × 1024 tokens per batch
- NTK-aware RoPE base=40000 (= 10000 × 4096/1024)
- Matrix LR (matrix=0.025) and Muon momentum (0.98) re-tuned for 4096 context
- Expected: ~1.14–1.16 BPB
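The RoPE base scaling above can be sketched as follows, assuming a hypothetical head_dim of 64 (the submission does not state its head dimension):

```python
def rope_inv_freq(head_dim, base):
    """Per-pair inverse frequencies for rotary position embeddings."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

# Linear base scaling with the context ratio, as in the submission:
# 10000 * 4096 / 1024 = 40000.
OLD_BASE, OLD_CTX, NEW_CTX = 10000, 1024, 4096
new_base = OLD_BASE * NEW_CTX // OLD_CTX
assert new_base == 40000

# A larger base lowers the rotation frequencies, so the angles swept
# over 4096 positions stay closer to the range swept over 1024.
freqs = rope_inv_freq(64, new_base)
assert freqs[0] == 1.0 and freqs[-1] < 1.0
```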
2. QAT Int4 → 16 Layers
Folder: records/track_10min_16mb/2026-03-21_QAT_Int4_16L/

Int4 nibble-packing (2 weights/byte) fits 16 transformer layers in the same 16MB budget as SOTA's 10, a 60% parameter increase.
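The packing claim can be illustrated with a small sketch (function names are hypothetical, and the low-nibble-first layout is an assumption, not necessarily the submission's):

```python
def pack_int4(codes):
    """Pack pairs of 4-bit codes (0..15) into bytes, two weights per
    byte (low nibble first -- an assumed layout)."""
    assert len(codes) % 2 == 0
    return bytes((lo & 0xF) | ((hi & 0xF) << 4)
                 for lo, hi in zip(codes[0::2], codes[1::2]))

def unpack_int4(packed):
    """Inverse of pack_int4."""
    out = []
    for b in packed:
        out.extend((b & 0xF, b >> 4))
    return out

codes = [3, 12, 0, 15, 7, 8]
packed = pack_int4(codes)
assert len(packed) == len(codes) // 2  # 2 weights/byte
assert unpack_int4(packed) == codes

# Same storage budget, more layers: 16 vs SOTA's 10 is a 60% increase.
assert 16 / 10 == 1.6
```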
- matrix_lr=0.030, Muon momentum 0.97, grad_clip=1.0
- Expected: ~1.14–1.16 BPB
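A minimal sketch of the QAT forward pass with a straight-through estimator (the scale value and function name are illustrative, not the submission's):

```python
def fake_quant_int4(w, scale):
    """QAT forward pass: round a weight to the signed int4 grid (-8..7)
    and dequantize. During training the backward pass applies the
    straight-through estimator: round() is treated as identity, so the
    gradient w.r.t. w passes through the quantizer unchanged. Per the
    PR description, this quantizer activates at 15% of training."""
    q = max(-8, min(7, round(w / scale)))
    return q * scale

# In range: snapped to the nearest grid point.
assert abs(fake_quant_int4(0.33, 0.1) - 0.3) < 1e-9
# Out of range: clipped to the largest representable value, 7 * scale.
assert abs(fake_quant_int4(2.0, 0.1) - 0.7) < 1e-9
```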
Status