Non-record: MLX prototyping harness with validated technique stack (val_bpb=1.9588, Mac)#328

Open
kingjulio8238 wants to merge 9 commits into openai:main from kingjulio8238:main
Conversation

@kingjulio8238 kingjulio8238 commented Mar 21, 2026

Summary

Non-record submission — Mac MLX prototyping only, pending H100 validation.
Submitting to document systematic technique exploration and support compute grant application.

val_bpb: 1.9588 (14L×416d, 750 steps, 10 shards, int8+zlib, full FineWeb val, Apple Silicon)
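For context, bits per byte is just summed cross-entropy converted from nats to bits and normalized by the byte count of the validation text. A minimal sketch (the function name and numbers are illustrative, not from the harness):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) over a validation set
    into bits per byte of the underlying UTF-8 text."""
    return total_nll_nats / math.log(2) / total_bytes

# Sanity check: a total NLL of 8*ln(2) nats over 1 byte is exactly 8 bits/byte,
# i.e. the model has learned nothing beyond raw bytes.
print(bits_per_byte(math.log(2) * 8, 1))  # → 8.0
```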

Approach

25+ MLX experiments validating leaderboard techniques, identifying what works and what doesn't.

Implemented & Validated

  • 10–14-layer architectures with KV2 (2 KV heads, grouped-query attention)
  • MLP 3x expansion (-0.013 BPB vs MLP 2x)
  • Int6 per-row quantization + FP16 tied embedding passthrough
  • Sliding window eval (stride-64, compiled forward)
  • Muon decoupled weight decay (0.02)
  • Overtone spectral embedding init (SVD power-law)
  • Phase-transition resid_mix initialization
  • Multi-eval mode (multiple eval configs per training run)
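The per-row quantization item above is the core of the compression story: each weight row gets its own scale so one outlier row can't inflate the error of the whole matrix. A minimal NumPy sketch of the idea (the MLX harness differs in details; function names here are illustrative):

```python
import numpy as np

def quantize_per_row(w: np.ndarray, bits: int = 8):
    """Symmetric per-row quantization: one scale per row, stored in FP16.
    For bits=6 the values live in [-32, 31] but are stored in an int8 container."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(416, 416)).astype(np.float32)
q, s = quantize_per_row(w, bits=8)
err = np.abs(dequantize(q, s) - w).max()            # bounded by ~half a quant step
```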

Key Finding

FP16 embed + Muon WD achieves near-zero quantization gap (0.001 BPB). Post-quant ≈ pre-quant.
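The passthrough half of this finding is simple to state: on export, every weight tensor is quantized except the tied token embedding, which is stored in FP16 so the output logits see no quantization error. A sketch of the export-side selection, assuming a flat name→array parameter dict (the parameter names are hypothetical, not the harness's actual keys):

```python
import numpy as np

def export_params(params: dict, quantize_fn, keep_fp16=("embed.weight",)):
    """Quantize all weights except the tied embedding, which passes
    through as FP16 (the 'FP16 tied embedding passthrough')."""
    out = {}
    for name, w in params.items():
        if name in keep_fp16:
            out[name] = w.astype(np.float16)   # passthrough, no quant error
        else:
            out[name] = quantize_fn(w)          # e.g. per-row int8/int6
    return out
```

The quantizer is pluggable here so the same export path covers the int8 and int6 configurations.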

Negative Results (documented)

| Technique | Result | Why |
| --- | --- | --- |
| Depth recurrence + int6 | +0.61 BPB quant gap | Shared weights amplify quantization error |
| DenseFormer DWA | +0.003 BPB regression | No benefit at this scale |
| Eval-time loop scaling | +1.15 BPB regression | Model calibrated to exact loop count |
| NTK-RoPE extrapolation (1024→4096) | +0.06 BPB regression | Must train at target seq_len |

Results

| Config | Params | Compressed | Val BPB | Int8 BPB |
| --- | --- | --- | --- | --- |
| 14L×416d KV2 | 16.2M | 12.3 MB | 1.9578 | 1.9588 |
| 10L×512d + all tricks | 19.0M | 10.8 MB | 1.9800 | 1.9808 |
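The "Compressed" column and the roundtrip item in the test plan below amount to: serialize the quantized tensors, zlib-compress, decompress, and compare bit-exactly. A minimal sketch of that check:

```python
import zlib
import numpy as np

def pack(q: np.ndarray) -> bytes:
    """zlib-compress a quantized tensor's raw bytes at max compression."""
    return zlib.compress(q.tobytes(), level=9)

def unpack(blob: bytes, shape, dtype=np.int8) -> np.ndarray:
    return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

rng = np.random.default_rng(0)
q = rng.integers(-128, 128, size=(416, 416), dtype=np.int8)
blob = pack(q)
assert np.array_equal(unpack(blob, q.shape), q)    # lossless roundtrip
size_mb = len(blob) / 2**20                        # summed over all tensors, must stay < 16 MB
```

zlib itself is lossless, so all of the BPB gap in the table comes from quantization, not compression.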

Test plan

  • train_gpt_mlx.py compiles and runs from the records folder
  • Full FineWeb val evaluation (62M tokens)
  • Int8+zlib roundtrip verified
  • Artifact under 16MB (12.3MB)
  • H100 validation (pending compute grant)

kingjulio8238 and others added 4 commits March 18, 2026 22:02
CLAUDE.md: agent working practices (plan mode, subagent strategy, iteration loop, verification, context efficiency)
docs/PLAN.md: full submission strategy — depth recurrence, QAT, test-time compute exploits with phased execution and compute budget

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds NUM_UNIQUE_BLOCKS/NUM_LOOPS hyperparameters for block sharing.
When enabled, loops a smaller set of shared blocks multiple times
instead of using the U-Net encoder/decoder skip architecture.
Baseline mode is fully preserved when disabled (default).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MLP_MULT=3 support (wider MLP, -0.013 BPB)
- Int6 per-row quantization (QUANT_BITS=6, saves ~4MB)
- FP16 tied embedding passthrough (FP16_EMBED=1)
- Sliding window eval with compiled NTK-RoPE (EVAL_STRIDE, EVAL_SEQ_LEN)
- Muon decoupled weight decay (MUON_WEIGHT_DECAY)
- Overtone spectral embedding init (OVERTONE_INIT)
- Phase-transition resid_mix init (PHASE_RESID_MIX)
- Extra eval loops support (EVAL_NUM_LOOPS)
- Multi-eval mode (EVAL_CONFIGS for testing multiple configs per run)
- VAL_MAX_TOKENS for fast directional experiments
- Compiled forward for eval (compiled_forward)

Validated on Mac: near-zero quant gap (0.0002 BPB) with FP16 embed +
Muon WD. All leaderboard openai#1 techniques implemented and tested.
Depth recurrence explored and rejected (int6 quant gap too large).

1260 lines, under 1500 limit. All new features default-disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
25+ MLX experiments validating leaderboard techniques on Mac.
14L×416d, 750 steps, 10 shards, full FineWeb val, int8+zlib.

Key finding: FP16 embed + Muon WD gives near-zero quant gap (0.001 BPB).
Negative results documented: depth recurrence + int6, DWA, eval-time loops.
Supporting compute grant application for H100 validation.

4-phase plan based on analysis of top 5 leaderboard submissions.
Covers implementation needs, compute budget, and key techniques.
SmearGate, BigramHash, sliding window eval, Muon WD, OrthoInit,
Overtone init, phase resid_mix, int5/int6 quant, QAT with STE,
zstd-22 compression, FP16 embed passthrough, SWA.

1333 lines (under 1500). All features default-disabled (backward compatible).
Ready to run on H100 when compute credits arrive.

- Fix SWA: skip non-float tensors in averaging loop
- Fix quantization: guard behind master_process to save memory
- Update H100_PLAN: new SOTA 1.1254, add TTT/gradient-guided quant/LN Scale
- MLX sweep: WD=0.02 + LR=0.02 is best (-0.005 BPB), init tricks hurt at short training
