Non-record: MLX prototyping harness with validated technique stack (val_bpb=1.9588, Mac)#328

Open
kingjulio8238 wants to merge 9 commits into openai:main from kingjulio8238:main
Conversation

@kingjulio8238 kingjulio8238 commented Mar 21, 2026

Summary

Non-record submission — Mac MLX prototyping only, pending H100 validation.
Submitting to document systematic technique exploration and support compute grant application.

val_bpb: 1.9588 (14L×416d, 750 steps, 10 shards, int8+zlib, full FineWeb val, Apple Silicon)
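For context, bits per byte is just summed cross-entropy converted from nats to bits and normalized by the byte count of the validation text. A minimal sketch (the function name and numbers are illustrative, not from the harness):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy (in nats) over a validation set
    into bits per byte of the underlying UTF-8 text."""
    return total_nll_nats / math.log(2) / total_bytes

# Sanity check: a total NLL of 8*ln(2) nats over 1 byte is exactly 8 bits/byte,
# i.e. the model has learned nothing beyond raw bytes.
print(bits_per_byte(math.log(2) * 8, 1))  # → 8.0
```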

Approach

25+ MLX experiments validating leaderboard techniques, identifying what works and what doesn't.

Implemented & Validated

  • 10–14-layer architectures with KV2 (2 KV heads, grouped-query attention)
  • MLP 3x expansion (-0.013 BPB vs MLP 2x)
  • Int6 per-row quantization + FP16 tied embedding passthrough
  • Sliding window eval (stride-64, compiled forward)
  • Muon decoupled weight decay (0.02)
  • Overtone spectral embedding init (SVD power-law)
  • Phase-transition resid_mix initialization
  • Multi-eval mode (multiple eval configs per training run)
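The per-row quantization item above is the core of the compression story: each weight row gets its own scale so one outlier row can't inflate the error of the whole matrix. A minimal NumPy sketch of the idea (the MLX harness differs in details; function names here are illustrative):

```python
import numpy as np

def quantize_per_row(w: np.ndarray, bits: int = 8):
    """Symmetric per-row quantization: one scale per row, stored in FP16.
    For bits=6 the values live in [-32, 31] but are stored in an int8 container."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(size=(416, 416)).astype(np.float32)
q, s = quantize_per_row(w, bits=8)
err = np.abs(dequantize(q, s) - w).max()            # bounded by ~half a quant step
```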

Key Finding

FP16 embed + Muon WD achieves near-zero quantization gap (0.001 BPB). Post-quant ≈ pre-quant.
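The passthrough half of this finding is simple to state: on export, every weight tensor is quantized except the tied token embedding, which is stored in FP16 so the output logits see no quantization error. A sketch of the export-side selection, assuming a flat name→array parameter dict (the parameter names are hypothetical, not the harness's actual keys):

```python
import numpy as np

def export_params(params: dict, quantize_fn, keep_fp16=("embed.weight",)):
    """Quantize all weights except the tied embedding, which passes
    through as FP16 (the 'FP16 tied embedding passthrough')."""
    out = {}
    for name, w in params.items():
        if name in keep_fp16:
            out[name] = w.astype(np.float16)   # passthrough, no quant error
        else:
            out[name] = quantize_fn(w)          # e.g. per-row int8/int6
    return out
```

The quantizer is pluggable here so the same export path covers the int8 and int6 configurations.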

Negative Results (documented)

| Technique | Result | Why |
| --- | --- | --- |
| Depth recurrence + int6 | +0.61 BPB quant gap | Shared weights amplify quantization error |
| DenseFormer DWA | +0.003 BPB regression | No benefit at this scale |
| Eval-time loop scaling | +1.15 BPB regression | Model calibrated to exact loop count |
| NTK-RoPE extrapolation (1024→4096) | +0.06 BPB regression | Must train at target seq_len |

Results

| Config | Params | Compressed | Val BPB | Int8 BPB |
| --- | --- | --- | --- | --- |
| 14L×416d KV2 | 16.2M | 12.3 MB | 1.9578 | 1.9588 |
| 10L×512d + all tricks | 19.0M | 10.8 MB | 1.9800 | 1.9808 |
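The "Compressed" column and the roundtrip item in the test plan below amount to: serialize the quantized tensors, zlib-compress, decompress, and compare bit-exactly. A minimal sketch of that check:

```python
import zlib
import numpy as np

def pack(q: np.ndarray) -> bytes:
    """zlib-compress a quantized tensor's raw bytes at max compression."""
    return zlib.compress(q.tobytes(), level=9)

def unpack(blob: bytes, shape, dtype=np.int8) -> np.ndarray:
    return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

rng = np.random.default_rng(0)
q = rng.integers(-128, 128, size=(416, 416), dtype=np.int8)
blob = pack(q)
assert np.array_equal(unpack(blob, q.shape), q)    # lossless roundtrip
size_mb = len(blob) / 2**20                        # summed over all tensors, must stay < 16 MB
```

zlib itself is lossless, so all of the BPB gap in the table comes from quantization, not compression.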

Test plan

  • train_gpt_mlx.py compiles and runs from the records folder
  • Full FineWeb val evaluation (62M tokens)
  • Int8+zlib roundtrip verified
  • Artifact under 16MB (12.3MB)
  • H100 validation (pending compute grant)

kingjulio8238 and others added 4 commits March 18, 2026 22:02
CLAUDE.md: agent working practices (plan mode, subagent strategy, iteration loop, verification, context efficiency)
docs/PLAN.md: full submission strategy — depth recurrence, QAT, test-time compute exploits with phased execution and compute budget

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds NUM_UNIQUE_BLOCKS/NUM_LOOPS hyperparameters for block sharing.
When enabled, loops a smaller set of shared blocks multiple times
instead of using the U-Net encoder/decoder skip architecture.
Baseline mode is fully preserved when disabled (default).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MLP_MULT=3 support (wider MLP, -0.013 BPB)
- Int6 per-row quantization (QUANT_BITS=6, saves ~4MB)
- FP16 tied embedding passthrough (FP16_EMBED=1)
- Sliding window eval with compiled NTK-RoPE (EVAL_STRIDE, EVAL_SEQ_LEN)
- Muon decoupled weight decay (MUON_WEIGHT_DECAY)
- Overtone spectral embedding init (OVERTONE_INIT)
- Phase-transition resid_mix init (PHASE_RESID_MIX)
- Extra eval loops support (EVAL_NUM_LOOPS)
- Multi-eval mode (EVAL_CONFIGS for testing multiple configs per run)
- VAL_MAX_TOKENS for fast directional experiments
- Compiled forward for eval (compiled_forward)

Validated on Mac: near-zero quant gap (0.0002 BPB) with FP16 embed +
Muon WD. All leaderboard openai#1 techniques implemented and tested.
Depth recurrence explored and rejected (int6 quant gap too large).

1260 lines, under 1500 limit. All new features default-disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
25+ MLX experiments validating leaderboard techniques on Mac.
14L×416d, 750 steps, 10 shards, full FineWeb val, int8+zlib.

Key finding: FP16 embed + Muon WD gives near-zero quant gap (0.001 BPB).
Negative results documented: depth recurrence + int6, DWA, eval-time loops.
Supporting compute grant application for H100 validation.

4-phase plan based on analysis of top 5 leaderboard submissions.
Covers implementation needs, compute budget, and key techniques.
SmearGate, BigramHash, sliding window eval, Muon WD, OrthoInit,
Overtone init, phase resid_mix, int5/int6 quant, QAT with STE,
zstd-22 compression, FP16 embed passthrough, SWA.

1333 lines (under 1500). All features default-disabled (backward compatible).
Ready to run on H100 when compute credits arrive.

- Fix SWA: skip non-float tensors in averaging loop
- Fix quantization: guard behind master_process to save memory
- Update H100_PLAN: new SOTA 1.1254, add TTT/gradient-guided quant/LN Scale
- MLX sweep: WD=0.02 + LR=0.02 is best (-0.005 BPB), init tricks hurt at short training
