Non-record: MLX-Optimized 12L 416d with SmearGate + BigramHash (val_bpb=1.9011, Mac)#342

Open
adhyaay-karnwal wants to merge 2 commits into openai:main from adhyaay-karnwal:main

Conversation


adhyaay-karnwal commented on Mar 21, 2026

Summary

  • Non-record submission for Parameter Golf challenge
  • 12-layer model (6 encoder + 6 decoder) with SmearGate, BigramHash(4096), FP16 embeddings
  • MLP 3x expansion, Muon optimizer with weight decay
  • Trained on a MacBook with an Apple Silicon M4 Pro chip using the MLX framework
  • Result: val_bpb = 1.9011 (500 iterations)

Key Techniques

  1. SmearGate: Learned gating mechanism that blends each token's embedding with the previous token's
  2. BigramHash: Hashes consecutive token pairs into a 4096-bucket embedding table
  3. FP16 Embeddings: Near-zero quantization gap when trained with Muon weight decay
  4. MLP 3x Expansion: relu^2 activation
  5. U-Net Skip Connections: Decoder layers receive skip connections from the mirror-image encoder layers
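
The two embedding-level techniques (1 and 2) can be sketched together. This is a minimal NumPy illustration, not the PR's MLX code; the vocabulary size, hash multiplier, and gate parameterization are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 416        # model dimension from the PR
n_buckets = 4096     # BigramHash table size from the PR
vocab_size = 50257   # assumption: vocabulary size is not stated in the PR

tok_emb = rng.standard_normal((vocab_size, d_model)).astype(np.float16)
bigram_emb = rng.standard_normal((n_buckets, d_model)).astype(np.float16)
gate_logits = np.zeros(d_model)  # per-channel SmearGate parameters (learned in practice)

def embed(tokens):
    """Token embeddings + SmearGate blend of the previous token + BigramHash bonus."""
    x = tok_emb[tokens].astype(np.float32)                 # (T, d)

    # SmearGate: a sigmoid gate blends in the previous position's embedding
    prev = np.vstack([np.zeros((1, d_model), np.float32), x[:-1]])
    gate = 1.0 / (1.0 + np.exp(-gate_logits))              # (d,)
    x = x + gate * prev

    # BigramHash: hash each (prev_token, token) pair into a small table
    prev_tok = np.concatenate([[0], tokens[:-1]])
    bucket = (prev_tok * 1_000_003 + tokens) % n_buckets   # hash constant is illustrative
    return x + bigram_emb[bucket].astype(np.float32)

out = embed(np.array([17, 5, 9, 5, 9]))   # (5, 416)
```

Because the hash depends only on the token pair, repeated bigrams (here the two `(5, 9)` pairs) hit the same bucket and receive the same bonus vector.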

Architecture

  • 12 layers (6 encoder + 6 decoder)
  • 416 dim, 8 heads, 4 KV heads (GQA)
  • MLP 3x expansion (hidden=1248)
  • Tied embeddings
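
A rough parameter count follows from these hyperparameters. The sketch below assumes no biases and a GPT-2-style vocabulary of 50257 (the actual vocabulary is not stated in the PR), so the total is illustrative:

```python
# Hyperparameters as listed in the PR
d, n_layers, n_heads, n_kv_heads = 416, 12, 8, 4
head_dim = d // n_heads              # 52
kv_dim = n_kv_heads * head_dim       # 208 (GQA: K/V are narrower than Q)
hidden = 3 * d                       # 1248, matching the stated MLP width

attn = d * d + 2 * d * kv_dim + d * d   # Wq, Wk, Wv, Wo (biases assumed absent)
mlp = 2 * d * hidden                     # up- and down-projection
per_layer = attn + mlp                   # 1,557,504 per transformer block

vocab_size = 50257                       # assumption: not stated in the PR
emb = vocab_size * d                     # counted once, since embeddings are tied
total = n_layers * per_layer + emb
print(f"~{total / 1e6:.1f}M parameters")
```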

Training Details

  • Device: Apple Silicon M4 Pro (24GB unified memory)
  • Framework: MLX 0.31.1
  • Training tokens: ~16M (500 iters × 32K batch)
  • Tokens/sec: ~20,000-24,000
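
The token count follows directly from the reported settings, interpreting "32K batch" as 32,768 tokens per step (an assumption):

```python
iters = 500
tokens_per_step = 32_768     # assumption: "32K batch" read as tokens per step
total_tokens = iters * tokens_per_step         # 16,384,000, i.e. the ~16M in the PR

tok_per_sec = 22_000         # midpoint of the reported 20,000-24,000 range
est_minutes = total_tokens / tok_per_sec / 60  # roughly a 12-minute run
```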

Notes

This model is undertrained, reflecting the compute limits of a MacBook. The same architecture trained for 3000+ iterations on H100s should achieve a significantly better BPB (potentially 1.5-1.6, extrapolating from the findings above). This submission demonstrates effective MLX optimization techniques and serves as a foundation for further H100 training.

Files

  • records/track_non_record_16mb/2026-03-21_MLX_Optimized_12L_416d_SmearGate_BigramHash/README.md - Detailed explanation
  • records/track_non_record_16mb/2026-03-21_MLX_Optimized_12L_416d_SmearGate_BigramHash/submission.json - Metadata
  • records/track_non_record_16mb/2026-03-21_MLX_Optimized_12L_416d_SmearGate_BigramHash/train_gpt_mlx.py - MLX training script
  • records/track_non_record_16mb/2026-03-21_MLX_Optimized_12L_416d_SmearGate_BigramHash/train.log - Training log

…Hash

Non-record submission for OpenAI Parameter Golf challenge.
Trained on a MacBook with an Apple Silicon M4 Pro chip using the MLX framework.

Key techniques:
- 12 layers (6 encoder + 6 decoder)
- 416 model dimension, MLP 3x expansion
- SmearGate for local context
- BigramHash with 4096 buckets
- FP16 embeddings with Muon optimizer + weight decay
- U-Net skip connections

Result: val_bpb = 1.9011 (500 iterations, undertrained)

This submission demonstrates effective MLX optimization techniques
and serves as a foundation for further H100 training.
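
The U-Net skip connections listed above pair each decoder layer with its mirror-image encoder layer. A minimal sketch, using random linear maps as stand-ins for the real attention + MLP blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 416, 6   # 6 encoder + 6 decoder layers, as in the PR

# Stand-in layers: random linear maps (the real blocks are attention + MLP)
enc = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]
dec = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

def forward(x):
    skips = []
    for w in enc:                # encoder half: record each layer's output
        x = x + np.tanh(x @ w)
        skips.append(x)
    for w in dec:                # decoder half: pop in reverse, so the last
        x = x + skips.pop()      # encoder output feeds the first decoder layer
        x = x + np.tanh(x @ w)
    return x

y = forward(rng.standard_normal((8, d)))   # (8, 416)
```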
adhyaay-karnwal changed the title from "Non-record: MLX-Optimized 12L 416d with SmearGate + BigramHash" to "Non-record: MLX-Optimized 12L 416d with SmearGate + BigramHash (val_bpb=1.9011, Mac)" on Mar 21, 2026
…techniques

- train_sota.py: New script with BigramHash(10240), WD=0.04, SWA
- train_optimized.py: Updated with faster validation
- train_breakthrough.py, train_breakthrough_v3.py: Experimental versions
- New submission folder with README and submission.json

Key improvements from research:
- BigramHash(10240): 2.5x larger than previous 4096
- SWA with start_frac=0.4: Optimal per openai#1 submission
- Muon WD=0.04: Higher than previous 0.02
- SmearGate: Proven technique from top submissions
- MLP 3x expansion: relu^2 activation
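
The relu^2 (squared ReLU) MLP from the last point can be sketched in a few lines of NumPy; the dimensions follow the PR, while the initialization scale and the absence of biases are assumptions:

```python
import numpy as np

def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

d, hidden = 416, 1248    # 3x expansion, as in the PR
rng = np.random.default_rng(0)
w_up = rng.standard_normal((d, hidden)) / np.sqrt(d)        # init scale is an assumption
w_down = rng.standard_normal((hidden, d)) / np.sqrt(hidden)

def mlp(x):
    return relu2(x @ w_up) @ w_down   # no biases (assumed)

y = mlp(rng.standard_normal((4, d)))  # (4, 416)
```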
