
Non-record: Autoresearch Heads4 + Step-based LR + Sliding Window (1xH100) #344

Open
aryanbhosale wants to merge 1 commit into openai:main from aryanbhosale:submission/autoresearch-heads4

Conversation

@aryanbhosale

Summary

Non-record submission exploring automated architecture search for Parameter Golf.

Built on the current SOTA (10L, int5/int6, BigramHash, SmearGate, SWA) with 75+ automated experiments across Mac MLX and 1xH100 CUDA, driven by an autoresearch loop inspired by Karpathy's methodology.
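
The autoresearch loop above can be sketched as a greedy mutate-and-keep search over training configs. This is illustrative only: the PR does not describe the loop's internals, and `mutations`, `run_experiment`, and the greedy acceptance rule are all assumptions (`run_experiment` is taken to return validation BPB, lower is better).

```python
import random

def autoresearch_loop(baseline_cfg, mutations, run_experiment, budget=75):
    """Greedy mutate-and-keep search over training configs.

    Hypothetical sketch: `mutations` is a list of (key, value) config
    tweaks to try; `run_experiment` trains/evals a config and returns
    val BPB (lower is better). Keeps a mutation only if it improves.
    """
    best_cfg = dict(baseline_cfg)
    best_bpb = run_experiment(best_cfg)
    for _ in range(budget):
        cfg = dict(best_cfg)
        key, value = random.choice(mutations)  # e.g. ("NUM_HEADS", 4)
        cfg[key] = value
        bpb = run_experiment(cfg)
        if bpb < best_bpb:  # keep only strict improvements
            best_cfg, best_bpb = cfg, bpb
    return best_cfg, best_bpb
```

In practice each `run_experiment` call here would be a short proxy run (e.g. a few hundred steps on MLX), with only the winners re-validated on the H100.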

Key Findings

| Technique | Relative BPB change | Notes |
| --- | --- | --- |
| NUM_HEADS=4, head_dim=128 | -0.095 | Fewer, larger heads |
| Step-based LR schedule | -0.483 | vs wallclock-based warmdown |
| BigramHash(16384) | -0.025 | vs 10240 |
| MATRIX_LR=0.03 | -0.003 | vs 0.02 |
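
The step-based schedule can be sketched as follows. The PR only states that scheduling by step count beat wallclock-based warmdown; the warmup length, warmdown fraction, and piecewise-linear shape below are assumptions for illustration (base LR 0.03 matches the MATRIX_LR finding, but the real schedule may differ).

```python
def lr_at_step(step, total_steps=800, base_lr=0.03,
               warmup_steps=80, warmdown_frac=0.4):
    """Step-based LR: linear warmup, flat plateau, linear warmdown.

    Hypothetical shape. Keying the schedule off `step` rather than
    elapsed wallclock time makes runs reproducible across hardware
    with different step throughput.
    """
    warmdown_start = int(total_steps * (1 - warmdown_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # warmup
    if step < warmdown_start:
        return base_lr                               # plateau
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmdown_start)  # decay to 0
```

A wallclock-based warmdown would instead compute the decay fraction from `time.time()`, which shifts the effective schedule whenever step time changes between machines.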

Results (1xH100, 800 steps, 600s)

  • Pre-quant val_bpb: 1.2913
  • Post-quant val_bpb (sliding window stride=256): 1.2756
  • Artifact size: 17.4MB (over the 16MB budget; needs int4/int5 MLP compression)
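
The stride=256 sliding-window eval above scores each token with close to the maximum available left context. A minimal sketch of the window bookkeeping, assuming a window size of 1024 (the PR only specifies the stride):

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each window sees up to `window` tokens of context but only the
    tokens in [score_from, end) contribute to the loss, so every token
    is scored exactly once. `window=1024` is an assumed size.
    """
    spans = []
    covered = 0   # tokens already scored
    start = 0
    while covered < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, covered))  # score [covered, end)
        covered = end
        start += stride
    return spans
```

After the first window, each subsequent window scores only its last `stride` tokens, so a smaller stride buys more context per scored token at the cost of proportionally more forward passes.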

Known Issues

  • Artifact exceeds 16MB due to head_dim=128 increasing param count. Compression optimization (int4/int5 MLP weights) needed to fit budget.
  • Tested on 1xH100 only. Requesting compute grant for 8xH100 validation.
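
The int4/int5 MLP compression proposed above could look like symmetric n-bit quantization. This is a minimal per-tensor sketch under assumed details; the repo's actual scheme (per-channel scales, bit packing, which layers keep int5/int6) is not specified in this PR.

```python
def quantize_symmetric(weights, bits):
    """Symmetric n-bit quantization of a flat list of floats.

    Maps values to integers in [-(2^(bits-1)-1), 2^(bits-1)-1] with a
    single scale. Hypothetical sketch of the proposed int4/int5 MLP
    weight compression.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid scale=0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized integers."""
    return [qi * scale for qi in q]
```

At int4 each weight needs 4 bits instead of 5-6, which is roughly the 10-20% size reduction required to bring a 17.4MB artifact under the 16MB budget, traded against extra quantization error in the MLP weights.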

Negative Results (also valuable)

  • LoRA test-time training: worse by 0.09 BPB
  • Block-wise weight sharing: worse + 2x slower
  • SwiGLU activation: worse than relu^2
  • MQA (NUM_KV_HEADS=1): worse quality
  • seq_len=4096: too slow per step

