@@ -0,0 +1,69 @@
# Autoresearch: Heads4 + Step-based LR + Sliding Window

**Non-record submission** — tested on 1xH100 only. Requesting compute grant for 8xH100 validation.

**val_bpb: 1.2727** (1xH100, 800 steps, sliding window eval stride=256)

## Approach: Automated Experiment Loop

This submission was developed using an autonomous experimentation methodology inspired by Karpathy's autoresearch. An automated experiment loop:
1. Modifies `train_gpt.py` with an experimental change
2. Runs training for a fixed time budget
3. Compares val_bpb against the current best
4. Keeps improvements, reverts failures
5. Repeats indefinitely
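
The keep-or-revert logic of the steps above can be sketched as follows. This is a minimal illustration, not the actual harness: `run_experiment_loop` and `toy_evaluate` are hypothetical names, and the real loop edits `train_gpt.py` and runs a timed training job where this sketch calls a stand-in scoring function.

```python
from typing import Callable, Dict, List, Tuple

def run_experiment_loop(
    baseline: Dict[str, float],
    candidates: List[Dict[str, float]],
    evaluate: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float, List[Tuple[Dict[str, float], float, bool]]]:
    """Greedy loop: apply each change to the current best config and keep it
    only if val_bpb drops; otherwise discard the change (the 'revert')."""
    best_cfg = dict(baseline)
    best_bpb = evaluate(best_cfg)
    history = []
    for change in candidates:
        trial_cfg = {**best_cfg, **change}   # step 1: apply the change
        bpb = evaluate(trial_cfg)            # step 2: train and measure
        kept = bpb < best_bpb                # step 3: compare with current best
        if kept:                             # step 4: keep or revert
            best_cfg, best_bpb = trial_cfg, bpb
        history.append((change, bpb, kept))
    return best_cfg, best_bpb, history

def toy_evaluate(cfg: Dict[str, float]) -> float:
    # Hypothetical stand-in for a training run; the formula is made up
    # purely so the loop has something to optimize.
    return 1.3 + 0.01 * abs(cfg["num_heads"] - 4) + abs(cfg["matrix_lr"] - 0.03)

best_cfg, best_bpb, history = run_experiment_loop(
    baseline={"num_heads": 8, "matrix_lr": 0.02},
    candidates=[{"num_heads": 4}, {"matrix_lr": 0.05}, {"matrix_lr": 0.03}],
    evaluate=toy_evaluate,
)
```

Because each trial starts from the current best configuration, accepted changes accumulate while rejected ones leave no trace beyond the history log.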

**75+ experiments** were run across three phases:
- **Phase 1 (Mac MLX, 40 experiments)**: Hyperparameter search discovered step-based LR schedule, optimal learning rates, and warmdown tuning
- **Phase 2 (1xH100 CUDA, 10 experiments)**: Validated Mac findings on CUDA, discovered NUM_HEADS=4 with head_dim=128 as a major architectural win
- **Phase 3 (1xH100 CUDA, 25 experiments)**: Built on current SOTA code, combining techniques from multiple leaderboard submissions

## Key Findings

### Confirmed Improvements (relative to SOTA baseline on 1xH100)
| Technique | Relative BPB Change | Source |
|-----------|-------------------|--------|
| NUM_HEADS=4, NUM_KV_HEADS=2 (head_dim=128) | **-0.095** | Our experiment |
| Step-based LR schedule (MWS=0) | **-0.483** | Our experiment |
| BigramHash 10240→16384 | -0.025 | Our experiment |
| MATRIX_LR 0.02→0.03 | -0.003 | Our experiment |

### Confirmed Non-improvements
| Technique | Result |
|-----------|--------|
| LoRA test-time training | Worse by 0.09 BPB; chunk-based eval hurts |
| Block-wise weight sharing (2x) | Worse + 2x slower |
| NUM_KV_HEADS=1 (MQA) | Worse quality |
| SwiGLU activation | Worse than relu^2 |
| seq_len=4096 | Too slow per step |
| 11 layers | Better BPB but over 16MB budget |
| BigramHash(20480) | Better BPB but over 16MB budget |
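
The size budget in the last two rows can be checked with simple arithmetic. The sketch below uses the byte counts reported in this submission's metadata; the helper name `artifact_size_report` is hypothetical, and treating the 16MB limit as 16,000,000 bytes is an assumption (the exact definition of the limit is not stated here).

```python
def artifact_size_report(bytes_model: int, bytes_code: int,
                         limit_bytes: int = 16_000_000) -> dict:
    """Sum model and code bytes and compare against the budget.
    limit_bytes defaults to a decimal 16 MB, which is an assumption."""
    total = bytes_model + bytes_code
    return {
        "bytes_total": total,
        "over_by": max(0, total - limit_bytes),
        "fits": total <= limit_bytes,
    }

# Byte counts taken from the submission metadata.
report = artifact_size_report(bytes_model=17_341_842, bytes_code=56_844)
print(report)
```

Under that assumption the artifact exceeds the budget by roughly 1.4 MB, so a config that adds further parameters (11 layers, a larger BigramHash) moves further away from fitting.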

## Architecture (built on SOTA)
- **10 layers**, 512 dim, **4 heads** (head_dim=128), **2 KV heads** (GQA)
- MLP 3x expansion (hidden=1536), relu^2 activation
- SmearGate + BigramHash(16384, dim=128) + orthogonal init
- SWA (start_frac=0.4, every=50 steps)
- Tied embeddings (FP16 passthrough, not int8 quantized)
- U-Net skip connections
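
To make the head geometry concrete, the following sketch derives the per-head and projection widths from the numbers listed above. `attention_shapes` is a hypothetical helper written for this illustration and is not taken from `train_gpt.py`.

```python
def attention_shapes(model_dim: int, num_heads: int, num_kv_heads: int,
                     mlp_mult: int = 3) -> dict:
    """Derive GQA projection widths from the model configuration."""
    if model_dim % num_heads != 0:
        raise ValueError("model_dim must be divisible by num_heads")
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    head_dim = model_dim // num_heads
    return {
        "head_dim": head_dim,
        "q_out": num_heads * head_dim,        # width of the query projection
        "kv_out": num_kv_heads * head_dim,    # width of each of K and V
        "queries_per_kv": num_heads // num_kv_heads,
        "mlp_hidden": mlp_mult * model_dim,
    }

shapes = attention_shapes(model_dim=512, num_heads=4, num_kv_heads=2)
print(shapes)
```

With 4 query heads sharing 2 KV heads, each KV head serves two query heads, and the K and V projections are half the width of the query projection.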

## Training
- Muon optimizer: matrix_lr=0.03, WD=0.04, momentum=0.99
- **Step-based LR schedule** (not wallclock-based): ITERATIONS=800, WARMDOWN_ITERS=170
- seq_len=2048, batch=786K tokens
- grad_clip=0.3
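
A step-based schedule depends only on the step counter, so it behaves identically regardless of how fast the hardware runs. The sketch below assumes a constant learning rate followed by a linear warmdown to zero over the final `WARMDOWN_ITERS` steps; the shape of the warmdown and the function name `lr_multiplier` are assumptions for illustration, not a copy of the training script.

```python
def lr_multiplier(step: int, iterations: int = 800,
                  warmdown_iters: int = 170) -> float:
    """Multiplier applied to the base LR at a given step.
    Assumed shape: flat at 1.0, then linear decay to 0.0 at the last step."""
    warmdown_start = iterations - warmdown_iters
    if step < warmdown_start:
        return 1.0
    return max(0.0, (iterations - step) / warmdown_iters)

# With ITERATIONS=800 and WARMDOWN_ITERS=170 the decay starts at step 630.
for s in (0, 630, 715, 800):
    print(s, 0.03 * lr_multiplier(s))  # scaled matrix_lr at that step
```

A wallclock-based schedule would instead replace `step / iterations` with elapsed time over the time budget, which ties the decay point to the speed of the machine.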

## Evaluation
- Sliding window eval, stride=256, compiled forward_logits
- Int5 MLP / Int6 attention / FP16 embeddings + zstd compression
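
The window layout of a sliding window evaluation can be sketched without a model. The version below assumes that windows of `seq_len` tokens advance by `stride` and that each window scores only the tokens not already scored by an earlier window, so every token is counted once with as much left context as possible. That scoring rule and the name `sliding_windows` are assumptions; the actual eval code may differ.

```python
from typing import List, Tuple

def sliding_windows(num_tokens: int, seq_len: int = 2048,
                    stride: int = 256) -> List[Tuple[int, int, int]]:
    """Return (window_start, window_end, score_start) triples.
    Tokens in [score_start, window_end) are the ones scored by that window."""
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end      # everything up to `end` is now scored
        start += stride        # slide the window forward
    return windows

windows = sliding_windows(num_tokens=3000)
scored = sum(end - score_start for _, end, score_start in windows)
print(len(windows), scored)
```

After the first window, each forward pass scores only `stride` new tokens, so a smaller stride gives each token more context at the cost of more forward passes.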

## Command
```bash
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Note: Tested on 1xH100 only (800 steps in 600s). On 8xH100 the same time budget is estimated to allow ~13,780 steps, which should improve BPB substantially; this has not been measured.

## Experiment Logs

The full experiment history is in `results_v3.tsv` (25 CUDA experiments) and `results.tsv` (40 Mac experiments).
@@ -0,0 +1,16 @@
{
"author": "Aryan Bhosale",
"github_id": "aryanbhosale",
"name": "Autoresearch: Heads4 + Step-based LR + Sliding Window",
"blurb": "Built on the SOTA (10L, int5/int6, BigramHash, SmearGate, SWA) with automated experiment loop (50+ experiments across Mac MLX and CUDA). Key improvements: NUM_HEADS=4 with head_dim=128 (-0.095 BPB relative improvement), step-based LR schedule, MATRIX_LR=0.03, BigramHash(16384), sliding window eval stride=256. Tested on 1xH100 only — requesting compute grant for 8xH100 validation.",
"date": "2026-03-21T00:00:00Z",
"val_bpb": 1.2756,
"pre_quant_val_bpb": 1.2913,
"note": "1xH100 result (800 steps, sliding window eval stride=256). Artifact is 17.4MB (over 16MB limit due to head_dim=128 increasing params). Needs int4/int5 MLP compression to fit. Non-record submission — requesting compute grant for 8xH100 validation.",
"step_stop": 800,
"wallclock_seconds": 541,
"eval_time_seconds": 317,
"bytes_total": 17398686,
"bytes_model_int6_zlib": 17341842,
"bytes_code": 56844
}