Diagnostics run on submission_170p/checkpoint_170p.pt (100% accuracy, 10010/10010 on AdderBoard).
Script: diagnostics_170p.py
14 tokens embedded in 2D (tok_dim=2). Tied with output head — embeddings double as classification centroids.
Token | dim0 dim1 | norm | angle
--------------------------------------------------
0 | 0.8237 -2.4393 | 2.575 | -71.3°
1 | 1.6545 -1.7787 | 2.429 | -47.1°
2 | 2.1099 -1.2656 | 2.460 | -31.0°
3 | 2.4182 -0.7553 | 2.533 | -17.3°
4 | 2.5908 -0.2124 | 2.600 | -4.7°
5 | 2.6024 0.3652 | 2.628 | 8.0°
6 | 2.4360 0.9194 | 2.604 | 20.7°
7 | 2.1178 1.4225 | 2.551 | 33.9°
8 | 1.5829 1.9504 | 2.512 | 50.9°
9 | 0.5531 2.5930 | 2.651 | 78.0°
PLUS | 0.9489 -1.0862 | 1.442 | -48.9°
EQUALS | -2.6499 1.8773 | 3.248 | 144.7°
EOS | -1.6019 -2.9403 | 3.348 | -118.6°
PAD | -1.1785 -0.2972 | 1.215 | -165.8°
Key findings:
- Digits 0-9 form a ~149° arc (from -71° to +78°), not a full circle
- Angular spacing: ~13° in the middle (digits 3-6), ~24-27° at extremes (0-1, 8-9)
- Tightest classification margins: digits 3-6 (margin ~0.28-0.29)
- PLUS sits dangerously close to digit 1: angle -48.9° vs -47.1°, separated only by norm (1.44 vs 2.43)
- EQUALS, EOS, PAD well-separated (margins >1.1) in opposite quadrants from digits
Pairwise distances (digits):
- Closest: 3↔4 (0.570), 4↔5 (0.578), 5↔6 (0.579)
- Farthest: 0↔9 (5.040)
Q-phase angles: [-35.3°, -10.3°] (head 0, head 1)
For answer position A_i, attends ~49%/49% to X_{i+2} and Y_{i+1}:
A0 → X2(49%) + Y1(49%)
A1 → X3(49%) + Y2(49%)
A2 → X4(49%) + Y3(49%)
A3 → X5(49%) + Y4(49%)
A4 → X6(49%) + Y5(49%)
A5 → X7(49%) + Y6(49%)
A6 → X8(49%) + Y7(49%)
A7 → X9(49%) + Y8(49%)
A8 → Zhi(97%) ← carry position
A9 → self(100%) ← pure self-attention
This is a carry-lookahead pattern. The model reads digits 1-2 positions ahead to predict incoming carries, rather than chaining through previous answer positions.
For A_i, splits ~32%/32%/32% across X_{i+1}, Y_i, and self:
A0 → X1(33%) + Y0(32%) + self(33%)
A1 → X2(32%) + Y1(32%) + self(32%)
...
A7 → X8(33%) + Y7(33%) + self(33%)
A8 → X9(32%) + Y8(32%) + self(32%)
A9 → self(100%)
Reads the current Y digit, next X digit, and own residual.
No A_{i-1} → A_i attention. Carry propagation is NOT sequential through answer tokens. The model computes carries from lookahead on input digits, implemented as fixed positional offsets.
The patterns are perfectly regular shifts, not content-dependent. Head 0 always does 50/50 on two specific offset positions. Head 1 always does 33/33/33 on three. The 26 params in q_proj + q_phase are learning a simple offset function.
Head 0: digit-pair=0.007, carry-chain=0.000, self=0.107
Head 1: digit-pair=0.298, carry-chain=0.008, self=0.391
Head 0: + → X1(97%) = → X1(49%) + Y0(49%)
Head 1: + → +(47%) + X0(47%) = → +(32%) + =(32%) + X0(32%)
FFN: 6→2 (fc1+bias, GELU) → 2→6 (fc2+bias). Bottleneck dim=2.
fc1 reads: tok=[-0.811, -1.986], pos=[0.020, 0.060, -0.036, -2.637], bias=0.782
||w_tok||=2.145, ||w_pos||=2.638 → reads mostly from pos
fc2 writes: tok=[0.336, 1.819], pos=[0.040, -0.061, -0.184, 1.419]
→ writes mostly to tok subspace
Reads positional information (where am I?), writes token correction (carry adjustment).
fc1 reads: tok=[0.604, 1.691], pos=[-0.018, 0.004, 0.376, 0.323], bias=0.769
||w_tok||=1.795, ||w_pos||=0.496 → reads mostly from tok
fc2 writes: tok=[-0.364, -2.178], pos=[0.093, -0.042, 0.071, 1.359]
→ writes mostly to tok subspace
Direct digit transformation.
- Max |weight| in fc1: 2.637
- Hand-coded models use ~60000 for hard step functions
- Our GELU transitions are soft — effective slope ~2.6x at threshold
- fc2 bias (constant additive to residual): [-0.636, 0.041, -0.475, 0.073, 0.707, -1.139]
amp: +2.9518 (init: +1.000, drift: +195%)
phase: -1.0631 (init: 0.000, drift: -61°)
slope: +0.1276 (init: +0.111, drift: +15%)
offset: -0.7446 (init: 0.000, drift: -0.74)
corr_slope: +0.0041 (init: 0.000, drift: +0.004)
corr_intercept: +2.3398 (init: 0.000, drift: +2.34)
Not freezeable. Total drift: 6.12 across 6 params. Amp tripled, phase rotated -61°, correction intercept grew to +2.34 (scaling all positions ~3.3x).
pos | dim0 dim1 dim2 dim3 | corr | norm
-----------------------------------------------------------------
0 | 4.7932 -8.6147 -2.4868 0.0000 | 2.340 | 10.167
1 | 8.9524 -4.1572 -2.0630 0.0000 | 2.344 | 10.084
2 | 9.6981 1.9011 -1.6382 0.0000 | 2.348 | 10.018
3 | 6.7367 7.2473 -1.2124 0.0000 | 2.352 | 9.969
4 | 1.1917 9.8349 -0.7855 0.0000 | 2.356 | 9.938
5 | -4.8226 8.6677 -0.3575 0.0000 | 2.360 | 9.925
6 | -9.0073 4.1827 0.0715 0.0000 | 2.364 | 9.931
7 | -9.7575 -1.9128 0.5015 0.0000 | 2.369 | 9.956
8 | -6.7779 -7.2917 0.9326 0.0000 | 2.373 | 9.999
9 | -1.1990 -9.8951 1.3647 0.0000 | 2.377 | 10.060
Critical: dim 3 is ZERO for all digit positions. The spiral only fills dims 0-2. Dim 3 is dead for digits but alive for special positions:
z_hi (carry): [-0.2282, 0.1359, -0.5375, -1.2559] norm=1.392
PLUS: [ 2.0737, -3.6982, 1.1445, 0.1785] norm=4.395
EQUALS: [-1.2123, -4.6892, -5.3494, 0.7297] norm=7.253
EOS (frozen): [ 0.0000, 0.0000, 0.0000, 0.0000]
z_hi's dim 3 (-1.2559) is its largest component — the carry position uses the "dead" dimension as its primary identifier. This means pos_dim=4→3 (d_model=5) drops the carry position's most important dimension.
Matrix Shape Singular Values Eff.Rank Cond
q_proj [6, 4] 6.24 4.25 4.00 3.01 3.86 2.1
v_proj [6, 2] 3.33 2.16 1.95 1.5
out_proj.A [6, 2] 3.48 1.57 1.86 2.2
out_proj.B [2, 6] 3.45 2.83 1.99 1.2
out_proj (A@B) [6, 6] 12.01 4.45 ~0 ~0 ~0 ~0 1.79 huge
fc1 [2, 6] 3.71 1.14 1.73 3.2
fc2 [6, 2] 2.89 1.96 1.96 1.5
head_proj [2, 6] 12.08 11.11 2.00 1.1
tok_emb (digits 0-9) [10, 2] 6.36 4.98 1.99 1.3
Key findings:
- q_proj is genuinely rank-4 (eff_rank=3.86). All 4 singular values contribute. No compression here.
- fc1 is nearly rank-1 (91.3% energy in SV1, eff_rank=1.73). But rank-1 fails during training (same as out_proj).
- head_proj is maximally rank-2 (condition=1.1). Both dims fully utilized.
- out_proj is effectively rank-2 as designed (SV: 12.0, 4.5, then ~0).
ln1: [0.375, 1.613, 1.002, 0.923, 0.961, 2.278] std=0.60
ln2: [0.743, 1.847, 0.354, -0.295, -0.312, 1.570] std=0.84
ln_f: [0.380, 11.823, 0.768, 0.021, 0.471, 10.061] std=5.00
ln_f is doing radical feature selection, not normalization:
- Dims 1, 5 amplified 10-12x
- Dims 3, 4 suppressed to near-zero (~0.02, ~0.47)
- This means the 6D residual is really a 2D output space + 4D scratch space
- All 3 norms are highly specialized (confirming they can't be shared or removed)
Attention is fixed positional routing with regular offsets (Head 0: +2/+1, Head 1: +1/0/self). The 26p Q/K machinery (q_proj=24, q_phase=2) learns what could be expressed as per-head positional offsets (~6-8p). However, ALiBi (which tried this) failed — likely because it uses distance-based slopes, not group-aware offsets (the model needs different shifts for X-group vs Y-group positions).
The model implements hardware-style carry lookahead: position i reads digits i+1 and i+2 to predict carries. No A_{i-1}→A_i attention. This is optimal for 1-layer autoregressive.
Dim 3 is the carry position's primary identifier. Shrinking pos_dim from 4→3 forces the carry position to share dimensions with digit positions, potentially degrading carry detection. This may explain d_model=5's plateau at 8-12% exact.
PLUS and digit 1 are 1.8° apart in embedding space. Removing PLUS from the vocabulary eliminates this near-collision and frees angular space for digit separation.
The final norm's extreme weights (0.02 to 11.8) show the model uses only 2 of 6 residual dims for output. The other 4 are internal scratch. This is relevant for d_model shrink — at d_model=4, the model has only 2 dims of scratch after reserving 2 for output.
FFN unit 0 reads position, writes token — this is how carry information (positional) becomes a digit correction (token). This pos→tok bridge is essential and must be preserved in any compression.