
Add Hybrid Depth-Recurrent Transformer submission#341

Open
tobiascanavesi wants to merge 1 commit into openai:main from tobiascanavesi:hybrid-depth-recurrent

Conversation

@tobiascanavesi

Hybrid Depth-Recurrent Transformer

This PR tests a new architecture that solves the int8-quantization error-compounding problem in depth-recurrent transformers.

Key Insight

Standard depth-recurrence shares all weights across loop iterations, so int8 rounding errors compound on every loop (a 0.40 BPB gap). The hybrid keeps the precision-sensitive layers near the input/output as unique weights; only the bulk middle layers are shared and looped.
Result: quantization gap reduced from 0.40 to near-zero (-0.004 BPB).
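The compounding effect is easy to demonstrate in isolation. The sketch below (a toy, not the repo's quantization scheme) quantizes one weight matrix to int8 and applies it repeatedly, as a shared depth-recurrent block would, so the rounding error re-enters the activations on every loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def int8_quantize(w):
    """Symmetric per-tensor int8 quantization (illustrative only)."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

dim, loops = 64, 5
w = (rng.standard_normal((dim, dim)) / np.sqrt(dim)).astype(np.float32)
q, s = int8_quantize(w)
wq = dequantize(q, s)  # the shared block's weights after int8 round-trip

x = rng.standard_normal(dim).astype(np.float32)
x_fp, x_q = x.copy(), x.copy()
for _ in range(loops):  # the SAME quantized block is reused on every loop
    x_fp = np.tanh(w @ x_fp)
    x_q = np.tanh(wq @ x_q)

# Divergence between full-precision and int8 paths after the looped depth
err = np.linalg.norm(x_fp - x_q) / np.linalg.norm(x_fp)
print(f"relative error after {loops} shared loops: {err:.4f}")
```

With unique (non-shared) layers, each layer contributes an independent rounding error once; with a shared looped block, the same error is injected on every pass, which is the gap the hybrid design targets.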

Architecture

  • 1 unique entry layer + 4 shared blocks × 5 loops + 1 unique exit layer = 22 effective layers from 6 weight blocks
  • U-Net skip connections across full effective depth
  • Per-virtual-layer scalars (attn_scale, mlp_scale, resid_mix, q_gain)
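The layer arithmetic above can be sketched as a forward pass. This is a minimal stand-in (a plain MLP block instead of the real attention+MLP block, and only one of the per-virtual-layer scalars); the class and parameter names are assumptions, not the identifiers in `train_gpt.py`:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a transformer block (the real one has attention + MLP)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, mlp_scale):
        # Per-virtual-layer scalar gates the residual branch
        return x + mlp_scale * self.mlp(x)

class HybridDepthRecurrent(nn.Module):
    """1 unique entry + (4 shared blocks x 5 loops) + 1 unique exit
    = 22 effective layers from 6 weight blocks."""
    def __init__(self, dim, n_shared=4, n_loops=5):
        super().__init__()
        self.entry = Block(dim)   # unique weights near the input
        self.shared = nn.ModuleList(Block(dim) for _ in range(n_shared))
        self.exit = Block(dim)    # unique weights near the output
        self.n_loops = n_loops
        n_virtual = 1 + n_shared * n_loops + 1  # 22 virtual layers
        # One scalar per VIRTUAL layer, so each loop pass gets its own gain
        self.mlp_scale = nn.Parameter(torch.ones(n_virtual))

    def forward(self, x):
        v = 0
        x = self.entry(x, self.mlp_scale[v]); v += 1
        for _ in range(self.n_loops):      # reuse the same 4 blocks each loop
            for blk in self.shared:
                x = blk(x, self.mlp_scale[v]); v += 1
        return self.exit(x, self.mlp_scale[v])
```

Note the scalars are indexed by virtual layer, not by weight block, so the five passes through a shared block are not forced to behave identically.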

Techniques

  • FP16 tied embedding passthrough during int8 quantization
  • Sliding window evaluation (stride=64, seq_len=1024)
  • Decoupled Muon weight decay (0.02)
  • Overtone spectral embedding init (SVD power-law shaping)
  • Phase-transition residual mixing initialization
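Of these, the sliding-window evaluation is the easiest to pin down from the stated numbers. A plausible sketch, assuming a byte-level vocabulary (1 token = 1 byte) and a `logprob_fn(window)` interface that returns per-token natural-log probabilities (both assumptions, not the repo's actual API):

```python
import math
import torch

def sliding_window_bpb(logprob_fn, tokens, seq_len=1024, stride=64):
    """Sliding-window eval: each 1024-token window scores only its last
    `stride` tokens, so every scored token sees near-full left context."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, len(tokens) - seq_len + 1, stride):
        window = tokens[start:start + seq_len]
        lp = logprob_fn(window)                 # (seq_len,) natural-log probs
        total_nll += -lp[-stride:].sum().item()  # score the trailing stride only
        total_tokens += stride
    # nats -> bits; bits per token == bits per byte under the byte-vocab assumption
    return total_nll / (total_tokens * math.log(2))
```

The small stride trades eval compute for context: with stride=64 each token is predicted with at least 960 tokens of history, at the cost of ~16x more forward passes than non-overlapping windows.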

Preliminary Results (2×H100)

| Seed | val_bpb | Steps | Artifact |
|------|---------|-------|----------|
| 1337 | 1.3323  | 954   | 14.2 MB  |

An 8×H100 run is pending; a significant improvement is expected with full compute.

Reproduce

WARMDOWN_ITERS=2500 MATRIX_LR=0.03 SCALAR_LR=0.03 TIED_EMBED_LR=0.04 torchrun --nproc_per_node=8 train_gpt.py

@MatoTeziTanka

Community Review — Add Hybrid Depth-Recurrent Transformer submission

BPB: 0.004 (cache parse — may be delta/std, not val_bpb; check PR title) | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA c821a8c4d174, file records/track_10min_16mb/2026-03-21_HybridDepthRecurrent/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=768, layers=22, vocab=1024, code=58154 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

