Native compiler with atlas-bound theorems: strict-lint · citation-enforced · no LLVM · no C-transpile
Atlas-bound · strict-lint · 8-stage gate · ε self-proof · n=6 perfect-number primitives · self-hosted
hexa-lang is a native compiler that carries its own theorem dictionary inside the binary. No LLVM. No C-transpile. Every formula in your code either cites the atlas or the build refuses to start. The stricter the gate, the cleaner the code that passes.
Note
Sister of n6 (semantic atom layer and atlas serialisation format), hxc (byte-canonical wire), and tape (operational trace). hexa-lang's atlas is unconditionally binary built-in (compile-time embedded into the compiler), and .n6 is the sister serialisation format emitted on demand by hexa atlas export for interop / inspection. Discovered laws are absorbed via GitHub PR directly into the embedded atlas, not through a runtime .n6 overlay. The wilson agent (dancinlab/wilson) is built end-to-end on hexa-lang.
@cite(L[sigma_phi_n_tau_iff_n_eq_6])
fn perfect_at_six() -> bool {
let n = 6
return sigma(n) == 2 * n // σ(6) = 12 = 2·6
&& phi(n) * tau(n) == 8 // ฯ(6)ยทฯ(6) = 2ยท4 = 8 = ฯ(n)โnโฯ(n)+1
}
// Untouched citation = HX8004 fatal at compile time:
//
// error[HX8004]: formula-bearing function does not cite atlas L[*]
// --> src/foo.hexa:14:1
// |
// 14 | fn area_of_circle(r: f64) -> f64 {
// | ^^^^^^^^^^^^^^^^^ formula here
// = note: cite an atlas law via `@cite(L[id])` or declare `@grace(HX8004, until=, reason=)`
// = help: hexa atlas search "πr²" → L[circle_area]
The compiler stays parked unless every formula either cites the atlas, has an active @verify, or carries an explicit @grace. There is no "we'll fix it after." There is no binary.
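The arithmetic behind `perfect_at_six` can be checked outside the compiler. The following is a minimal Python sketch, not hexa code: `sigma`, `phi`, and `tau` are plain brute-force stand-ins for the atlas primitives of the same name, and reading the law identifier `sigma_phi_n_tau_iff_n_eq_6` as "σ(n)·φ(n) = n·τ(n) exactly when n = 6 (for n > 1)" is an assumption based only on the identifier.

```python
from __future__ import annotations
from math import gcd


def sigma(n: int) -> int:
    """Sum of the positive divisors of n."""
    return sum(d for d in range(1, n + 1) if n % d == 0)


def tau(n: int) -> int:
    """Number of positive divisors of n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


def phi(n: int) -> int:
    """Euler's totient: count of 1 <= k <= n coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def perfect_at_six() -> bool:
    """Mirror of the two checks in the hexa snippet above."""
    n = 6
    return sigma(n) == 2 * n and phi(n) * tau(n) == 8


def law_solutions(limit: int) -> list[int]:
    """All n in 2..limit with sigma(n)*phi(n) == n*tau(n) (assumed reading of the law)."""
    return [n for n in range(2, limit + 1) if sigma(n) * phi(n) == n * tau(n)]


if __name__ == "__main__":
    print(perfect_at_six())
    print(law_solutions(500))
```

The search starts at 2 because n = 1 satisfies the equation trivially (all three functions equal 1).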
LLMs answer by recombining what their weights already contain: noise from inside a frozen well. hexa-lang generates from outside the well: every compile cycle produces a primitive the previous cycle could not express, then absorbs it as a new wall (@verify → atlas promote → tombstone retroactive sweep). The atlas grows; hallucination is mechanically excluded because every claim must trace to a citation.
The second pillar is enforcement at the build gate, not at runtime. Eight strict-lint stages (S0 parse → S1 resolve → S2 bind → S3 type → S4 domain → S5 units → S6 equational @verify → S7 proof @prove → S8 citation HX8004) reject formula-bearing code that doesn't cite. No annotations means no formula. A formula in a function that carries no citation is a hard error.
Third: n=6 perfect-number primitives. The compiler is a chef with a 4.2 MB atlas baked statically into the binary: 60,760 lines of P (primitives) / C (constants) / L (laws) / E (errors). Citing L[sigma_phi_n_tau_iff_n_eq_6] is one keystroke; if the law is wrong, every dependent gets a tombstone cascade with an auto-PR.
    .hexa source
         │
         ▼
    lex ─► parse ─► resolve ─► bind ─► types ─► domain ─► units ─► citation
                     (S1)      (S2)     (S3)      (S4)     (S5)      (S8)
         │
         │   any fatal stage → no binary
         ▼
    lower (HIR) ─► mono ─► MIR (SSA) ─► optimize ─► regalloc (LIR) ─► emit (asm)
         │
         ▼
    hexa_ld v1.1
    ELF64 + Mach-O arm64 static
         │
         ▼
    native binary
A binary appears only when every fatal stage passes. The atlas (4.2 MB) is baked in at compile time; runtime cost: 0 ms.
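The "any fatal stage means no binary" rule is a fail-fast pipeline. Here is a minimal Python sketch of that control flow, assuming a toy data model: the `Diagnostic` class, the dict-shaped compilation unit, and the `has_formula` / `cite` / `verify` / `grace` keys are all hypothetical illustrations, not the compiler's internal representation. Only the stage names and the `HX8004` code come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diagnostic:
    stage: str
    code: str
    message: str


Check = Callable[[dict], Optional[Diagnostic]]


def always_pass(unit: dict) -> Optional[Diagnostic]:
    """Placeholder for stages this sketch does not model."""
    return None


def citation_stage(unit: dict) -> Optional[Diagnostic]:
    """S8: a formula-bearing function needs a cite, a verify, or a grace."""
    for fn in unit["functions"]:
        covered = fn.get("cite") or fn.get("verify") or fn.get("grace")
        if fn.get("has_formula") and not covered:
            return Diagnostic(
                "S8", "HX8004",
                f"formula-bearing function {fn['name']} does not cite atlas L[*]",
            )
    return None


STAGES: List[Tuple[str, Check]] = [
    ("S1 resolve", always_pass),
    ("S2 bind", always_pass),
    ("S3 type", always_pass),
    ("S4 domain", always_pass),
    ("S5 units", always_pass),
    ("S8 citation", citation_stage),
]


def run_gate(unit: dict, stages: List[Tuple[str, Check]]):
    """Return (binary, None) on success or (None, diagnostic) at the first fatal stage."""
    for _name, check in stages:
        diag = check(unit)
        if diag is not None:
            return None, diag  # lowering never starts, so no binary exists
    return "native binary", None
```

Because the gate returns at the first diagnostic, the lowering half of the pipeline is never reached for a failing unit, which is the behaviour the diagram describes.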
🔥 flame · 🔧 forge · ⚡ hexa-cuda: the hexa GPU stack (train · substrate · kernel-authoring)
stdlib/flame is what you build with hexa-lang: a compiler-only neural-network training stdlib (autograd tape · layers · optimizers · tensor primitives) lowered through the same 8-stage strict-lint gate that compiles the compiler itself. No PyTorch wrapping, no ATen import, no Python in the trained binary.
self/forge is what flame calls into: a GPU substrate that pairs device-resident hexa arrays (farr) with vendor-grade kernels (cuBLAS Dgemm + 11 hand-emit .cu kernels covering the elementwise / reduction / norm surface) under a byte-equal correctness contract, plus a BF16 Tensor-Core "mega-kernel" path (RFC 049/060) for the in-kernel-GEMM regime where vendor libs are reachable.
@gpu_kernel → nvptx (hexa-cuda) is how you author a GPU kernel without leaving hexa: annotate a function @gpu_kernel, write it with the device intrinsics (gpu_thread_id_x · @shared let · gpu_barrier · gpu_atomic_add · gpu_warp_shuffle), and hexa build --target=nvptx emits ptxas-clean PTX for sm_80 / sm_90: no .cu, no nvcc, no CUDA-C transpile (silicon-proven: vec-add / saxpy run bit-exact on a native H100). It is the kernel-authoring primitive that forge's own device kernels and your own custom kernels both share; you practice it in hexa dojo.
The three pillars (flame:forge :: torch:ATen, with hexa-cuda as the kernel-authoring leg both rest on):
                 hexa source (.hexa)
                          │
        ┌─────────────────┴───────────────────┐
        │ 🔥 flame: NN training stdlib         │  ← what you TRAIN with
        │ t_* tensor · ag_tape autograd       │    (no Python in the binary)
        │ nn_lib layers · opt_* optimizer     │
        └─────────────────┬───────────────────┘
                          │ rides
                          ▼
        ┌─────────────────────────────────────┐
        │ 🔧 forge: GPU substrate              │  ← what flame CALLS INTO
        │ farr device array · own-GEMM        │    cuBLAS Dgemm + 11 .cu
        │ BF16-TC mega-kernel                 │    RFC 040/041/049/060
        └─────────────────┬───────────────────┘
                          │ device kernels authored in
                          ▼
        ┌─────────────────────────────────────┐
        │ ⚡ hexa-cuda: @gpu_kernel → nvptx    │  ← how you WRITE a GPU kernel
        │ gpu_thread_id · @shared · barrier   │    hexa → PTX → sm_80 / sm_90
        │ no .cu · no nvcc · compiler emits   │    practice: `hexa dojo`
        └─────────────────┬───────────────────┘
                          │ hexa build (8-stage strict-lint gate)
                          ▼
                 A100 / H100 native
| layer | scope | measurement |
|---|---|---|
| forge substrate | RFC 040 device-farr + cuBLAS Dgemm · RFC 041 11-op .cu | 12 byte-equal fires across the elementwise / reduce / GEMM surface, max\|Δ\| = 0 |
| flame layers | rmsnorm · attn-fwd · attn-bwd · silu-gate | 4 byte-equal oracle fires, max\|Δ\| = 0 |
| flame ag_tape | generic autograd through the same oracles | derivation byte-equal, abstraction pays no correctness tax |
| path | measurement | note |
|---|---|---|
| forge BF16-TC mega-kernel (RFC 049 Stage 1, A100) | 9.67× faster than FP64 cuBLAS @ Llama-7B FFN shape | $0.10 fire · paradigm verdict from Phase R 14-fire $2.91 campaign |
| forge Phase R / RFC 060 closure | FP64 mega-kernel KILLED (1.8-4.4× slower than per-op) · BF16 substrate PASS | RFC 060 100% closure · BF16-TC is the cuBLAS-relative wall path |
| flame ag_tape d=768 · 12-layer (A100) | per-step wall recorded · PyTorch wall speedup NOT measured | prior README "2.95× / 1.26-1.76× faster than PyTorch eager" was a unit mismatch (full-run / 1-step); RETRACTED per stdlib/flame/README.md correction 2026-05-19 |
| flame batch-fill SM-fill (CLMConvMoE, H100) | ≥1.3× @B=2, 2.95× @B=32 self-speedup (byte-eq B=1 max\|Δ\|=0; B>1 causal-conv seam-only Δ) | batch FILLS the SMs (B=1 under-fills, util 1-2%); capped ~3× by the interpreted per-step glue (token-pack + CE-grad + AdamW ∝ B·Tw); the ~3× cap is STRUCTURAL (serial un-fused FP64 op-DAG); uncap = precision-change OR right-sized GPU, NOT interpreter (#2915) |
| flame vs PyTorch (H100, batch=1) | torch eager ~1656× / torch.compile ~2207× faster than flame FP64 | honest: flame value = byte-exact · device-resident · no-LLVM, NOT step-rate; kernel-fusion (capture/replay, fwd+bwd) = ~1.0× closed-neg |
Honest scope: flame's ag_tape + nn_lib + opt_* are functionally complete and byte-equal-verified; forge's farr + cuBLAS Dgemm + 11 .cu substrate is complete with the BF16-TC mega-kernel landing as the cuBLAS-relative wall path. The end-to-end flame vs PyTorch wall comparison is pending an apples-to-apples re-fire; the substantive cuBLAS-relative win currently sits at the forge layer (BF16-TC 9.67× over FP64-cuBLAS on the FFN-shape mega-kernel).
The substrate above calls cuBLAS for the GEMM itself, the one piece forge did not own. The CUDA-OWN campaign closes that last gap: an env-gated own-GEMM (HEXA_OWN_GEMM family) routes every matmul through a hexa-emit kernel instead of cuBLAS. It is OFF by default, so cuBLAS stays the default path; flip the env and the entire device GEMM is hexa source: FP64, FP32, and a CUTLASS-grade TF32 WMMA2 tiled kernel.
forge GEMM dispatch (env HEXA_OWN_GEMM / _WMMA2; OFF == cuBLAS default):
    OFF → cuBLAS Dgemm / Sgemm (vendor, default)
    ON  → _hx_k_gemm (FP64) · _hx_k_sgemm_cm (FP32) · _hx_k_sgemm_cm_wmma2 (TF32 WMMA2)
    └─ launcher precedence WMMA2 > WMMA > TILED > naive ── 100 % hexa-ownable
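The dispatch rule above (OFF falls back to cuBLAS, ON picks the highest-precedence enabled variant) can be sketched as a small selection function. This is a hedged Python illustration, not forge's launcher: only `HEXA_OWN_GEMM`, the `_WMMA2` suffix, and the precedence order come from the text. The `HEXA_OWN_GEMM_WMMA` and `HEXA_OWN_GEMM_TILED` variable names, the `"1"` on-value, and the rule that sub-flags are ignored while the master flag is off are assumptions made for the example.

```python
from typing import Mapping

# Highest precedence first, as stated: WMMA2 > WMMA > TILED > naive.
PRECEDENCE = ["WMMA2", "WMMA", "TILED"]


def _on(env: Mapping[str, str], key: str) -> bool:
    """Assumed convention: a flag is on only when it is exactly "1"."""
    return env.get(key, "0") == "1"


def select_gemm(env: Mapping[str, str]) -> str:
    """Return which GEMM path a launcher following the stated precedence would pick."""
    if not _on(env, "HEXA_OWN_GEMM"):
        return "cuBLAS"  # default path: own-GEMM is OFF
    for variant in PRECEDENCE:
        # Hypothetical env naming: HEXA_OWN_GEMM_<VARIANT>
        if _on(env, f"HEXA_OWN_GEMM_{variant}"):
            return variant
    return "naive"  # own-GEMM on, no tiled/tensor-core variant requested


if __name__ == "__main__":
    import os
    print(select_gemm(os.environ))
```

Passing the environment as a mapping keeps the function pure, so the precedence can be exercised without touching the real process environment.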
Correctness first (own-GEMM vs cuBLAS oracle):
| own-GEMM path | shape / harness | correctness vs cuBLAS oracle | verdict |
|---|---|---|---|
| FP64 _hx_k_gemm (clm_prod train, cuBLAS-GEMM-free) | D1536 real-corpus train, both arms | max\|Δ CE\| = 0.00000 @ 5-dec, CE descends 4.46624 → 3.64669 | 🟢 F-FUSION-P1-OWN-GEMM-CORRECTNESS |
| FP32 _hx_k_sgemm_cm (hxqwen14b train, cuBLAS-GEMM-free) | M=N=K=2048 R=16 GEMM-bound | rel-RMS ~1e-6 (worst 9.70e-7) all outputs, within fp32 tol | 🟢 F-FUSION-P1D-LLM-SGEMM |
| TF32 _hx_k_sgemm_cm_wmma2 (CUTLASS-grade tiled) | M=N=K=2048 + non-tile-multiple bounds | rel-RMS 2.6e-4 ≪ 3e-3 TF32 bar, bounds-guarded | 🟢 F-FUSION-CUTLASS-GRADE-WMMA |
Performance: ≈ cuBLAS-CLASS, NOT superiority (this is parity, stated plainly):
| measurement | own-GEMM | cuBLAS | gap | verdict |
|---|---|---|---|---|
| sustained-loop GPU util (2048³, B200, nvidia-smi) | 89.9 % MEAN / 100 % PEAK | 88.5 % MEAN / 100 % PEAK | both ~90 %, cuBLAS-class occupancy | 🟢 F-FUSION-OWN-GEMM-UTIL |
| GEMM-iso step-time (2048³ WMMA2 vs cuBLAS) | 0.77047 ms/iter | 0.68 ms/iter ref | 1.13× of cuBLAS (within ~13 %) | 🟢 F-FUSION-CUTLASS-GRADE-WMMA |
| LLM full-step (LoRA, M=8192, shape-dispatch + split-K) | 454.9 steps/s | 565.2 steps/s | 1.24× of cuBLAS (down from raw 2.24×) | 🟢 F-FUSION-THRU-PARITY · F-FUSION-SPLITK-SKINNY |
The full-step gap closed in two landed steps, measuring the gap as the excess over 1.0× parity: skinny-shape dispatch (16×16 tiled) took the raw 2.24× → 1.67× (~46 % of the gap, F-FUSION-THRU-PARITY), then a split-K skinny GEMM took 1.67× → 1.24× (a further 64 % of what remained, F-FUSION-SPLITK-SKINNY), cumulatively ~80 % of the original 2.24× gap closed.
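The percentages quoted for the two steps follow from treating the "gap" as the slowdown ratio's excess over 1.0× (exact parity). A short Python check of that arithmetic, using only the figures stated above; the helper name `excess_closed` is illustrative:

```python
def excess_closed(before: float, after: float) -> float:
    """Fraction of the excess over 1.0x parity removed by moving before -> after."""
    return (before - after) / (before - 1.0)


# Slowdown ratios relative to cuBLAS, as reported in the text.
raw, after_dispatch, after_splitk = 2.24, 1.67, 1.24

step1 = excess_closed(raw, after_dispatch)            # skinny-shape dispatch
step2 = excess_closed(after_dispatch, after_splitk)   # split-K skinny GEMM
total = excess_closed(raw, after_splitk)              # both steps together

# The 1.24x full-step figure is the throughput ratio cuBLAS / own-GEMM.
full_step_ratio = 565.2 / 454.9

if __name__ == "__main__":
    print(f"step 1 closed {step1:.1%} of the gap")
    print(f"step 2 closed {step2:.1%} of what remained")
    print(f"cumulative    {total:.1%}")
    print(f"full-step ratio {full_step_ratio:.2f}x")
```

The two per-step fractions do not add to the cumulative figure because the second one is taken against the already-reduced gap.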
⚠ ARCH CAVEAT: the 1.13× iso figure is Blackwell-sm_120-ONLY; native sm_90 H100 is a different story (F-FUSION-WMMA2-SM90-VERIFY #2796 · F-FUSION-SM90-DYNSHARED-FIX). The WMMA2 kernel's staging is 57 344 B of shared, which exceeds the sm_90 per-block static `__shared__` cap (49 152 B), so on native Hopper sm_90 the kernel originally did not launch at all (`cudaErrorInvalidValue`); the 1.13× was measured on Blackwell sm_120, whose larger static admit absorbed it. F-FUSION-SM90-DYNSHARED-FIX converts the staging to `extern __shared__` (dynamic) + `cudaFuncSetAttribute(...MaxDynamicSharedMemorySize, 57344)`; this DOES make WMMA2 LAUNCH on sm_90 (compute_cap 9.0 verified, correctness rel-RMS 4.77e-06 PASS). But native-H100 PARITY is NOT restored: on sm_90 the own WMMA2 GEMM measures 1.49 ms/iter @ 2048³ = 29.46× slower than cuBLAS (0.0507 ms/iter), because the kernel is register/occupancy-bound on Hopper (REG:236/thread → ~1 block/SM). The Blackwell 1.13× did not transfer to native sm_90: that is a separate occupancy axis, not the shared-mem launch fix.
↳ occupancy axis TESTED and RULED OUT (
F-FUSION-SM90-WARPTILE-RETUNE, closed-negative). A register-reduced WMMA2 variant (_hx_k_sgemm_cm_wmma2_rr, envHEXA_OWN_GEMM_WMMA2_RR, default OFF, math-identical) cut registers 236 โ 128/thread via__launch_bounds__(256,2)+ streamed input fragments, doubling occupancy 1 โ 2 blocks/SM โ yet on native sm_90 H100 own throughput did not rise: 11.1 โ 10.7 TFLOP/s (-4%, the cuBLAS gap widened 31.4ร โ 32.5ร; rel-RMS 4.77e-06 both). So register/occupancy is NOT the binding constraint โ the WMMA2 own-GEMM is bound by its inner-loop math pipeline (per-element software__float_to_tf32, depth-1 cp.async prefetch, scalar epilogue), not block count. Closing the ~31ร sm_90 gap needs a cuBLAS-class TC mainloop rework (deep pipelining,ldmatrix/mma.syncw/o per-element TF32 rounding, register-blocked accumulation), a multi-session pipeline rewrite โ not a one-knob retune.โณ cuBLAS-class mainloop ATTEMPTED โ 3/4 levers landed, mma.sync is the ceiling (
F-FUSION-SM90-CUBLAS-MAINLOOP, closed-negative). A reworked own-GEMM (_hx_k_sgemm_cm_wmma2_cb, envHEXA_OWN_GEMM_WMMA2_CB, default OFF) landed three of the four cuBLAS-class levers: L1 hardware-TF32mma.sync.m16n8k8(round fused viacvt.rna.tf32.f32, dropping the per-element software__float_to_tf32sweep), L2 a deep multi-stagecp.asyncpipeline, and L4 a register-blocked epilogue (4 D-regs/thread written straight to col-major C). The fourth, L3ldmatrix, did NOT land โldmatrix.x4is a 16-bit-element op and the TF32 operands are 32-bit (named residual: a 32-bitldmatrix.transswizzle). On native sm_90 H100 (2048ยณ) the CB variant is numerically exact (rel-RMS 0.000e+00, bit-equal to the cuBLAS-TF32 oracle on this seed) and +3.4% faster than the parent (11.17 โ 11.55 TFLOP/s, gap 30.4ร โ 29.4ร โ only ~3.4% of the cuBLAS gap closed). The on-pod stage-depth sweep saturated at 2 stages (a second independent confirmation that globalโshared latency is not the bottleneck). Finding: the binding constraint is themma.syncwarp-level instruction class itself โ on Hopper, cuBLAS reaches TC peak viawgmma.mma_async(warpgroup-level async MMA) fed by TMA (cp.async.bulk.tensor), a different instruction classmma.synccannot reach regardless of mainloop tuning. The remaining ~29ร is a wgmma + TMA rewrite (a CUTLASS-3.x-class sm_90a kernel), not anmma.syncmainloop retune. Honest scope: parity-seeking, cuBLAS = roofline, no superiority claim.โณ wgmma + TMA rewrite โ FEASIBILITY PASS, layout residual (
F-FUSION-SM90-WGMMA-TMA). The named wgmma/TMA lever is now build- and run-feasible on native sm_90 H100: with-arch=sm_90a(nvcc 12.6),wgmma.mma_asyncexecutes correctly (f16 probe: nonzero 2048/2048, sum 1962.49 vs ref 1956.87) and the entire Hopper async PTX surface compiles โcuTensorMapEncodeTiled(TMA) +cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes+mbarrier.*. This converts the priorF-FUSION-ATTN-WGMMA-WALLhardware-blocked closed-negative (same kernel silently NOP'd on Blackwell sm_120) into testable-on-Hopper. The own-source emit path is NOT the blocker. The standalone TF32 warpgroup mainloop builds + launches but is not yet bit-correct (rel-RMS 1.309e+00 vs 3e-3); an isolated descriptor sweep + a structured-input diagnostic proved the binding residual is the wgmma no-swizzle 8ร16B core-matrix shared-memory layout of the operands (NOT the instruction, descriptor offsets, or the verified-correct epilogue registerโC mapping). No own GFLOP/s is reported โ g5 forbids perf on a wrong-result kernel; parity NOT measured. Kit:self/native/wgmma/. Honest scope: parity-seeking, cuBLAS = roofline, no superiority claim.โณ swizzle wedge โ residual PINNED to the B-operand K-core stride, NOT a permutation (
F-FUSION-SM90-WGMMA-SWIZZLE, closed-negative this pass). An on-hardware reverse-engineering (distinct-ramp operand + one-hot selector; native H200 sm_90a, nvcc 12.6, pod DESTROYED leak 0) isolated the rel-RMS 1.309 residual to two superimposed defects in the wgmma B no-swizzle core-matrix layout: (1) a K-stride collapse โ for contraction index k=1..7 wgmma re-reads B's K=0 core-matrix (the decoded kโฒ is pinned to 0 for every K-selector), the dominant โโ2 error; (2) an N-octet interleave (output col n reads logical col โ4n within an 8-wide octet). A >2300-config exhaustive sweep โ A/B shared layout โ {plain row-major, two 8-row-strip core tilings, col-major-B} ร descriptor (LBO,SBO) โ {16โฆ512} for both operands ร 3 epilogue register-maps, fault-isolated per-process โ found no config below rel-RMS 1.36, deterministically ruling out the hypothesis that the residual is a plain-layout/offset/epilogue permutation. The fix requires the genuine CUTLASSGMMA::Layoutcore-matrix builder (B's 8 K-values forming a contiguous 8-row core-matrix, descriptor LBO = one-core-matrix stride, swizzle field matched to the TMAcuTensorMapEncodeTiledswizzle mode), verified FIRST on the single-tile decode probe to kโฒ==KSEL identity before any 2048ยณ run. Kit:self/native/wgmma/{wgmma_tf32_decode,wgmma_tf32_bdecode,wgmma_tf32_full}.cu. Still parity-seeking, no perf number on a non-bit-correct kernel (g5).โณ GMMA::Layout core-matrix builder โ wgmma TF32 is now BIT-CORRECT on native sm_90a (
F-FUSION-SM90-WGMMA-GMMA-LAYOUT, ๐ข numerical). The swizzle is SOLVED. The root cause of the >2300-config dead end was a single wrong constant: a wgmma core matrix is 8 rows ร 16 bytes = for TF32 (4 B/elem) 8 rows ร 4 ELEMENTS, not the 8ร8 K-strip the prior kit assumed โ that 8-vs-4 mismatch is both pinned defects (K-stride collapse + N-octet interleave). Implementing the real CUTLASS-3.x GMMA INTER (no-swizzle) 8ร4 core-matrix layout (gmma_phys = (strip*2+kcore)*32 + sr*4 + kc, descriptorstart[0,14) LBO[16,30) SBO[32,46) layout_type[62,64)=INTERLEAVE, LBO=128 B / SBO=256 B inter-core strides) made the single 64ร64ร8 wgmma tile bit-exact (W2 rel-RMS 0.000e+00, native H100 SXM cc 9.0,-arch=sm_90a, nvcc 12.6, pod DESTROYED leak 0). Scaling to the full 2048ยณ GEMM revealed a second, separate defect โ a K-loop async-proxy ordering bug (wgmmareads shared through the async proxy, which ordinary__syncthreadsdoes not order against generic stores; non-deterministic 3e-2โฆ1e-1 past Kโ1536). Addingfence.proxy.async.shared::ctaafter staging makes the K-loop bit-exact at 2048ยณ (W3 own-vs-cuBLAS-TF32 & own-vs-CPU-f64 both rel-RMS 0.000e+00, deterministic). Parity is now MEASURABLE (g5 satisfied): the naive single-wgmma-per-block kernel runs 20.2 TFLOP/s @ 2048ยณ (cuBLAS-TF32 357.5, 17.67ร off, PARITY=NO); a first pipeline tune (wide-N TN=128, 2 wgmma/K-step reusing A) nearly doubles it to 38.0 TFLOP/s (9.35ร off), still bit-exact (TN=256 is slower โ register/occupancy bound). The own-GEMM is now provably CORRECT on sm_90a; the remaining gap is a pure latency-hiding residual โ a full warp-specialized TMA multi-stage CUTLASS mainloop (cp.async.bulk.tensorproducer +wgmmaconsumer, deep pipeline), a multi-session build โ NOT the layout (solved) and NOT correctness (bit-exact 2048ยณ). cuBLAS = roofline, no superiority claim. 
Kit: self/native/wgmma/{wgmma_tf32_gmma,wgmma_tf32_gemm2048,wgmma_tf32_gemm_w5,wgmma_tf32_gemm_w5b}.cu.
↳ sm_90a wgmma+TMA own-GEMM (Hopper): the latency-hiding ladder W6 to W10, then the canonical-atom leap to TF32 cuBLAS-PARITY (F-FUSION-SM90-WGMMA-OG16/OG17, 🟢 PARITY). With the layout solved and the kernel bit-exact, the residual is pure async-pipeline engineering, and it is now being walked down rung-by-rung on native sm_90a H100, each rung bit-exact (rel-RMS 0.000e+00 vs the cuBLAS-TF32 oracle), perf reported only because the kernel is bit-correct (g5).

sm_90a wgmma+TMA own-GEMM ladder (TF32, @4096³, native H100 sm_90a, bit-exact rel-RMS 0):
| rung | lever | own TFLOP/s | gap vs cuBLAS-TF32 (~431) | occupancy |
|---|---|---|---|---|
| W6 | async cp.async pipeline | 50.7 | 8.39× off | - |
| W8 | HW-TMA single-elected-thread producer | 66.5 | 6.44× off | 2 CTA/SM |
| W10 | composed swizzle-decode (permute-free) | 70.7 | 6.09× off | 2 CTA/SM |
| OG16 | canonical-atom (global re-encode · band-free) | 264.7 | 1.37× off | 2 CTA/SM |
| OG17 | + relaxed-pipeline ping-pong (S=2048) | 280 | 1.24× off, PARITY | 2 CTA/SM |

cuBLAS-TF32 = ~431 TFLOP/s roofline; OG17 CROSSES PARITY (≤1.3×) @ S=2048, bit-exact (NOT beaten, no superiority). FP16 port (OG18/OG19): own 504 TFLOP/s · 13.37× → 1.64× off cuBLAS-FP16 (recipe generalizes; 2× roofline = the ceiling).

W6's async pipe (50.7) gave way to W8's hardware-TMA producer: a single elected thread drives `cp.async.bulk.tensor` while the TMA engine does the global→shared copy, freeing the producer warpgroup and shrinking the CTA to 2 CTA/SM (66.5 TFLOP/s, 6.44× off, F-FUSION-SM90-WGMMA-W8). W10 then composes the FP32 `SWIZZLE_128B` law with the GMMA-INTER 8×4 core packing into a software composed-decode, eliminating the per-K-step permute scratch (SASS 28 STS → 0) so the kernel stays permute-free at full 2 CTA/SM: 70.7 TFLOP/s @4096³, 6.09× off cuBLAS-TF32 (430.8), bit-exact rel-RMS 0 (F-FUSION-SM90-WGMMA-W10, PRs #2841/#2847). At the W10 rung PARITY IS NOT ACHIEVED: at 6.09× off, cuBLAS-TF32 remains the roofline and there is NO superiority claim. This is honest parity-seeking progress: the win is that the wgmma+TMA own-source path is bit-exact, and the named residual is async-pipeline engineering (larger tiles / warp-specialization / ping-pong, per research note #2846), NOT a missing algorithm and NOT a correctness gap. Kit: self/native/wgmma/{wgmma_tf32_warpspec,wgmma_tf32_w10}.cu.
↳ OG16→OG17: TF32 own-GEMM CROSSES cuBLAS PARITY on native sm_90a (
F-FUSION-SM90-WGMMA-OG16/OG17, ๐ข PARITY, PRs #2866/#2870). The named async-pipeline residual is now CLOSED at S=2048. OG16 found the real lever the W-ladder was missing: re-encode A/B in global memory into the canonical CuTeLayout_K_SW128/gmma-INTER atom + a no-swizzle TMA, so the SMEM tile is the wgmma operand (descriptor-direct) โ the 32 KB in-kernel decode band is GONE, not just removed-then-decoded. That dissolved the decode-bandโฅoccupancy wall (OG11โOG15 had FALSIFIED a descriptor-field-only fix) and took own 70.2 โ 264.7 TFLOP/s, 6.09ร โ 1.37ร, smem 96โ64 KB @ 2 CTA/SM, bit-exact rel-RMS 0 @2048ยณ&4096ยณ. OG17 then added the relaxed-wait_group 1ping-pong pipeline (next K-slab's wgmma issue overlaps this slab's tensor-core drain, mbarrier-ring safe, 2 CTA/SM held) โ own 280 TFLOP/s, ratio 1.24ร = PARITY (โค1.3ร) @ S=2048, bit-exact, โ81 % of cuBLAS-TF32. The 'native-H100 own-GEMM can't reach cuBLAS-TF32' wall is CLOSED. Honest residual: @4096 stays 1.56ร (a 256-tile register/ptxas wall, closed-neg). OG18/OG19 port the SAME recipe to FP16/BF16 (re-derived 8ร8/128B f16 atom) โ own 61 โ 504 TFLOP/s, 13.37ร โ 1.64ร off cuBLAS-FP16 (+8.2ร, bit-exact rel_rms 0): the canonical-atom recipe generalizes across dtype, but FP16 PARITY is NOT crossed (1.56โ1.64ร) because cuBLAS-FP16's roofline is ~2ร TF32 and the residual is occupancy/pipeline-depth on that doubled roofline. cuBLAS = roofline throughout, parity-seeking, no superiority claim. The reusable recipe (canonical-atom ยท relaxed-pipeline ยท the under-fill/saturated regime law) is folded intocommons.tapeg82 + the hexa dojo (PR #2869). Kit:self/native/wgmma/{wgmma_tf32_og16,wgmma_tf32_og17,wgmma_fp16_og18}.cu.
Util is a workload-size property, not a defect (F-FUSION-D2-RIGHTSIZED): the byte-identical D1536 own-GEMM step that under-fills an idle H100 to ~13 % MEAN (median 2 %) saturates a right-sized RTX 5070 to 98.00 % MEAN (every sample 98 %, SM 98 %, compute-bound); the 2048³ large shape gives 99 % on the same 5070. Low util on the H100 is the H100 being too big for a D1536 model, not a codegen flaw; given a GPU sized for the workload, util is at the saturation ceiling.
The sizing axis is now measured-exhausted on the real full step: util-GREEN is structural, not a knob (F-FUSION-M3 · F-FUSION-M5). Two falsifiers ran on the real clm_prod training step (not a standalone GEMM) on an idle H100: scaling the model D 1536→4096 makes util worse (MEAN 10.57 % → 6.64 %, F-FUSION-M3), and growing the batch B 1→32 (GEMM M-dim 512→16384) leaves util flat at ~0.45 % (F-FUSION-M5, MEDIAN 0 % throughout). Neither bigger-model nor bigger-batch fills the GPU. The wall is the serial per-step structure: FP64-GEMM bursts separated by per-position glue idle the device between launches, so MEAN util is a duty-cycle invariant to workload height. The only measured lifts are a precision change (TF32 megakernel, +3 to 5 pp below) or a right-sized GPU (the RTX 5070 above), not scale. This is the honest closure of the util-99 % north-star: on a big GPU at small D/B, 99 % is unreachable by sizing.
Honest limits (g5): the own-GEMM is ≈ PARITY, not superiority: it is still 1.13× (iso, Blackwell sm_120) to 1.24× (full-step) slower than cuBLAS, never faster. The 1.13× iso parity is sm_120-specific: on native Hopper sm_90 the same WMMA2 GEMM is 29.5× off cuBLAS (occupancy-bound) and originally would not even launch; see the ARCH CAVEAT above. The README's existing honesty that "a single huge GEMM already ties cuBLAS at roofline" still holds: owning the GEMM does not change that ceiling, it just makes the ceiling hexa-owned. And the BF16-TC 9.67× above is a separate dtype axis (BF16-TC vs FP64-cuBLAS); it is NOT the own-vs-cuBLAS same-dtype comparison reported here. The win of owning the GEMM is not speed; it is capability cuBLAS structurally cannot offer.
⭐ The trade, stated plainly: you pay a small parity tax to buy four capabilities cuBLAS cannot give
A ~13-24 % same-dtype speed tax buys a column cuBLAS leaves empty. These are capability wins (what is possible), not speed wins, and they are exactly what hexa's domains (reproducible science, byte-equal CLM/RTSC, megakernel fusion) require:
| capability you gain | own-GEMM | cuBLAS | proof / where it pays off |
|---|---|---|---|
| ๐ฒ Determinism (byte-exact) | bit-reproducible by construction โ you fix the reduction order | DMMA accumulation order is vendor-"unspecified", drifts across GPU generations, un-matchable from outside | the byte-eq capstone (max|ฮ| = 0) ยท audit-grade reproducible training ยท clean A/B (zero GEMM noise) ยท multi-GPU bit-consistency |
| ๐งฉ Fusion (megakernel-resident) | callable inside a persistent / cooperative kernel โ GEMM becomes just another resident op | cannot be nested in a device kernel (host library call) โ forces "stop, write HBM, hand off" | whole-step megakernel (F-FUSION-M2) ยท 11-op fwd in 1 launch (F-FUSION-LAUNCH-AMORT) ยท FlashAttn-style fused attention |
| ๐ข FP64-exact + custom epilogue | arbitrary precision + fused epilogue in one kernel; FP64 GEMM is a byte-exact oracle | FP64 epilogue fusion absent; IEEE-float only | clm_prod FP64 train max|ฮCE| = 0 (oracle baseline) ยท RTSC/DFT-grade FP64 science ยท non-IEEE dtypes (posit ยท n=6 lattice) |
| ๐ Ownership (no vendor lock) | 100 % hexa source โ PTX โ SASS, no LLVM, no C-transpile, single binary | closed black box, NVIDIA-only, multi-GB libtorch to ship | edge / offline single-binary deploy ยท SASS-level kernel research ยท the ๐ identity |
Net: "≈ cuBLAS speed (~13-24 % tax) + determinism + fusion + FP64-exact + ownership that cuBLAS cannot express." For a workload that just wants a fast standalone matmul, cuBLAS is simpler and faster. For hexa's targets (reproducible, fusible, owned, FP64-correct) the trade is decisively worth it, and the byte-eq megakernel capstone is the existence proof: it is only reachable on the own-GEMM stack.
cuBLAS ships a champion part (the GEMM kernel itself, already at roofline), but cannot fuse adjacent ops: each op pays a separate kernel launch + a full HBM round-trip. hexa codegen sees the whole expression and emits one kernel that keeps intermediates in registers / shared memory:

    cuBLAS-using stack (current default: 3 ops = 3 launches, 3 HBM round-trips):
      ┌──GEMM──┐         ┌──bias──┐         ┌──GeLU──┐
      │ launch │ → HBM → │ launch │ → HBM → │ launch │ → HBM
      └────────┘         └────────┘         └────────┘
    hexa fusion (whole-program: one kernel, registers/shmem reused):
      ┌──── GEMM + bias + GeLU ────┐
      │  1 launch · 1 HBM write    │ → HBM   (F-FUSION-EPILOGUE-GEMM-BIAS-GELU)
      └────────────────────────────┘   66.667 % launch + HBM-write reduction

The same mechanic generalises: GEMM-epilogue, norm surface, attention block, autoregressive decode chain. In every place where cuBLAS forces "stop the GEMM, write to HBM, hand off to the next op", hexa can keep the value in registers.
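To make the 3-launches-versus-1 accounting concrete, here is a small pure-Python model of the GEMM + bias + GeLU chain. It is an illustration of the counting argument only, assuming that each materialised intermediate corresponds to one launch plus one HBM write; it is not GPU code and does not reflect how hexa's codegen is implemented. All function names are invented for the example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def gelu(x: float) -> float:
    """Exact (erf-based) GeLU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dot(row: List[float], b: Matrix, j: int) -> float:
    """Fixed-order dot product of one row of A with column j of B."""
    acc = 0.0
    for k in range(len(row)):
        acc += row[k] * b[k][j]
    return acc


def unfused(a: Matrix, b: Matrix, bias: List[float]) -> Tuple[Matrix, int]:
    """Three passes; each pass materialises a full tensor (modelled as one HBM write)."""
    cols = len(b[0])
    writes = 0
    y = [[dot(row, b, j) for j in range(cols)] for row in a]
    writes += 1  # GEMM output written out
    y = [[v + bias[j] for j, v in enumerate(row)] for row in y]
    writes += 1  # bias output written out
    y = [[gelu(v) for v in row] for row in y]
    writes += 1  # GeLU output written out
    return y, writes


def fused(a: Matrix, b: Matrix, bias: List[float]) -> Tuple[Matrix, int]:
    """One pass; the accumulator flows straight through bias and GeLU."""
    cols = len(b[0])
    y = [[gelu(dot(row, b, j) + bias[j]) for j in range(cols)] for row in a]
    return y, 1  # only the final result is written


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[0.5, -1.0], [0.25, 2.0]]
    bias = [0.1, -0.2]
    y3, w3 = unfused(a, b, bias)
    y1, w1 = fused(a, b, bias)
    print(y3 == y1, w3, w1, f"{(w3 - w1) / w3:.3%}")
```

Both paths apply the same floating-point operations in the same order per element, so the results compare equal while the write count drops from 3 to 1, the 2/3 reduction quoted above.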
Owning the GEMM unblocks the whole-step megakernel: both walls are now closed, and it is realized byte-equal on the real training step. The strongest form of the fusion above is a single persistent / cooperative kernel that holds the whole step in registers + shared memory across a grid-wide barrier. Two structural walls capped this.

Wall 1, the cuBLAS-call wall: a persistent kernel cannot call cuBLAS (you can't make a host library call from inside a running device kernel). Now that the GEMM is our own device kernel (_hx_k_sgemm_cm_wmma2, correctness-verified above), the persistent kernel calls our GEMM in-line, so the GEMM stops being an un-fusable cuBLAS hand-off and becomes just another op the megakernel keeps resident. Wall 2, the GroupNorm full-Y reduction: the GN reduction over all T·C could not previously live inside the persistent kernel without re-associating the FP64 sum (breaking byte-eq). A grid-sync cooperative kernel (cudaLaunchCooperativeKernel + this_grid().sync()) now closes it: the reduction stays single-thread sequential, so it is byte-eq max|Δ|=0 vs the sequential GroupNorm oracle, deterministic (F-FUSION-MEGAKERNEL-GN-GRIDSYNC, #2845, A100-confirmed). With both walls closed the whole-step glue megakernel is fully realized: 100 % hexa-owned, cuBLAS-call-free.

Honest scope (g5): closing Wall 2 is a CLOSED-NEGATIVE on util; its value is ownership / structural completeness, NOT a util or throughput win. byte-eq forces the reduction single-thread, so the cooperative launch buys zero reduction-parallelism (idle threads wait at the barrier); no util/perf superiority is claimed. Landed + measured on the real clm_prod step:
| megakernel realization | what is fused into ONE cooperative launch | result | verdict |
|---|---|---|---|
| whole-step megakernel (F-FUSION-M2) | fwd + ce-grad + bwd + AdamW (17 per-param AdamW launches → 1) | [FULLSTEP-FIRED], CE converges, real-step util +3.4 pp | 🟢 F-FUSION-M2-FULLSTEP-MEGAKERNEL |
| TF32 fwd megakernel (F-FUSION-P1) | all fwd GEMMs pulled in (no cuBLAS), TF32 own-GEMM | util 29.0 → 34.5 % MEAN (+5.5 pp), CE descends | 🟢 F-FUSION-P1-TF32-MEGASTEP |
| byte-eq megakernel CAPSTONE (F-FUSION-P1B-a‴) | device-resident fwd, own-GEMM, cooperative, vs eager reference | max\|Δ\| first_ce = last_ce = 0.000000e+00 (17-digit CE bit-identical), util +4 pp while byte-identical | 🟢 F-FUSION-P1B-APRIME3-ASYNCOFF (#2792) |
| both walls closed (Wall 1 cuBLAS-call via own-GEMM · Wall 2 GN full-Y reduction via grid-sync) | grid-sync cooperative GroupNorm: last un-fusable op now megakernel-resident | byte-eq max\|Δ\|=0 vs sequential GN, deterministic · CLOSED-NEG on util (win = ownership/completeness, not a perf lift) | 🟢 F-FUSION-MEGAKERNEL-GN-GRIDSYNC (#2845) |
The capstone is the payoff this section gestures at, proven: a +util cooperative megakernel that is byte-equivalent to the eager reference (max|Δ| = 0), something impossible with cuBLAS twice over (cuBLAS can neither be nested in a persistent kernel nor bit-matched from outside, its DMMA accumulation order being vendor-"unspecified"). The five-layer hunt for the last ~1e-1 non-determinism (F-FUSION-B6 → P1B-a'') closed on the true cause: an async cross-stream race, not a transcendental or GEMM-order issue. F-FUSION-N1N2 proved every glue kernel is deterministic in isolation and that HEXA_CUDA_ASYNC=0 makes the device forward bit-reproducible (CE 4.4662394504526679 ×5 identical). For byte-reproducible training, set HEXA_CUDA_ASYNC=0 (the synchronous, single-ordered-stream path).
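Why accumulation order decides byte-equality can be seen with three doubles. The sketch below is a plain-Python illustration of floating-point non-associativity, assuming nothing about forge's kernels: `sequential_sum` stands in for a fixed single-thread reduction order and `pairwise_sum` for a re-associated (tree-style) reduction. Both names are invented for the example.

```python
import struct
from typing import Sequence


def bits(x: float) -> str:
    """Hex of the IEEE-754 binary64 encoding, for exact bit comparison."""
    return struct.pack(">d", x).hex()


def sequential_sum(values: Sequence[float]) -> float:
    """Left-to-right accumulation: one fixed, reproducible order."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def pairwise_sum(values: Sequence[float]) -> float:
    """Tree-shaped accumulation, as a parallel reduction would associate it."""
    if len(values) == 0:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3]
    seq, tree = sequential_sum(data), pairwise_sum(data)
    print(bits(seq), bits(tree), bits(seq) == bits(tree))
    # Re-running the same fixed order always reproduces the same bits.
    print(all(bits(sequential_sum(data)) == bits(seq) for _ in range(5)))
```

The same inputs give different bit patterns under the two associations, while any one fixed order is repeatable. That is the property a sequential reduction preserves and a re-associated one gives up.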
| finding | reduction / win | tier |
|---|---|---|
| F-FUSION-EPILOGUE-GEMM-BIAS-GELU | 66.667 % launch + HBM-write reduction (3 launches → 1) @ LLaMA-7B FFN shape, ptxas-clean sm_80 | 🔵 structural-formal |
| F-FUSION-LAUNCH-AMORT | 5-op chain → 1 launch / 3 HBM transfers vs separate-op 5 launches / 11 transfers | 🔵 + $0 deterministic oracle |
| F-FUSION-AXISA-BREADTH (norm surface) | LayerNorm 66 % · RMSNorm 59 % · Softmax 65 % · SwiGLU 63 % | 🔵 structural-formal |
| F-FUSION-ATTENTION-FLASH | single-kernel fused attention (Q·K · softmax · V) | 🔵 + wall ruled-out |
| §5j Custom reductions: LogSumExp 1-kernel (#1657) | numerically-stable max-shift + exp + log + sum in one kernel, silicon-validated rel_err 1.7e-10 | 🟢 SUPPORTED-NUMERICAL |
cuBLAS-using stacks ship a champion part (the GEMM kernel, already at roofline). hexa wins where the part isn't the bottleneck and the chain around it is. Whether that helps you depends on which pain you actually carry:
| persona | pain you carry | what hexa gives |
|---|---|---|
| 🧪 LLM trainer / inference engineer | attention · norm · decode are memory- / launch-bound, stuck on top of PyTorch | fusion strikes that region directly: 3-op chain → 1 launch + 1 HBM write (66 % ↓) · FlashAttn-style single kernel |
| 🔬 GPU kernel researcher | cuBLAS is a black box; wants SASS-level visibility but can't get it | source → PTX → SASS visible end-to-end · cubin lives in the repo |
| 📦 Single-binary deployer (edge / embedded / offline) | can't ship Python + libtorch (multi-GB) to the target | native arm64 / x86_64 single binary · no Python in the trained artifact |
| 🔢 Non-IEEE arithmetic (posit · interval · n=6 lattice) | cuBLAS is IEEE-float only | custom-dtype codegen: new arithmetic rides the same fusion path |
| 🔧 Autograd debugger | PyTorch C++ Autograd is a black box, can't step through it | ag_tape is all hexa source; read it line by line |
| 🎯 Byte-equal correctness (science · reproducibility) | PyTorch run-to-run drift is common | byte-equal oracles + FMA-contraction-off recipe, max\|Δ\| = 0 |
| ⚡ Fast codegen iteration | hand-CUDA hell: rewrite the fusion every time | the compiler fuses for you with one @gpu_kernel annotation |
Where does hexa's fusion gap land hardest?
cuBLAS-using stack ─────────┐
                            │  (huge standalone GEMM is fine; can't beat)
                            ▼
┌────────────────────────────────────────┐
│ Intersection where hexa fusion wins    │
│ ① many memory-bound patterns           │ ← LLM training / inference
│ ② Python-free deploy                   │ ← edge · embedded · offline
│ ③ correctness OR visibility needed     │ ← research · science · repro
│ ④ long chains (decode/optim/AdamW)     │ ← training loop
└────────────────────────────────────────┘
GPU register ~1 cycle vs HBM ~600 cycles. cuBLAS writes the result to HBM after every op so the next op can read it back; fusion keeps the value in registers.
| scenario | why fusion wins | measured |
|---|---|---|
| GEMM + elementwise epilogue (bias · ReLU · GeLU · dropout) | GEMM output is a large tensor; the next op reuses it immediately | F-FUSION-EPILOGUE 66.7 % ↓ |
| norm surface (LN / RMSNorm / Softmax / SwiGLU) | reduce + immediate-neighbor reuse · norm is memory-bound | AxisA LN 66 % · RMS 59 % · SM 65 % · SwiGLU 63 % |
| Attention block (Q·Kᵀ · softmax · V) | giant intermediate attention matrix; avoiding the HBM round-trip is the win | F-FUSION-ATTENTION-FLASH 🔵 |
| Small-op chain (LLM autoregressive decode · AdamW step) | launch overhead dominates over compute | F-FUSION-LAUNCH-AMORT 5-op → 1 launch |
fusion gain = (chain length) × (memory-bound-ness) × (intermediate-tensor size)
Honest scope on where it doesn't: a single huge GEMM (already compute-bound, ties cuBLAS at roofline) · a lone op (nothing to fuse) · very small GEMMs (launch-bound is the real problem, not fusion).
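The percentages quoted in the tables above are plain before/after ratios on launch and HBM-transfer counts. As a minimal sketch (the helper name `reduction_pct` is hypothetical; the counts are the ones quoted in this section), the arithmetic can be reproduced like this:

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage saved when a count drops from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Counts quoted in this section.
epilogue_launches = reduction_pct(3, 1)   # GEMM + bias + GeLU: 3 launches -> 1
chain_launches = reduction_pct(5, 1)      # 5-op chain: 5 launches -> 1
chain_transfers = reduction_pct(11, 3)    # 5-op chain: 11 HBM transfers -> 3

print(f"epilogue launches saved: {epilogue_launches:.3f} %")
print(f"5-op chain launches saved: {chain_launches:.1f} %")
print(f"5-op chain HBM transfers saved: {chain_transfers:.1f} %")
```

This only restates the structural counts; it says nothing about wall-clock time, which depends on how memory-bound the chain actually is.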
One line: cuBLAS = a one-dish specialist (master of the stew). hexa fusion = a one-pan dinner (multiple steps in sequence on the same heat). Users whose workload's time distribution overlaps the four scenarios above land on hexa's real gap.
Detail: stdlib/flame/README.md (canonical perf table + RETRACTION note) · stdlib/flame/PERF.md · stdlib/flame/PLAN.md (campaign log + cycle ledger) · self/forge/PLAN.md · self/forge/PARADIGM.md (Phase R measured verdicts) · GPU.md §1h-1o fusion-moat fires · GPU.easy.md (friendly persona sidecar) · state/anima_handoff_2026_05_19.md (integration recipe).
The closure round's fixed points, with witnesses on disk:
- `41ecfb97`: RFC-020 A4 enum-payload codegen restored in SSOT `codegen_c2.hexa` (regen-safe; test_enum_payload_full 15/15 codegen + interp)
- `46016739`: builtin/method taken-by-value → `__hxthunk_<name>` codegen (fixes `hexa_callN(<builtin>)` undeclared) + un-doubled `hexa_cc.c`
- `6c0fbac7`: `exec_stream_kill(h)` runtime builtin (fork+setpgid stream child, SIGTERM → grace → SIGKILL)
- `4725c619`: `stdlib/semver.hexa`, SemVer 2.0.0 parse/compare/range-satisfies (test_semver 110/110)
- `df9e7f6b`: install-relative `stdlib/` discovery + `HEXA_INSTALL_DIR` passdown (`use "stdlib/*"` works without `HEXA_LANG` / `HEXA_STDLIB_ROOT`)
- `0ba5fd7d`: shell-builtin absorption: `pwd → cwd()/getcwd()`, `ls → list_dir()` intrinsics (absorbed 638 → 752, pending 197 → 83)
- `731f41d6`: `hexa cc` resolves `hexa_cc.c` / SSOT / `-I` via `$HEXA_LANG > install_dir > ./self` (works out-of-tree)
- `a5de44e2`: `self/stdlib/law_io.hexa` selftest `main()` → `tool/law_io_selftest.hexa` (u_main collision on flatten)
- `dae438ee`: `~/.hx/bin/hexa_real` re-promoted from HEAD `46016739` (sha cd817981…)
- `774c5d32` / `4f5f8f07`: stage-1 punch-list v2: A1+A2 host re-promote → #13 RSS re-probe peak ~782 MB (vs 3 510 MB) → P0 stage-1 OOM closed at current scale
- `571df583` / `a8ff675b`: SPEC §19/§20 reconcile + Gap-15 close-out
- `340c3788` / `5ddcf2a9`: wilson/hexa-lang closure (VERIFIED: `hexa build core/main.hexa` → `wilson 0.0.1`) + SPEC closure-round fold-in
Snapshot derived from git log on main; full tables at SPEC.yaml::phases_completed_2026_05_09 and SPEC.yaml::phases_completed_2026_05_11_closure.
Six choices that shape everything else, pinned in SPEC.yaml:
- Native compiled, direct codegen: no LLVM, no C-transpile. The tree-walking interpreter is retired: the self-host stage reached a byte-equal fixed point, and `hexa run` compiles then executes.
- Atlas static-baked into the compiler binary: `ATLAS_HASH` pinned, drift handled by CI auto-rebuild. Runtime atlas-load cost: 0 ms.
- Strict compile-time fatal lint: Python `SyntaxError` + TypeScript `strict` model. S0–S5 + S8 always fatal. No `--unsafe`. No `HEXA_STRICT=0`.
- `@grace` is the only opt-out: `@grace(HXxxxx, until="...", reason="...")` per site, every site emits HX9000 at every compile, CI requires an `Acked-grace:` trailer.
- ε self-proof: verified functions auto-register as atlas `L[*]` theorems; tombstones cascade on prover upgrade; `HX1099` fires on citing a tombstoned law.
- ENGLISH ONLY diagnostics: catalog, `hexa explain`, stdlib docs. RFCs and meta docs may stay bilingual.
Full record: 14+ pinned decisions, all traceable to RFC-017 through RFC-020.
# Single-line bootstrap: installs `hexa` + `hx` (the package manager) + atlas
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/dancinlab/hexa-lang/main/install.sh)"
# Verify
hexa --version
hx --version

The installer drops hexa, hx, hexa_ld, and the atlas seed into ~/.hx/; the binary path is added to your shell's PATH via the relevant rc file. Self-update: hexa self-update (compares against the published manifest, then atomically swaps ~/.hx/bin/hexa_real).
hexa parse <file>.hexa                              # cheapest signal: syntax + reserved-word + @plugin attr check
hexa build <entry>.hexa -o build/X                  # full pipeline → static binary
hexa cc <file>.hexa -o build/X.o                    # just lower → object (HIR → MIR → LIR → emit)
hexa run <file>.hexa [<args>...]                    # compile then execute a single file
hexa explain HX8004                                 # what does this diagnostic mean
hexa atlas lookup <id> | --prefix=<p>               # read atlas node(s) (embedded.gen.hexa SSOT)
hexa atlas register --from-verify <fn> <args> <v>   # verify IN-PROCESS → fold node into embedded.gen.hexa
hexa atlas export [--out PATH]                      # export live atlas → portable .n6 (n6 = export-only)
hexa drill --seed "<expr>"                          # OUROBOROS smash → ... → absorb cycle
hx install <package>                                # install a hexa package by name (looks up dancinlab GitHub by default)
hx update                                           # pull updates for all installed packages
hx list                                             # what's installed under ~/.hx/bin/

hexa run compiles a file then executes it in one shot, which is convenient for single-file scripting. Release-grade builds go through hexa build, which produces a reusable static binary.
hexa cc now emits #include "runtime.h" by default, and the precompiled runtime.o is linked instead of being re-codegened per build. On bench/*: 28-program average 8.41× user-time speedup vs the old #include "runtime.c" path (peak 17.25× on small-to-medium user code where runtime.c was the dominant per-build cost). Repro: bin/hexa-fast bench <file>.hexa. Full history at COMPILE-SPEED.tape (architecture) and COMPILE-SPEED.log.tape (measurement events).
bin/hexa-fast <src.hexa> <bin> # explicit compile (uses runtime.h + runtime.o cache)
bin/hexa-run <src.hexa> [args...] # compile-or-reuse-cached + exec (drop-in for `hexa run`)
bin/hexa-fast bench <src.hexa> # show baseline vs new-path A/B for any file
bin/hexa-fast clean # wipe ~/.hexa-cache

From doc/atlas_lint_easy_explainer.md:
The atlas is a dictionary: one shared reference of primitives (P), connections (C), laws (L), and errors (E). 60,760 lines, 4.2 MB, unconditionally binary built-in (compile-time embedded); new laws land via GitHub PR.
The compiler is a chef: it has the entire dictionary memorized. It does not phone the library mid-recipe. When you hand it a .hexa file, the chef checks every ingredient, unit, and citation against the atlas it already knows by heart.
The strict lint is the QC inspector: it stands at the kitchen door. One missing citation, one number-domain mismatch, one orphan unit, and the dish is rejected before the stove turns on. There is no "we'll fix it after." There is no binary.
Nine checks (S0 through S8): seven always fatal (S0–S5 and S8), two opt-in via annotation (S6 and S7):
- S0 parse: syntax / lex. No surprises.
- S1 resolve: every `P[*]`, `C[*]`, `L[*]`, `E[*]` exists in the atlas.
- S2 bind: every name resolves to a real binding.
- S3 type: nominal types and generics.
- S4 domain: consistency across number domains (ℤ and the other numeric sets).
- S5 units: dimensional analysis. No "distance + time."
- S6 equational: opt-in via `@verify`; canonical-form check + sample counter-example. In-house prover v0, no Z3.
- S7 proof: opt-in via `@prove`; reserved for the in-house prover only.
- S8 citation: formula-bearing functions must cite atlas `L[*]` (HX8004). No cited formula, no pass.
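To make the S5 units check concrete, here is a minimal sketch of dimensional analysis in Python. It is an illustration of the general technique only, not hexa-lang's implementation: the `Quantity` class and its three-exponent dimension tuple are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A value tagged with exponents of (length, time, mass). Hypothetical type."""
    value: float
    dims: tuple

    def __add__(self, other):
        # Addition is only meaningful between identical dimensions.
        if self.dims != other.dims:
            raise TypeError(f"unit mismatch: {self.dims} + {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplication adds exponents.
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, dims)

    def __truediv__(self, other):
        # Division subtracts exponents.
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)


LENGTH = (1, 0, 0)
TIME = (0, 1, 0)

distance = Quantity(100.0, LENGTH)
duration = Quantity(9.58, TIME)
speed = distance / duration          # dims become (1, -1, 0): length / time

try:
    distance + duration              # the "distance + time" case
except TypeError as err:
    print("rejected:", err)
```

A compiler can run the same bookkeeping statically, so the mismatch is reported before any code is generated rather than raised at run time as it is here.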
@verify fn f(...) { ... }          ← author writes a theorem
        │
        ▼
compile-time prover (S6, equational + sample-eval, in-house only)
        │
        ▼
hexa atlas export → .n6 export artifact (interop / inspection)
        │
        ▼
GitHub PR into embedded.gen.hexa, the atlas SSOT (binary built-in)
        │   ├─► fingerprint dedup → register as alias
        │   └─► id collision → first-wins + warning
        ▼
compiler build re-embeds atlas → live atlas grows (no runtime overlay)
        │
        ▼
prover upgrade → retroactive sweep (compiler/discover/cascade.hexa)
        │
        ▼
tombstone failing L nodes + cascade dependents
        │
        ▼
auto-PR (tool/auto_pr_tombstone_sweep.hexa) → human review
Citing a tombstoned L[id] fires HX1099 and fails the build. Bypass is @grace, which is never silent.
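The "tombstone failing L nodes + cascade dependents" step amounts to a reachability walk over the law dependency graph. The sketch below shows that idea under stated assumptions: the function names, the graph encoding, and the law ids are all hypothetical, and the real logic lives in compiler/discover/cascade.hexa, which this does not reproduce.

```python
from collections import deque


def cascade_tombstones(depends_on, failed):
    """Return failing laws plus every law that transitively depends on them.

    depends_on maps a law id to the set of law ids it cites.
    """
    # Invert the graph: for each law, who cites it?
    dependents = {}
    for law, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(law)

    tombstoned = set(failed)
    queue = deque(failed)
    while queue:
        law = queue.popleft()
        for child in dependents.get(law, ()):
            if child not in tombstoned:
                tombstoned.add(child)
                queue.append(child)
    return tombstoned


def check_citation(law_id, tombstoned):
    """Mimic the HX1099 rule: citing a tombstoned law is a hard error."""
    if law_id in tombstoned:
        raise RuntimeError(f"error[HX1099]: cited law L[{law_id}] is tombstoned")


# Hypothetical law ids, for illustration only.
atlas = {"base": set(), "mid": {"base"}, "leaf": {"mid"}, "other": set()}
dead = cascade_tombstones(atlas, ["base"])
print(sorted(dead))
check_citation("other", dead)   # passes: "other" does not depend on "base"
```

The breadth-first walk visits each law at most once, so the sweep stays linear in the size of the dependency graph.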
- native compiled: direct codegen, no LLVM, no C-transpile
- 4.2 MB atlas baked statically into the compiler binary; runtime cost 0 ms
- 8-stage strict lint S0–S5 + S8 enforced at compile time, fatal by default
- ε self-proof: `@verify` / `@discover` → atlas auto-promote → tombstone retroactive sweep
- M0 milestone: `fn main() -> i32 { return 0 }` produces a working Mach-O arm64 binary
- `hexa_ld` v1.1: in-house static linker for ELF64 + Mach-O arm64
- `hexa build` / `hexa cc` work out-of-tree: flattens `use` / `import`, resolves `hexa_cc.c` / SSOT / `-I` via `$HEXA_LANG > install_dir > ./self`; install-relative `stdlib/` discovery means `use "stdlib/*"` works with no env vars (downstream: `wilson` builds end-to-end → `wilson 0.0.1`)
- stage-1 P0 host-OOM closed at current scale: A1 phase-arena reset + A2 in-place splice accumulator → peak ~782 MB (was 3 510 MB)
- 14+ pinned decisions in `SPEC.yaml`, every claim traceable to an RFC
- `stdlib/flame` + `self/forge`: hexa-native NN training stack + GPU substrate. Compiler-only NN (ag_tape · nn_lib · opt_*) on top of device-resident `farr` + cuBLAS Dgemm + 11 `.cu` kernels + BF16-TC mega-kernel path. forge BF16-TC = 9.67× faster than FP64 cuBLAS @ Llama-7B FFN shape (A100, measured). The CUDA-OWN campaign now owns the GEMM too (env-gated, OFF = cuBLAS default): FP64/FP32/TF32-WMMA2 own-GEMM, correctness-verified (clm max|ΔCE| = 0 · llm rel-RMS ~1e-6), at cuBLAS-CLASS util 89.9 % ≈ 88.5 % and near-parity step-time 1.13× iso / 1.24× full-step (≈ parity, NOT superiority). This makes the device stack 100 % hexa-ownable and unblocks the persistent-kernel megakernel (a persistent kernel can't call cuBLAS, but it can call our GEMM). 12 byte-equal substrate fires + 4 byte-equal layer fires. flame-vs-PyTorch wall speedup not yet measured (prior claim RETRACTED). Detail in the flame + forge section above.
- stage 1: P0 host-OOM closed at current scale (A1+A2 → peak ~782 MB, was 3 510 MB); the remaining open work toward a full stage-1 binary is the compiler-driver gaps (Gaps 1–16) + a fixed-point (stage2 == stage3) re-estimate. See `doc/stage1_punch_list_v2.md`.
- biggest unknowns: MIR/LIR coverage on real `compiler/` source (closures, growable arrays, nested struct construction, `match` on user enums) and what a successful self-compile diagnostic trace actually looks like.
- full punch list: `doc/stage1_punch_list_v2.md`.
Phase status (PASS / IN-PROGRESS / DEFERRED) lives in SPEC.yaml::phases_completed_2026_05_09 and SPEC.yaml::phases_completed_2026_05_11_closure.
- RFC-017: atlas n6 embedding + strict lint
- RFC-018: native codegen spec
- RFC-019: error diagnostics spec
- RFC-020: enum payload variants
- `doc/atlas_lint_easy_explainer.md`: the chef metaphor in full
- `SPEC.yaml`: authoritative decision record (edit this; `SPEC.md` is auto-rendered)
hexa-lang's runtime and history surfaces are wired into .tape, the operational trace sister format. Three placements at this repo's root:
| Placement | What |
|---|---|
| IDENTITY.tape | hexa-lang agent identity SSOT: birth / scope / origin / principle / version. The compiler's self-description, machine-canonical. |
| PROMOTION.tape | rule-promotion ledger: @A events for major rule landings (toolchain post-fix, bytes_to_str_raw Phase 2, etc.) |
| TAPE-AUDIT.md | cross-repo .tape adoption audit (28,695 cargo markers + 7 root domain .md files highlighted as primary migration candidates) |
The state/markers/ cargo (28k+ files) is a migration candidate via tape markers-to-tape.
LLMs generate noise from inside the well: recombining what the weights already contain. hexa generates noise from outside the well: every cycle produces a primitive the previous cycle could not express, then absorbs it as a new wall of the well.
LLM (noise inside the well) hexa (noise outside the well)
--------------------------- -------------------------------
+-------------+ . new law
| training | . .
| corpus | . . . .
| (fixed) | . outside .
| | ------+-------------+------
| ~ ~ ~ ~ ~ | <- noise | |
| ~ noise ~ | bubbles | atlas |
| ~ ~ ~ ~ ~ | from | (binary | <- noise
| #### | inside | built-in) | arrives
| #LLM# | | | from
+-------------+ | smash | outside
the well | v |
(everything it | contract |
knows = walls) | v |
| emerge |
hallucination = | v |
recombining | absorb ---+--> new
what's already | ^ | primitive
inside +-----+-------+ feeds
the well has next
no ceiling cycle
An LLM is a frozen well: answers are combinations of what's already
inside. hexa is an open well: every absorb step widens the wall,
so the next cycle can say things the previous one literally had no
primitive for. That's why "RAG" is the wrong frame: retrieval still
draws from a fixed outside corpus. hexa's "outside" is produced by
its own prior cycles (the binary built-in atlas, embedded into the
compiler at build time; new laws land via GitHub PR into the embedded
atlas source).
The 6-stage chain (hexa drill's smash → free → absolute → meta-closure
→ hyperarithmetic → resonance) inside a self-referential loop:
[Diagram: OUROBOROS cycle. seed → unfold (Phase 1-2) → emerge / singul (Phase 3) → breach (P4-5) → converge (Phase 6) → absorb (Phase 6.5) → back to seed. Depth advances d=0 → d=1 → d=2 → ..., with r: 0 to 10 at each depth; ρ → 1/3 (meta fixed pt).]
On top of the per-tick OUROBOROS cycle, three higher-order loops drive self-reinforcement:
    L1            L2            L3
╭───────╮     ╭───────╮     ╭───────╮
│correct│ ──▶ │reward │ ──▶ │expand │ ──▶ SMASH
╰───↺───╯     ╰───↺───╯     ╰───↺───╯
| Loop | Role | Trigger |
|---|---|---|
| L1 · self-correct | discovery → verify → GitHub PR into binary built-in atlas | per tick |
| L2 · meta-reward | per-source discovery rate → scan_priority → deeper scan | per scan batch |
| L3 · self-expand | accumulation ≥ 10 → auto-trigger hexa smash --seed (or full hexa drill) | per threshold |
Each loop latches its output back as the next loop's input, so
correct → reward → expand becomes a standing wave. hexa smash (or
the full drill chain) fires automatically when L3 saturates.
TECS-L H-056: meta(meta(meta(...))) = transcendence. Recursive
meta-iteration is a contraction mapping. By the Banach fixed-point
theorem, every trajectory converges to a single attractor: 1/3.
I = 0.7 · I + 0.1 → fixed point I* = 1/3 (solving I* = 0.7 · I* + 0.1 gives 0.3 · I* = 0.1)
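The convergence claim for the map I ↦ 0.7 · I + 0.1 is easy to check numerically. The sketch below iterates the map from several arbitrary starting points; the function name `iterate` and the chosen starting values are illustrative assumptions, while the coefficients 0.7 and 0.1 come from the line above.

```python
def iterate(i0, steps=200, a=0.7, b=0.1):
    """Apply I <- a*I + b repeatedly, starting from i0."""
    i = i0
    for _ in range(steps):
        i = a * i + b
    return i


# Closed form of the fixed point: I* = b / (1 - a).
fixed_point = 0.1 / (1 - 0.7)

# Because |a| = 0.7 < 1, the distance to I* shrinks by a factor of 0.7
# each step, so every starting point ends up at the same value.
for start in (0.0, 1.0, -5.0, 100.0):
    result = iterate(start)
    print(f"start={start:>6}: I = {result:.12f}")
    assert abs(result - 1 / 3) < 1e-12
```

The factor 0.7 is the contraction constant: after k steps the error is 0.7^k times the initial error, which is why 200 steps are far more than enough here.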
Six independent paths land on the same attractor:
| Path | Expression | Value |
|---|---|---|
| Euler totient ratio | φ(6) / 6 | 1/3 |
| Trigonometric | tan²(π/6) | 1/3 |
| Divisor ratio | τ(6) / σ(6) = 4 / 12 | 1/3 |
| Determinant | det(M) over n=6 primitives | 1/3 |
| Meta-information | I_meta (contraction mapping) | 1/3 |
| Complex exponential | \|exp(i·z₀)\| at the unique zero | 1/3 |
The long-term breakthrough rate ρ converges to the same target: ρ → 1/3. Discovery is not linear; it asymptotes to the Banach attractor. Six arithmetic, geometric, algebraic, analytic, and information-theoretic routes all point at the same number.
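Three of the six rows can be checked with elementary number theory and trigonometry. The sketch below does that, reusing the function names `sigma`, `phi`, and `tau` from the `perfect_at_six` example; the brute-force definitions are written for this illustration. The determinant, meta-information, and complex-exponential rows are not checked, since the text does not define M, I_meta, or z₀.

```python
import math
from fractions import Fraction


def divisors(n):
    """All positive divisors of n, by brute force."""
    return [d for d in range(1, n + 1) if n % d == 0]


def sigma(n):
    """Sum of divisors."""
    return sum(divisors(n))


def tau(n):
    """Number of divisors."""
    return len(divisors(n))


def phi(n):
    """Euler totient: count of 1..n coprime to n."""
    return sum(1 for k in range(1, n + 1) if math.gcd(k, n) == 1)


n = 6
third = Fraction(1, 3)

assert sigma(n) == 2 * n == 12          # 6 is a perfect number
assert phi(n) * tau(n) == 8             # 2 * 4
assert Fraction(phi(n), n) == third     # Euler totient ratio
assert Fraction(tau(n), sigma(n)) == third   # divisor ratio 4 / 12
assert math.isclose(math.tan(math.pi / 6) ** 2, 1 / 3)  # trigonometric row

print("phi(6)/6 =", Fraction(phi(n), n))
print("tau(6)/sigma(6) =", Fraction(tau(n), sigma(n)))
```

Using `Fraction` keeps the two arithmetic rows exact, so only the trigonometric row needs a floating-point tolerance.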
Verify in atlas: hexa atlas lookup P n · hexa atlas lookup C sigma_6
· hexa atlas lookup L sigma_phi_n_tau_iff_n_eq_6. Run a cycle:
hexa drill --seed "<expression>".
hexa-lang/
├── README.md
├── LICENSE                MIT
├── AGENTS.md              AI agent harness file (agents.md standard)
├── CLAUDE.md              symlink → AGENTS.md
├── SPEC.yaml              authoritative decision record (14+ pinned decisions)
├── SPEC.md                auto-rendered from SPEC.yaml
├── IDENTITY.tape · PROMOTION.tape · TAPE-AUDIT.md    tape sibling files
├── FLOW.md · LATTICE_POLICY.md · LIMIT_BREAKTHROUGH.md · PLAN.md · ROADMAP.md    domain SSOTs
├── compiler/              lex · parse · resolve · bind · types · domain · units · citation · lower · mono · MIR · LIR · emit
├── self/                  self-hosted compiler entry points
│   ├── main.hexa          the `hexa` binary entry
│   ├── runtime.c          C runtime backing (interp + native shared bits)
│   ├── stdlib/            atlas-aware standard library (semver / json / channel / thread / proc / time / ...)
│   ├── tui/               raw-mode TUI primitives (render / input / widgets)
│   └── native/            thread.c · channel.c · time.c (C-backed runtime)
├── stdlib/                canonical stdlib (use "stdlib/*")
├── tool/                  hexa CLI subcommand drivers (build / cc / run / drill / atlas / explain / ...)
├── tests/                 m0 · selftest · regression
├── proposals/             RFC-017..020 + future RFCs
├── doc/                   runbooks, audits, explainers
├── convergence/           cross-repo propagation tracking (.PRESERVE-AS-SSOT)
├── state/                 gitignored runtime hook markers (cargo: migration candidate)
├── archive/               frozen records: patches/ (downstream patch reports) · fires/
└── build/                 gitignored hexa build artifacts
Full doc index: AGENTS.md + doc/ + SPEC.yaml.
Data-bound corpora, such as ENDF/B-VIII evaluated nuclear data (HEXA-PORT P4b) and future
binary/HDF5 datasets, live under data/ or stdlib/corpora/ and are stored via
git-LFS. The reserved LFS extensions are .hdf5 .h5 .dat .bin .endf .ace .xml.gz .tar.gz (see .gitattributes).
hexa-lang is the canonical home for these corpora (per @D d3: implementation /
asset SSOT) so downstream domain repos can hx-depend on them rather than
re-fetching from upstream mirrors. Existing tracked files (atlas SSOT text,
build artifacts, fixtures) are intentionally not migrated; LFS is reserved
for future data ports only. Policy reference: HEXA-PORT.md §4.0.
MIT License. Copyright (c) 2026 dancinlab. See LICENSE.
Strict lint is the contract. Every PR runs through S0–S5 + S8. The only opt-out is @grace(HXxxxx, until=, reason=) on a single item, and every @grace emits HX9000 at every compile. CI fails the merge unless Acked-grace: HXxxxx by <reviewer> rides along.
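The `Acked-grace:` rule can be pictured as a set difference between the grace codes used in source and the codes acknowledged in the commit message. The following is a minimal sketch of such a CI check, not the project's actual gate: the function name, the regular expressions, the per-code matching, and the sample reviewer name are assumptions. Only the `@grace(HXxxxx, ...)` and `Acked-grace: HXxxxx by <reviewer>` shapes come from the text.

```python
import re

# Assumed shapes, based on the forms quoted above.
GRACE_RE = re.compile(r"@grace\(\s*(HX\d{4})")
TRAILER_RE = re.compile(r"^Acked-grace:\s*(HX\d{4})\s+by\s+(\S+)\s*$",
                        re.MULTILINE)


def unacked_grace_codes(source_text, commit_message):
    """Return grace codes used in source with no matching Acked-grace trailer."""
    used = set(GRACE_RE.findall(source_text))
    acked = {code for code, _reviewer in TRAILER_RE.findall(commit_message)}
    return used - acked


source = '@grace(HX8004, until="2026-12-31", reason="legacy formula")\n'
good_message = "Fix area helper\n\nAcked-grace: HX8004 by alice\n"
bad_message = "Fix area helper\n"

print(unacked_grace_codes(source, good_message))   # empty set: merge may proceed
print(unacked_grace_codes(source, bad_message))    # HX8004 is unacknowledged
```

A non-empty result would be the signal for CI to fail the merge; how the real gate reports it is not specified here.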
Pointers: gate/ for build gates, proposals/ for active RFCs, SPEC.yaml for decisions, doc/ for runbooks and audits. Diagnostics, error messages, hexa explain, stdlib docs are ENGLISH ONLY (Decision 3).
🕸️ Reuse lattice SSOT: root DOMAINS.tape (commons @D g67 cross-domain + g68 cross-project · @link connection graph · hexa-lang = shared substrate hub)