dancinlab/hexa-lang

💎 hexa-lang

Native compiler with atlas-bound theorems: strict-lint · citation-enforced · no LLVM · no C-transpile

License CI DOI Phase M0 Atlas Sibling

Atlas-bound · strict-lint · 8-stage gate · ε self-proof · n=6 perfect-number primitives · self-hosted


hexa-lang is a native compiler that carries its own theorem dictionary inside the binary. No LLVM. No C-transpile. Every formula in your code either cites the atlas or the build refuses to start. The stricter the gate, the cleaner the code that passes.

Note

Sister of n6 (semantic atom layer; atlas serialisation format), hxc (byte-canonical wire), and tape (operational trace). hexa-lang's atlas is unconditionally binary built-in (compile-time embedded into the compiler), and .n6 is the sister serialisation format emitted on demand by hexa atlas export for interop / inspection. Discovered laws are absorbed via GitHub PR directly into the embedded atlas, not through a runtime .n6 overlay. The wilson agent (dancinlab/wilson) is built end-to-end on hexa-lang.

At a glance

@cite(L[sigma_phi_n_tau_iff_n_eq_6])
fn perfect_at_six() -> bool {
    let n = 6
    return sigma(n) == 2 * n          // σ(6) = 12 = 2·6
        && phi(n) * tau(n) == 8       // φ(6)·τ(6) = 2·4 = 8
}

// Missing citation = HX8004 fatal at compile time:
//
//   error[HX8004]: formula-bearing function does not cite atlas L[*]
//     --> src/foo.hexa:14:1
//      |
//   14 | fn area_of_circle(r: f64) -> f64 {
//      | ^^^^^^^^^^^^^^^^^ formula here
//      = note: cite an atlas law via `@cite(L[id])` or declare `@grace(HX8004, until=, reason=)`
//      = help:  hexa atlas search "πr²"   →  L[circle_area]

The compiler stays parked unless every formula either cites the atlas, has an active @verify, or carries an explicit @grace. There is no "we'll fix it after." There is no binary.
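The arithmetic in the snippet above can be checked independently of the compiler. The following is a minimal Python sketch, not hexa code. All helper names are illustrative, and reading the law id `sigma_phi_n_tau_iff_n_eq_6` as "σ(n)·φ(n) = n·τ(n) holds exactly at n = 6 (for n > 1)" is an assumption drawn from the name alone.

```python
from math import gcd


def divisors(n):
    """All positive divisors of n (naive scan, fine for small n)."""
    return [d for d in range(1, n + 1) if n % d == 0]


def sigma(n):
    """Sum of the divisors of n."""
    return sum(divisors(n))


def tau(n):
    """Number of divisors of n."""
    return len(divisors(n))


def phi(n):
    """Euler's totient: how many 1 <= k <= n are coprime to n."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def perfect_at_six():
    """Python mirror of the two checks in the hexa snippet."""
    n = 6
    return sigma(n) == 2 * n and phi(n) * tau(n) == 8


def sigma_phi_eq_n_tau(limit):
    """All n in [2, limit] with sigma(n) * phi(n) == n * tau(n).

    Assumed reading of the law id; the atlas entry itself is not shown here.
    """
    return [n for n in range(2, limit + 1) if sigma(n) * phi(n) == n * tau(n)]
```

Under that assumed reading, scanning a few hundred integers gives a quick sanity check that 6 is the only hit in the range.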

Why hexa-lang

LLMs answer by recombining what their weights already contain: noise from inside a frozen well. hexa-lang generates from outside the well: every compile cycle produces a primitive the previous cycle could not express, then absorbs it as a new wall (@verify → atlas promote → tombstone retroactive sweep). The atlas grows; hallucination is mechanically excluded because every claim must trace to a citation.

The second pillar is enforcement at the build gate, not at runtime. After the S0 parse, eight strict-lint stages (S1 resolve → S2 bind → S3 type → S4 domain → S5 units → S6 equational @verify → S7 proof @prove → S8 citation HX8004) reject formula-bearing code that doesn't cite. No annotations means no formula. A formula in a non-cited function means a hard error.

Third: n=6 perfect-number primitives. The compiler is a chef with a 4.2 MB atlas baked statically into the binary: 60,760 lines of P (primitives) / C (constants) / L (laws) / E (errors). Citing L[sigma_phi_n_tau_iff_n_eq_6] is one keystroke; if the law is wrong, every dependent gets a tombstone cascade with an auto-PR.

Pipeline

   .hexa source
        │
        ▼
   lex ─► parse ─► resolve ─► bind ─► types ─► domain ─► units ─► citation
                    (S1)      (S2)    (S3)     (S4)     (S5)      (S8)
        │                                                            │
        │                  any fatal stage → no binary               │
        ▼                                                            ▼
   lower (HIR) ─► mono ─► MIR (SSA) ─► optimize ─► regalloc (LIR) ─► emit (asm)
        │                                                            │
        ▼                                                            ▼
                                  hexa_ld v1.1
                          ELF64 + Mach-O arm64 static
                                       │
                                       ▼
                                 native binary

A binary appears only when every fatal stage passes. The atlas (4.2 MB) is baked in at compile time; runtime cost: 0 ms.
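The "any fatal stage means no binary" rule can be pictured as a short-circuiting pipeline. The sketch below is a toy Python model, not the compiler's implementation: the `Function` record, the `citation_stage` check, and the `build` driver are all hypothetical names, and only the HX8004 code and the cite / verify / grace escape hatches come from the text above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Function:
    name: str
    has_formula: bool
    annotations: Tuple[str, ...] = ()  # e.g. ("cite",), ("verify",), ("grace",)


def citation_stage(functions):
    """Toy stand-in for the citation stage: return an error string or None."""
    allowed = {"cite", "verify", "grace"}
    for f in functions:
        if f.has_formula and not (set(f.annotations) & allowed):
            return f"error[HX8004]: {f.name} does not cite atlas L[*]"
    return None


def build(functions, stages):
    """Run the stages in order; the first fatal error stops the build."""
    for stage in stages:
        err = stage(functions)
        if err is not None:
            return None, err  # no binary is produced
    return "native binary", None
```

A cited function passes the toy gate, while an uncited formula-bearing function yields no artifact at all, which is the behaviour the pipeline diagram describes.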


🔥 flame · 🔧 forge · ⚡ hexa-cuda: the hexa GPU stack (train · substrate · kernel-authoring)

stdlib/flame is what you build with hexa-lang: a compiler-only neural-network training stdlib (autograd tape · layers · optimizers · tensor primitives) lowered through the same 8-stage strict-lint gate that compiles the compiler itself. No PyTorch wrapping, no ATen import, no Python in the trained binary.

self/forge is what flame calls into: a GPU substrate that pairs device-resident hexa arrays (farr) with vendor-grade kernels (cuBLAS Dgemm + 11 hand-emit .cu kernels covering the elementwise / reduction / norm surface) under a byte-equal correctness contract, plus a BF16 Tensor-Core "mega-kernel" path (RFC 049/060) for the in-kernel-GEMM regime where vendor libs are unreachable.

@gpu_kernel → nvptx (hexa-cuda) is how you author a GPU kernel without leaving hexa: annotate a function @gpu_kernel, write it with the device intrinsics (gpu_thread_id_x · @shared let · gpu_barrier · gpu_atomic_add · gpu_warp_shuffle), and hexa build --target=nvptx emits ptxas-clean PTX for sm_80 / sm_90, with no .cu, no nvcc, and no CUDA-C transpile (silicon-proven: vec-add / saxpy run bit-exact on a native H100). It is the kernel-authoring primitive that forge's own device kernels and your own custom kernels both share; you practice it in hexa dojo.

The three pillars (flame:forge :: torch:ATen, with hexa-cuda as the kernel-authoring leg both rest on):

              hexa source (.hexa)
                     │
   ┌─────────────────┴───────────────────┐
   │ 🔥 flame - NN training stdlib       │   ← what you TRAIN with
   │   t_* tensor · ag_tape autograd     │     (no Python in the binary)
   │   nn_lib layers · opt_* optimizer   │
   └─────────────────┬───────────────────┘
                     │  rides
                     ▼
   ┌─────────────────────────────────────┐
   │ 🔧 forge - GPU substrate            │   ← what flame CALLS INTO
   │   farr device array · own-GEMM      │     cuBLAS Dgemm + 11 .cu
   │   BF16-TC mega-kernel               │     RFC 040/041/049/060
   └─────────────────┬───────────────────┘
                     │  device kernels authored in
                     ▼
   ┌─────────────────────────────────────┐
   │ ⚡ hexa-cuda - @gpu_kernel → nvptx  │   ← how you WRITE a GPU kernel
   │   gpu_thread_id · @shared · barrier │     hexa → PTX → sm_80 / sm_90
   │   no .cu · no nvcc · compiler emits │     practice: `hexa dojo`
   └─────────────────┬───────────────────┘
                     │   hexa build (8-stage strict-lint gate)
                     ▼
              A100 / H100 native

Correctness: byte-equal oracles (max|Δ| = 0, FMA-contraction-off recipe)

| layer | scope | measurement |
| --- | --- | --- |
| forge substrate | RFC 040 device-farr + cuBLAS Dgemm · RFC 041 11-op .cu | 12 byte-equal fires across the elementwise / reduce / GEMM surface, max\|Δ\| = 0 |
| flame layers | rmsnorm · attn-fwd · attn-bwd · silu-gate | 4 byte-equal oracle fires, max\|Δ\| = 0 |
| flame ag_tape | generic autograd through the same oracles | derivation byte-equal, abstraction pays no correctness tax |

Performance: measured (g3 / LATTICE_POLICY: real fires, falsifier-gated, no fabrication)

| path | measurement | note |
| --- | --- | --- |
| forge BF16-TC mega-kernel (RFC 049 Stage 1, A100) | 9.67× faster than FP64 cuBLAS @ Llama-7B FFN shape | $0.10 fire · paradigm verdict from Phase R 14-fire $2.91 campaign |
| forge Phase R / RFC 060 closure | FP64 mega-kernel KILLED (1.8-4.4× slower than per-op) · BF16 substrate PASS | RFC 060 100% closure · BF16-TC is the cuBLAS-relative wall path |
| flame ag_tape d=768 · 12-layer (A100) | per-step wall recorded · PyTorch wall speedup NOT measured | prior README "2.95× / 1.26-1.76× faster than PyTorch eager" was a unit mismatch (full-run / 1-step); RETRACTED per stdlib/flame/README.md correction 2026-05-19 |
| flame batch-fill SM-fill (CLMConvMoE, H100) | ≥1.3× @B=2, 2.95× @B=32 self-speedup (byte-eq B=1 max\|Δ\|=0; B>1 causal-conv seam-only Δ) | batch FILLS the SMs (B=1 under-fills, util 1-2%); capped ~3× by the interpreted per-step glue (token-pack + CE-grad + AdamW ∝ B·Tw); the ~3× cap is STRUCTURAL (serial un-fused FP64 op-DAG); uncap = precision-change OR right-sized GPU, NOT interpreter (#2915) |
| flame vs PyTorch (H100, batch=1) | torch eager ~1656× / torch.compile ~2207× faster than flame FP64 | honest: flame value = byte-exact · device-resident · no-LLVM, NOT step-rate; kernel-fusion (capture/replay, fwd+bwd) = ~1.0× closed-neg |

Honest scope: flame's ag_tape + nn_lib + opt_* are functionally complete and byte-equal-verified; forge's farr + cuBLAS Dgemm + 11 .cu substrate is complete with the BF16-TC mega-kernel landing as the cuBLAS-relative wall path. End-to-end flame ↔ PyTorch wall comparison is pending an apples-to-apples re-fire; the substantive cuBLAS-relative win currently sits at the forge layer (BF16-TC 9.67× over FP64-cuBLAS on the FFN-shape mega-kernel).

We now own the GEMM too: the device stack is 100 % hexa-ownable (CUDA-OWN campaign)

The substrate above calls cuBLAS for the GEMM itself, the one piece forge did not own. The CUDA-OWN campaign closes that last gap: an env-gated own-GEMM (HEXA_OWN_GEMM family) routes every matmul through a hexa-emit kernel instead of cuBLAS. OFF by default → cuBLAS stays the default path; flip the env and the entire device GEMM is hexa source: FP64, FP32, and a CUTLASS-grade TF32 WMMA2 tiled kernel.

forge GEMM dispatch (env HEXA_OWN_GEMM / _WMMA2; OFF == cuBLAS default):
  OFF  → cuBLAS Dgemm / Sgemm                         (vendor, default)
  ON   → _hx_k_gemm (FP64)  ·  _hx_k_sgemm_cm (FP32)  ·  _hx_k_sgemm_cm_wmma2 (TF32 WMMA2)
         └─ launcher precedence WMMA2 > WMMA > TILED > naive ─┘   100 % hexa-ownable
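The dispatch rule above (OFF means cuBLAS; ON picks the highest-precedence enabled variant) can be sketched as a small selector. This is a toy Python model under stated assumptions: the variables `HEXA_OWN_GEMM_WMMA` and `HEXA_OWN_GEMM_TILED`, the convention that `"1"` means ON, and the requirement that the base flag be set before any variant flag counts are all hypothetical; only `HEXA_OWN_GEMM`, the `_WMMA2` suffix, and the precedence order come from the text.

```python
def select_gemm_path(env):
    """Pick a GEMM path label from an environment-like mapping (toy model)."""

    def on(name):
        # Assumption: the string "1" means the flag is ON.
        return env.get(name, "0") == "1"

    if not on("HEXA_OWN_GEMM"):
        return "cuBLAS"  # vendor path stays the default

    # Launcher precedence quoted above: WMMA2 > WMMA > TILED > naive.
    # The _WMMA and _TILED variable names are hypothetical.
    for suffix, label in (("_WMMA2", "WMMA2"), ("_WMMA", "WMMA"), ("_TILED", "TILED")):
        if on("HEXA_OWN_GEMM" + suffix):
            return label
    return "naive"
```

For example, `select_gemm_path({"HEXA_OWN_GEMM": "1", "HEXA_OWN_GEMM_WMMA2": "1"})` selects the WMMA2 label, and an empty mapping falls through to cuBLAS.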

Correctness first (own-GEMM vs cuBLAS oracle):

| own-GEMM path | shape / harness | correctness vs cuBLAS oracle | verdict |
| --- | --- | --- | --- |
| FP64 _hx_k_gemm (clm_prod train, cuBLAS-GEMM-free) | D1536 real-corpus train, both arms | max\|Δ CE\| = 0.00000 @ 5-dec, CE descends 4.46624 → 3.64669 | 🟢 F-FUSION-P1-OWN-GEMM-CORRECTNESS |
| FP32 _hx_k_sgemm_cm (hxqwen14b train, cuBLAS-GEMM-free) | M=N=K=2048 R=16 GEMM-bound | rel-RMS ~1e-6 (worst 9.70e-7) all outputs, within fp32 tol | 🟢 F-FUSION-P1D-LLM-SGEMM |
| TF32 _hx_k_sgemm_cm_wmma2 (CUTLASS-grade tiled) | M=N=K=2048 + non-tile-multiple bounds | rel-RMS 2.6e-4 ≪ 3e-3 TF32 bar, bounds-guarded | 🟢 F-FUSION-CUTLASS-GRADE-WMMA |

Performance: ≈ cuBLAS-CLASS, NOT superiority (this is parity, stated plainly):

| measurement | own-GEMM | cuBLAS | gap | verdict |
| --- | --- | --- | --- | --- |
| sustained-loop GPU util (2048³, B200, nvidia-smi) | 89.9 % MEAN / 100 % PEAK | 88.5 % MEAN / 100 % PEAK | both ~90 %, cuBLAS-class occupancy | 🟢 F-FUSION-OWN-GEMM-UTIL |
| GEMM-iso step-time (2048³ WMMA2 vs cuBLAS) | 0.77047 ms/iter | 0.68 ms/iter ref | 1.13× of cuBLAS (within ~13 %) | 🟢 F-FUSION-CUTLASS-GRADE-WMMA |
| LLM full-step (LoRA, M=8192, shape-dispatch + split-K) | 454.9 steps/s | 565.2 steps/s | 1.24× of cuBLAS (down from raw 2.24×) | 🟢 F-FUSION-THRU-PARITY · F-FUSION-SPLITK-SKINNY |

The full-step gap, measured against 1.00× parity, closed in two landed steps: skinny-shape dispatch (16×16 tiled) took the raw 2.24× → 1.67× (~46 % of the gap, F-FUSION-THRU-PARITY), then a split-K skinny GEMM took 1.67× → 1.24× (a further 64 % of what remained, F-FUSION-SPLITK-SKINNY); cumulatively ~80 % of the original 2.24× gap closed.
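The percentages in the paragraph above follow from measuring each slowdown ratio against 1.00× parity. The Python sketch below reproduces that arithmetic; the function name is illustrative, and treating the parity point as exactly 1.0 is an assumption about how the figures were derived.

```python
def fraction_closed(before, after, parity=1.0):
    """Share of the remaining gap-to-parity removed by going before -> after.

    Ratios are 'x times cuBLAS' slowdowns; parity=1.0 is an assumed baseline.
    """
    return (before - after) / (before - parity)


first = fraction_closed(2.24, 1.67)   # skinny-shape dispatch
second = fraction_closed(1.67, 1.24)  # split-K skinny GEMM
total = fraction_closed(2.24, 1.24)   # both steps together
```

With the rounded ratios quoted in the text, the three values come out at roughly 0.46, 0.64 and 0.81, in line with the "~46 %", "64 %" and "~80 %" figures.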

⚠ ARCH CAVEAT: the 1.13× iso figure is Blackwell-sm_120-ONLY; native sm_90 H100 is a different story (F-FUSION-WMMA2-SM90-VERIFY #2796 · F-FUSION-SM90-DYNSHARED-FIX). The WMMA2 kernel's staging is 57 344 B of shared, which exceeds the sm_90 per-block static __shared__ cap (49 152 B), so on native Hopper sm_90 the kernel originally did not launch at all (cudaErrorInvalidValue); the 1.13× was measured on Blackwell sm_120, whose larger static admit absorbed it. F-FUSION-SM90-DYNSHARED-FIX converts the staging to extern __shared__ (dynamic) + cudaFuncSetAttribute(...MaxDynamicSharedMemorySize, 57344); this DOES make WMMA2 LAUNCH on sm_90 (compute_cap 9.0 verified, correctness rel-RMS 4.77e-06 PASS). But native-H100 PARITY is NOT restored: on sm_90 the own WMMA2 GEMM measures 1.49 ms/iter @ 2048³ = 29.46× slower than cuBLAS (0.0507 ms/iter), because the kernel is register/occupancy-bound on Hopper (REG:236/thread → ~1 block/SM). The Blackwell 1.13× did not transfer to native sm_90: a separate occupancy axis, not the shared-mem launch fix.
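The two numeric claims in the caveat above (the staging buffer exceeds the static shared-memory cap, and the per-iteration timings imply a roughly 29× gap) can be re-derived from the quoted figures. The Python sketch below does that. The helper names are illustrative, and counting a GEMM as 2·N³ floating-point operations is the usual convention, assumed here because the text does not state how its TFLOP/s figures were computed.

```python
SM90_STATIC_SHARED_CAP = 49_152  # bytes, per-block static __shared__ cap quoted above
WMMA2_STAGING_BYTES = 57_344     # bytes, WMMA2 staging size quoted above


def needs_dynamic_shared(staging_bytes, static_cap=SM90_STATIC_SHARED_CAP):
    """True when the staging buffer cannot fit in static shared memory."""
    return staging_bytes > static_cap


def gemm_tflops(n, ms_per_iter):
    """Throughput of an n x n x n GEMM, assuming 2*n^3 flops per GEMM."""
    return 2 * n**3 / (ms_per_iter * 1e-3) / 1e12


own = gemm_tflops(2048, 1.49)       # own WMMA2 GEMM on sm_90
vendor = gemm_tflops(2048, 0.0507)  # cuBLAS reference
slowdown = vendor / own             # same as 1.49 / 0.0507
```

From the rounded timings the slowdown lands near 29.4×; the quoted 29.46× presumably comes from unrounded measurements.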

↳ occupancy axis TESTED and RULED OUT (F-FUSION-SM90-WARPTILE-RETUNE, closed-negative). A register-reduced WMMA2 variant (_hx_k_sgemm_cm_wmma2_rr, env HEXA_OWN_GEMM_WMMA2_RR, default OFF, math-identical) cut registers 236 → 128/thread via __launch_bounds__(256,2) + streamed input fragments, doubling occupancy 1 → 2 blocks/SM, yet on native sm_90 H100 own throughput did not rise: 11.1 → 10.7 TFLOP/s (-4%, the cuBLAS gap widened 31.4× → 32.5×; rel-RMS 4.77e-06 both). So register/occupancy is NOT the binding constraint: the WMMA2 own-GEMM is bound by its inner-loop math pipeline (per-element software __float_to_tf32, depth-1 cp.async prefetch, scalar epilogue), not block count. Closing the ~31× sm_90 gap needs a cuBLAS-class TC mainloop rework (deep pipelining, ldmatrix/mma.sync w/o per-element TF32 rounding, register-blocked accumulation), a multi-session pipeline rewrite, not a one-knob retune.

↳ cuBLAS-class mainloop ATTEMPTED: 3/4 levers landed, mma.sync is the ceiling (F-FUSION-SM90-CUBLAS-MAINLOOP, closed-negative). A reworked own-GEMM (_hx_k_sgemm_cm_wmma2_cb, env HEXA_OWN_GEMM_WMMA2_CB, default OFF) landed three of the four cuBLAS-class levers: L1 hardware-TF32 mma.sync.m16n8k8 (round fused via cvt.rna.tf32.f32, dropping the per-element software __float_to_tf32 sweep), L2 a deep multi-stage cp.async pipeline, and L4 a register-blocked epilogue (4 D-regs/thread written straight to col-major C). The fourth, L3 ldmatrix, did NOT land: ldmatrix.x4 is a 16-bit-element op and the TF32 operands are 32-bit (named residual: a 32-bit ldmatrix.trans swizzle). On native sm_90 H100 (2048³) the CB variant is numerically exact (rel-RMS 0.000e+00, bit-equal to the cuBLAS-TF32 oracle on this seed) and +3.4% faster than the parent (11.17 → 11.55 TFLOP/s, gap 30.4× → 29.4×; only ~3.4% of the cuBLAS gap closed). The on-pod stage-depth sweep saturated at 2 stages (a second independent confirmation that global→shared latency is not the bottleneck). Finding: the binding constraint is the mma.sync warp-level instruction class itself. On Hopper, cuBLAS reaches TC peak via wgmma.mma_async (warpgroup-level async MMA) fed by TMA (cp.async.bulk.tensor), a different instruction class mma.sync cannot reach regardless of mainloop tuning. The remaining ~29× is a wgmma + TMA rewrite (a CUTLASS-3.x-class sm_90a kernel), not an mma.sync mainloop retune. Honest scope: parity-seeking, cuBLAS = roofline, no superiority claim.

↳ wgmma + TMA rewrite: FEASIBILITY PASS, layout residual (F-FUSION-SM90-WGMMA-TMA). The named wgmma/TMA lever is now build- and run-feasible on native sm_90 H100: with -arch=sm_90a (nvcc 12.6), wgmma.mma_async executes correctly (f16 probe: nonzero 2048/2048, sum 1962.49 vs ref 1956.87) and the entire Hopper async PTX surface compiles: cuTensorMapEncodeTiled (TMA) + cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes + mbarrier.*. This converts the prior F-FUSION-ATTN-WGMMA-WALL hardware-blocked closed-negative (same kernel silently NOP'd on Blackwell sm_120) into testable-on-Hopper. The own-source emit path is NOT the blocker. The standalone TF32 warpgroup mainloop builds + launches but is not yet bit-correct (rel-RMS 1.309e+00 vs 3e-3); an isolated descriptor sweep + a structured-input diagnostic proved the binding residual is the wgmma no-swizzle 8×16B core-matrix shared-memory layout of the operands (NOT the instruction, descriptor offsets, or the verified-correct epilogue register→C mapping). No own GFLOP/s is reported: g5 forbids perf on a wrong-result kernel; parity NOT measured. Kit: self/native/wgmma/. Honest scope: parity-seeking, cuBLAS = roofline, no superiority claim.

↳ swizzle wedge: residual PINNED to the B-operand K-core stride, NOT a permutation (F-FUSION-SM90-WGMMA-SWIZZLE, closed-negative this pass). An on-hardware reverse-engineering (distinct-ramp operand + one-hot selector; native H200 sm_90a, nvcc 12.6, pod DESTROYED leak 0) isolated the rel-RMS 1.309 residual to two superimposed defects in the wgmma B no-swizzle core-matrix layout: (1) a K-stride collapse: for contraction index k=1..7 wgmma re-reads B's K=0 core-matrix (the decoded k′ is pinned to 0 for every K-selector), the dominant ≈√2 error; (2) an N-octet interleave (output col n reads logical col ≈4n within an 8-wide octet). A >2300-config exhaustive sweep (A/B shared layout ∈ {plain row-major, two 8-row-strip core tilings, col-major-B} × descriptor (LBO,SBO) ∈ {16…512} for both operands × 3 epilogue register-maps, fault-isolated per-process) found no config below rel-RMS 1.36, deterministically ruling out the hypothesis that the residual is a plain-layout/offset/epilogue permutation. The fix requires the genuine CUTLASS GMMA::Layout core-matrix builder (B's 8 K-values forming a contiguous 8-row core-matrix, descriptor LBO = one-core-matrix stride, swizzle field matched to the TMA cuTensorMapEncodeTiled swizzle mode), verified FIRST on the single-tile decode probe to k′==KSEL identity before any 2048³ run. Kit: self/native/wgmma/{wgmma_tf32_decode,wgmma_tf32_bdecode,wgmma_tf32_full}.cu. Still parity-seeking, no perf number on a non-bit-correct kernel (g5).

↳ GMMA::Layout core-matrix builder: wgmma TF32 is now BIT-CORRECT on native sm_90a (F-FUSION-SM90-WGMMA-GMMA-LAYOUT, 🟢 numerical). The swizzle is SOLVED. The root cause of the >2300-config dead end was a single wrong constant: a wgmma core matrix is 8 rows × 16 bytes, which for TF32 (4 B/elem) is 8 rows × 4 ELEMENTS, not the 8×8 K-strip the prior kit assumed; that 8-vs-4 mismatch is both pinned defects (K-stride collapse + N-octet interleave). Implementing the real CUTLASS-3.x GMMA INTER (no-swizzle) 8×4 core-matrix layout (gmma_phys = (strip*2+kcore)*32 + sr*4 + kc, descriptor start[0,14) LBO[16,30) SBO[32,46) layout_type[62,64)=INTERLEAVE, LBO=128 B / SBO=256 B inter-core strides) made the single 64×64×8 wgmma tile bit-exact (W2 rel-RMS 0.000e+00, native H100 SXM cc 9.0, -arch=sm_90a, nvcc 12.6, pod DESTROYED leak 0). Scaling to the full 2048³ GEMM revealed a second, separate defect: a K-loop async-proxy ordering bug (wgmma reads shared through the async proxy, which ordinary __syncthreads does not order against generic stores; non-deterministic 3e-2…1e-1 past K≈1536). Adding fence.proxy.async.shared::cta after staging makes the K-loop bit-exact at 2048³ (W3 own-vs-cuBLAS-TF32 & own-vs-CPU-f64 both rel-RMS 0.000e+00, deterministic). Parity is now MEASURABLE (g5 satisfied): the naive single-wgmma-per-block kernel runs 20.2 TFLOP/s @ 2048³ (cuBLAS-TF32 357.5, 17.67× off, PARITY=NO); a first pipeline tune (wide-N TN=128, 2 wgmma/K-step reusing A) nearly doubles it to 38.0 TFLOP/s (9.35× off), still bit-exact (TN=256 is slower: register/occupancy bound). The own-GEMM is now provably CORRECT on sm_90a; the remaining gap is a pure latency-hiding residual (a full warp-specialized TMA multi-stage CUTLASS mainloop with a cp.async.bulk.tensor producer, a wgmma consumer and a deep pipeline; a multi-session build), NOT the layout (solved) and NOT correctness (bit-exact 2048³). cuBLAS = roofline, no superiority claim. Kit: self/native/wgmma/{wgmma_tf32_gmma,wgmma_tf32_gemm2048,wgmma_tf32_gemm_w5,wgmma_tf32_gemm_w5b}.cu.
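The index formula quoted above, gmma_phys = (strip*2+kcore)*32 + sr*4 + kc, can be exercised on one 64×8 TF32 operand tile to confirm it is a dense, collision-free packing. The Python sketch below is illustrative only: the text gives the formula but not how strip, sr, kcore and kc are derived, so the decomposition used here (8-row strips, 4-element K cores) is an assumption chosen to match the "8 rows × 4 elements" core matrix described in the paragraph.

```python
ELEM_BYTES = 4  # TF32 is stored as a 32-bit element


def gmma_phys(row, k):
    """Element offset of logical (row, k) inside a 64 x 8 TF32 tile.

    Assumed decomposition (not stated in the text):
      strip, sr = row // 8, row % 8   -> 8-row strip and row within it
      kcore, kc = k // 4,  k % 4      -> 16-byte K core and element within it
    """
    strip, sr = divmod(row, 8)
    kcore, kc = divmod(k, 4)
    return (strip * 2 + kcore) * 32 + sr * 4 + kc


# Every logical element should land on a distinct offset in [0, 512).
offsets = sorted(gmma_phys(r, k) for r in range(64) for k in range(8))

# Byte strides between neighbouring core matrices along K and along rows.
k_core_stride = (gmma_phys(0, 4) - gmma_phys(0, 0)) * ELEM_BYTES
strip_stride = (gmma_phys(8, 0) - gmma_phys(0, 0)) * ELEM_BYTES
```

Under this assumed decomposition the two inter-core strides come out as 128 B and 256 B, consistent with the LBO=128 B / SBO=256 B values quoted in the text.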

↳ sm_90a wgmma+TMA own-GEMM (Hopper): the latency-hiding ladder W6→W10, then the canonical-atom leap to TF32 cuBLAS-PARITY (F-FUSION-SM90-WGMMA-OG16/OG17, 🟢 PARITY). With the layout solved and the kernel bit-exact, the residual is pure async-pipeline engineering, and it is now being walked down rung-by-rung on native sm_90a H100, each rung bit-exact (rel-RMS 0.000e+00 vs the cuBLAS-TF32 oracle), with perf reported only because the kernel is bit-correct (g5):

sm_90a wgmma+TMA own-GEMM ladder (TF32, @4096³, native H100 sm_90a, bit-exact rel-RMS 0):
  rung  lever                                          own TFLOP/s   gap vs cuBLAS-TF32 (~431)   occupancy
  W6    async cp.async pipeline                        50.7          8.39× off                   -
  W8    HW-TMA single-elected-thread producer          66.5          6.44× off                   2 CTA/SM
  W10   composed swizzle-decode (permute-free)         70.7          6.09× off                   2 CTA/SM
  OG16  canonical-atom (global re-encode · band-free)  264.7         1.37× off                   2 CTA/SM
  OG17  + relaxed-pipeline ping-pong (S=2048)          280           1.24× off  ★ PARITY         2 CTA/SM
  ────────────────────────────────────────────────────────────────────────────────────────────────────────
  cuBLAS-TF32 = ~431 TFLOP/s roofline; OG17 CROSSES PARITY (≤1.3×) @ S=2048, bit-exact (NOT beaten, no superiority)
  FP16 port (OG18/OG19): own 504 TFLOP/s · 13.37×→1.64× off cuBLAS-FP16 (recipe generalizes; 2× roofline = the ceiling)
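The ladder's "gap" column and its parity verdict reduce to one ratio and one threshold. The Python sketch below restates that bookkeeping with the figures quoted in the ladder; the function names are illustrative, and the 1.3× parity threshold is taken from the "(≤1.3×)" note in the table.

```python
PARITY_THRESHOLD = 1.3  # "PARITY (<=1.3x)" as stated in the ladder


def gap(cublas_tflops, own_tflops):
    """How many times faster the cuBLAS reference is than the own kernel."""
    return cublas_tflops / own_tflops


def is_parity(ratio, threshold=PARITY_THRESHOLD):
    """A rung counts as parity when its gap is within the threshold."""
    return ratio <= threshold


# Gap column copied from the ladder above.
ladder = {"W6": 8.39, "W8": 6.44, "W10": 6.09, "OG16": 1.37, "OG17": 1.24}
crossed = [rung for rung, ratio in ladder.items() if is_parity(ratio)]

# Cross-check of one row: W10 at 70.7 TFLOP/s against cuBLAS-TF32 at 430.8.
w10_gap = gap(430.8, 70.7)
```

Only the OG17 rung falls under the threshold, and a 1.24× gap corresponds to about 81 % of the reference throughput (1 / 1.24).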

W6's async pipe (50.7) gave way to W8's hardware-TMA producer: a single elected thread drives cp.async.bulk.tensor while the TMA engine does the global→shared copy, freeing the producer warpgroup and shrinking the CTA to 2 CTA/SM (66.5 TFLOP/s, 6.44× off, F-FUSION-SM90-WGMMA-W8). W10 then composes the FP32 SWIZZLE_128B law with the GMMA-INTER 8×4 core packing into a software composed-decode, eliminating the per-K-step permute scratch (SASS 28 STS → 0) so the kernel stays permute-free at full 2 CTA/SM: 70.7 TFLOP/s @4096³, 6.09× off cuBLAS-TF32 (430.8), bit-exact rel-RMS 0 (F-FUSION-SM90-WGMMA-W10, PRs #2841/#2847). At the W10 rung PARITY IS NOT ACHIEVED: at 6.09× off, cuBLAS-TF32 remains the roofline and there is NO superiority claim. This is honest parity-seeking progress: the win is that the wgmma+TMA own-source path is bit-exact, and the named residual is async-pipeline engineering (larger tiles / warp-specialization / ping-pong, per research note #2846), NOT a missing algorithm and NOT a correctness gap. Kit: self/native/wgmma/{wgmma_tf32_warpspec,wgmma_tf32_w10}.cu.

↳ OG16→OG17: TF32 own-GEMM CROSSES cuBLAS PARITY on native sm_90a (F-FUSION-SM90-WGMMA-OG16/OG17, 🟢 PARITY, PRs #2866/#2870). The named async-pipeline residual is now CLOSED at S=2048. OG16 found the real lever the W-ladder was missing: re-encode A/B in global memory into the canonical CuTe Layout_K_SW128/gmma-INTER atom + a no-swizzle TMA, so the SMEM tile is the wgmma operand (descriptor-direct); the 32 KB in-kernel decode band is GONE, not just removed-then-decoded. That dissolved the decode-band⊥occupancy wall (OG11→OG15 had FALSIFIED a descriptor-field-only fix) and took own 70.2 → 264.7 TFLOP/s, 6.09× → 1.37×, smem 96→64 KB @ 2 CTA/SM, bit-exact rel-RMS 0 @2048³&4096³. OG17 then added the relaxed-wait_group 1 ping-pong pipeline (next K-slab's wgmma issue overlaps this slab's tensor-core drain, mbarrier-ring safe, 2 CTA/SM held) → own 280 TFLOP/s, ratio 1.24× = PARITY (≤1.3×) @ S=2048, bit-exact, ≈81 % of cuBLAS-TF32. The 'native-H100 own-GEMM can't reach cuBLAS-TF32' wall is CLOSED. Honest residual: @4096 stays 1.56× (a 256-tile register/ptxas wall, closed-neg). OG18/OG19 port the SAME recipe to FP16/BF16 (re-derived 8×8/128B f16 atom): own 61 → 504 TFLOP/s, 13.37× → 1.64× off cuBLAS-FP16 (+8.2×, bit-exact rel_rms 0). The canonical-atom recipe generalizes across dtype, but FP16 PARITY is NOT crossed (1.56–1.64×) because cuBLAS-FP16's roofline is ~2× TF32 and the residual is occupancy/pipeline-depth on that doubled roofline. cuBLAS = roofline throughout, parity-seeking, no superiority claim. The reusable recipe (canonical-atom · relaxed-pipeline · the under-fill/saturated regime law) is folded into commons.tape g82 + the hexa dojo (PR #2869). Kit: self/native/wgmma/{wgmma_tf32_og16,wgmma_tf32_og17,wgmma_fp16_og18}.cu.

Util is a workload-size property, not a defect (F-FUSION-D2-RIGHTSIZED): the byte-identical D1536 own-GEMM step that under-fills an idle H100 to ~13 % MEAN (median 2 %) saturates a right-sized RTX 5070 to 98.00 % MEAN (every sample 98 %, SM 98 %, compute-bound); the 2048³ large shape gives 99 % on the same 5070. Low util on the H100 is the H100 being too big for a D1536 model, not a codegen flaw; given a GPU sized for the workload, util is at the saturation ceiling.

The sizing axis is now measured-exhausted on the real full step: util-GREEN is structural, not a knob (F-FUSION-M3 · F-FUSION-M5). Two falsifiers ran on the real clm_prod training step (not a standalone GEMM) on an idle H100: scaling the model D 1536→4096 makes util worse (MEAN 10.57 % → 6.64 %, F-FUSION-M3), and growing the batch B 1→32 (GEMM M-dim 512→16384) leaves util flat at ~0.45 % (F-FUSION-M5, MEDIAN 0 % throughout). Neither bigger-model nor bigger-batch fills the GPU: the wall is the serial per-step structure. FP64-GEMM bursts separated by per-position glue idle the device between launches, so MEAN util is a duty-cycle invariant to workload height. The only measured lifts are a precision change (TF32 megakernel, +3–5 pp below) or a right-sized GPU (the RTX 5070 above), not scale. This is the honest closure of the util-99 % north-star: on a big GPU at small D/B, 99 % is unreachable by sizing.

Honest limits (g5): the own-GEMM is ≈ PARITY, not superiority: it is still 1.13× (iso, Blackwell sm_120) to 1.24× (full-step) slower than cuBLAS, never faster. The 1.13× iso parity is sm_120-specific: on native Hopper sm_90 the same WMMA2 GEMM is 29.5× off cuBLAS (first attributed to occupancy, later traced to the mma.sync instruction class) and originally would not even launch; see the ARCH CAVEAT above. The README's existing honesty that "a single huge GEMM already ties cuBLAS at roofline" still holds: owning the GEMM does not change that ceiling, it just makes the ceiling hexa-owned. And the BF16-TC 9.67× above is a separate dtype axis (BF16-TC vs FP64-cuBLAS); it is NOT the own-vs-cuBLAS same-dtype comparison reported here. The win of owning the GEMM is not speed; it is capability cuBLAS structurally cannot offer.

⭐ The trade, stated plainly: you pay a small parity tax to buy four capabilities cuBLAS cannot give

A ~13–24 % same-dtype speed tax buys a column cuBLAS leaves empty. These are capability wins (what is possible), not speed wins, and they are exactly what hexa's domains (reproducible science, byte-equal CLM/RTSC, megakernel fusion) require:

| capability you gain | own-GEMM | cuBLAS | proof / where it pays off |
| --- | --- | --- | --- |
| 🎲 Determinism (byte-exact) | bit-reproducible by construction: you fix the reduction order | DMMA accumulation order is vendor-"unspecified", drifts across GPU generations, un-matchable from outside | the byte-eq capstone (max\|Δ\| = 0) · audit-grade reproducible training · clean A/B (zero GEMM noise) · multi-GPU bit-consistency |
| 🧩 Fusion (megakernel-resident) | callable inside a persistent / cooperative kernel: GEMM becomes just another resident op | cannot be nested in a device kernel (host library call); forces "stop, write HBM, hand off" | whole-step megakernel (F-FUSION-M2) · 11-op fwd in 1 launch (F-FUSION-LAUNCH-AMORT) · FlashAttn-style fused attention |
| 🔢 FP64-exact + custom epilogue | arbitrary precision + fused epilogue in one kernel; FP64 GEMM is a byte-exact oracle | FP64 epilogue fusion absent; IEEE-float only | clm_prod FP64 train max\|ΔCE\| = 0 (oracle baseline) · RTSC/DFT-grade FP64 science · non-IEEE dtypes (posit · n=6 lattice) |
| 🔓 Ownership (no vendor lock) | 100 % hexa source → PTX → SASS, no LLVM, no C-transpile, single binary | closed black box, NVIDIA-only, multi-GB libtorch to ship | edge / offline single-binary deploy · SASS-level kernel research · the 💎 identity |

Net: "≈ cuBLAS speed (−13–24 % tax) + determinism + fusion + FP64-exact + ownership that cuBLAS cannot express." For a workload that just wants a fast standalone matmul, cuBLAS is simpler and faster. For hexa's targets (reproducible, fusible, owned, FP64-correct) the trade is decisively worth it, and the byte-eq megakernel capstone is the existence proof: it is only reachable on the own-GEMM stack.

Where it beats cuBLAS-using stacks structurally (whole-program fusion · cuBLAS cannot express)

cuBLAS ships a champion part (the GEMM kernel itself, already at roofline), but cannot fuse adjacent ops: each op pays a separate kernel launch + a full HBM round-trip. hexa codegen sees the whole expression and emits one kernel that keeps intermediates in registers / shared memory:

cuBLAS-using stack (current default: 3 ops = 3 launches, 3 HBM round-trips):
  ┌──GEMM──┐         ┌──bias──┐         ┌──GeLU──┐
  │ launch │ → HBM → │ launch │ → HBM → │ launch │ → HBM
  └────────┘         └────────┘         └────────┘

hexa fusion (whole-program: one kernel, registers/shmem reused):
  ┌──── GEMM + bias + GeLU ────┐
  │  1 launch · 1 HBM write    │ → HBM            (F-FUSION-EPILOGUE-GEMM-BIAS-GELU)
  └────────────────────────────┘                  66.667 % launch + HBM-write reduction

The same mechanic generalises: GEMM-epilogue, norm surface, attention block, autoregressive decode chain. Every place where cuBLAS forces "stop the GEMM, write to HBM, hand off to the next op", hexa can keep the value in registers.
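The fused-versus-unfused contrast can be illustrated on the CPU with plain Python lists, counting each materialized intermediate as a stand-in for one launch plus one HBM write. This is a conceptual sketch only, not GPU code: the function names are illustrative, the write counter is a modelling device, and the erf-based GeLU is an assumption (the text does not say which GeLU variant the kernel uses).

```python
import math


def gelu(x):
    """Exact (erf-based) GeLU; the variant is an assumption."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def unfused(a, b, bias):
    """Three passes; each one materializes a full intermediate matrix."""
    rows, inner, cols = len(a), len(b), len(b[0])
    y = [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]                                   # pass 1: GEMM
    y = [[y[i][j] + bias[j] for j in range(cols)] for i in range(rows)]  # pass 2: bias
    y = [[gelu(v) for v in row] for row in y]                    # pass 3: GeLU
    return y, 3


def fused(a, b, bias):
    """One pass; bias and GeLU are applied before the value is stored."""
    rows, inner, cols = len(a), len(b), len(b[0])
    y = [[gelu(sum(a[i][k] * b[k][j] for k in range(inner)) + bias[j])
          for j in range(cols)] for i in range(rows)]
    return y, 1


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [0.25, 2.0]]
bias = [0.1, -0.2]
y_unfused, writes_unfused = unfused(a, b, bias)
y_fused, writes_fused = fused(a, b, bias)
reduction_pct = (writes_unfused - writes_fused) / writes_unfused * 100
```

Both versions apply the same floating-point operations in the same order per element, so the outputs match exactly, while the pass count drops from 3 to 1; that is where the 66.667 % figure in the diagram comes from.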

Owning the GEMM unblocks the whole-step megakernel โ€” both walls are now closed, and it is realized byte-equal on the real training step. The strongest form of the fusion above is a single persistent / cooperative kernel that holds the whole step in registers + shared memory across a grid-wide barrier. Two structural walls capped this. Wall 1 โ€” the cuBLAS-call wall: a persistent kernel cannot call cuBLAS (you can't make a host library call from inside a running device kernel). Now that the GEMM is our own device kernel (_hx_k_sgemm_cm_wmma2, correctness-verified above), the persistent kernel calls our GEMM in-line โ€” the GEMM stops being an un-fusable cuBLAS hand-off and becomes just another op the megakernel keeps resident. Wall 2 โ€” the GroupNorm full-Y reduction: the GN reduction over all TยทC could not previously live inside the persistent kernel without re-associating the FP64 sum (breaking byte-eq). A grid-sync cooperative kernel (cudaLaunchCooperativeKernel + this_grid().sync()) now closes it: the reduction stays single-thread sequential, so it is byte-eq max|ฮ”|=0 vs the sequential GroupNorm oracle, deterministic (F-FUSION-MEGAKERNEL-GN-GRIDSYNC, #2845, A100-confirmed). With both walls closed the whole-step glue megakernel is fully realized โ€” 100 % hexa-owned, cuBLAS-call-free. Honest scope (g5): closing Wall 2 is a CLOSED-NEGATIVE on util โ€” its value is ownership / structural completeness, NOT a util or throughput win. byte-eq forces the reduction single-thread, so the cooperative launch buys zero reduction-parallelism (idle threads wait at the barrier); no util/perf superiority is claimed. Landed + measured on the real clm_prod step:

| megakernel realization | what is fused into ONE cooperative launch | result | verdict |
|---|---|---|---|
| whole-step megakernel (F-FUSION-M2) | fwd + ce-grad + bwd + AdamW (17 per-param AdamW launches → 1) | [FULLSTEP-FIRED], CE converges, real-step util +3.4 pp | 🟢 F-FUSION-M2-FULLSTEP-MEGAKERNEL |
| TF32 fwd megakernel (F-FUSION-P1) | all fwd GEMMs pulled in (no cuBLAS), TF32 own-GEMM | util 29.0 → 34.5 % MEAN (+5.5 pp), CE descends | 🟢 F-FUSION-P1-TF32-MEGASTEP |
| byte-eq megakernel CAPSTONE (F-FUSION-P1B-a‴) | device-resident fwd, own-GEMM, cooperative, vs eager reference | max\|Δ\| first_ce = last_ce = 0.000000e+00 (17-digit CE bit-identical), util +4 pp while byte-identical | 🟢 F-FUSION-P1B-APRIME3-ASYNCOFF (#2792) |
| both walls closed (Wall 1 cuBLAS-call via own-GEMM · Wall 2 GN full-Y reduction via grid-sync) | grid-sync cooperative GroupNorm: last un-fusable op now megakernel-resident | byte-eq max\|Δ\|=0 vs sequential GN, deterministic · CLOSED-NEG on util (win = ownership/completeness, not a perf lift) | 🟢 F-FUSION-MEGAKERNEL-GN-GRIDSYNC (#2845) |
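
Wall 2 turns on the fact that floating-point addition is not associative: a tree-shaped (parallel) reduction groups the terms differently from a sequential loop, so the two sums can differ in their last bits. A minimal Python sketch of that effect follows. Python floats are IEEE-754 FP64; the function names and the sample data are illustrative assumptions, not hexa or CUDA code.

```python
import struct

def sequential_sum(values):
    """Left-to-right accumulation, the order a single thread would use."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values):
    """Tree-shaped accumulation, the grouping a parallel reduction would use."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

def bits(x):
    """Raw 64-bit pattern of an FP64 value."""
    return struct.pack("<d", x)

# Hypothetical data chosen so that the grouping changes the rounding.
data = [1e16, 1.0, -1e16, 1.0]
seq = sequential_sum(data)   # 1.0
tree = pairwise_sum(data)    # 0.0
print(seq, tree, bits(seq) == bits(tree))
```

Both results are wrong relative to the exact sum (2.0), but the point here is that they are different from each other. Keeping the reduction in one fixed sequential order is what makes a byte-equal comparison against a sequential oracle possible.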

The capstone is the payoff this section points to, now proven: a +util cooperative megakernel that is byte-equivalent to the eager reference (max|Δ| = 0). That is impossible with cuBLAS twice over: cuBLAS can neither be nested in a persistent kernel nor bit-matched from outside, its DMMA accumulation order being vendor-"unspecified". The five-layer hunt for the last ~1e-1 non-determinism (F-FUSION-B6 → P1B-a'') closed on the true cause: an async cross-stream race, not a transcendental or GEMM-order issue. F-FUSION-N1N2 proved every glue kernel is deterministic in isolation and that HEXA_CUDA_ASYNC=0 makes the device forward bit-reproducible (CE 4.4662394504526679 ×5 identical). For byte-reproducible training, set HEXA_CUDA_ASYNC=0 (the synchronous, single-ordered-stream path).
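
Bit-reproducibility in the sense used above means comparing raw bit patterns, not values within a tolerance. The sketch below shows one way such a check could look in Python, assuming the losses of repeated runs are available as FP64 numbers. The helper name `all_bit_identical` is hypothetical, the sample value is the CE quoted above, and `math.nextafter` needs Python 3.9 or newer.

```python
import math
import struct

def all_bit_identical(values):
    """True when every FP64 value has the same 64-bit pattern."""
    patterns = {struct.pack("<d", v) for v in values}
    return len(patterns) <= 1

# The same cross-entropy repeated five times, as quoted in the text.
runs = [4.4662394504526679] * 5
same = all_bit_identical(runs)

# One run that is off by a single unit in the last place is rejected,
# even though it agrees with the others to about 15 significant digits.
drifted = runs[:4] + [math.nextafter(runs[0], math.inf)]
still_same = all_bit_identical(drifted)
print(same, still_same)
```

A tolerance-based comparison such as `math.isclose` would accept the drifted run, which is why a byte-equal oracle is the stricter test.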

| finding | reduction / win | tier |
|---|---|---|
| F-FUSION-EPILOGUE-GEMM-BIAS-GELU | 66.667 % launch + HBM-write reduction (3 launches → 1) @ LLaMA-7B FFN shape, ptxas-clean sm_80 | 🔵 structural-formal |
| F-FUSION-LAUNCH-AMORT | 5-op chain → 1 launch / 3 HBM transfers vs separate-op 5 launches / 11 transfers | 🔵 + $0 deterministic oracle |
| F-FUSION-AXISA-BREADTH (norm surface) | LayerNorm 66 % · RMSNorm 59 % · Softmax 65 % · SwiGLU 63 % | 🔵 structural-formal |
| F-FUSION-ATTENTION-FLASH | single-kernel fused attention (Q·K · softmax · V) | 🔵 + wall ruled-out |
| §5j Custom reductions: LogSumExp 1-kernel (#1657) | numerically-stable max-shift + exp + log + sum in one kernel, silicon-validated rel_err 1.7e-10 | 🟢 SUPPORTED-NUMERICAL |
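
The LogSumExp row relies on the standard max-shift trick: subtracting the maximum before exponentiating keeps every `exp` argument at or below zero, so nothing overflows. The following is a host-side reference sketch in Python of that trick only. It is not the hexa kernel, and the function names are illustrative.

```python
import math

def logsumexp(xs):
    """log(sum(exp(x) for x in xs)), computed with the max-shift trick."""
    m = max(xs)
    if math.isinf(m):
        # All entries are -inf, or at least one is +inf: the shift is undefined.
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def naive_overflows(xs):
    """True when the unshifted formula raises OverflowError."""
    try:
        math.log(sum(math.exp(x) for x in xs))
    except OverflowError:
        return True
    return False

big = [1000.0, 1000.0]
print(logsumexp(big), naive_overflows(big))
```

For the two-element input above the shifted form returns 1000 + log 2, while the unshifted form cannot even evaluate `exp(1000)` in FP64.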

🎯 Who benefits: 7 user personas (the pain → the gain)

cuBLAS-using stacks ship a champion part (the GEMM kernel, already at roofline). hexa wins where the part isn't the bottleneck and the chain around it is. Whether that helps you depends on which pain you actually carry:

| persona | pain you carry | what hexa gives |
|---|---|---|
| 🧪 LLM trainer / inference engineer | attention · norm · decode are memory- / launch-bound, stuck on top of PyTorch | fusion strikes that region directly: 3-op chain → 1 launch + 1 HBM write (66 % ↓) · FlashAttn-style single kernel |
| 🔬 GPU kernel researcher | cuBLAS is a black box; wants SASS-level visibility but can't get it | source → PTX → SASS visible end-to-end · cubin lives in the repo |
| 📦 Single-binary deployer (edge / embedded / offline) | can't ship Python + libtorch (multi-GB) to the target | native arm64 / x86_64 single binary · no Python in the trained artifact |
| 🔢 Non-IEEE arithmetic (posit · interval · n=6 lattice) | cuBLAS is IEEE-float only | custom-dtype codegen: new arithmetic rides the same fusion path |
| 🧠 Autograd debugger | PyTorch C++ Autograd is a black box, can't step through it | ag_tape is all hexa source: read it line by line |
| 🎯 Byte-equal correctness (science · reproducibility) | PyTorch run-to-run drift is common | byte-equal oracles + FMA-contraction-off recipe, max\|Δ\| = 0 |
| ⚡ Fast codegen iteration | hand-CUDA hell: rewrite the fusion every time | the compiler fuses for you: one @gpu_kernel annotation |

Where does hexa's fusion gap land hardest?
                cuBLAS-using stack ─────────┐
                    │  (huge standalone GEMM is fine, can't beat)
                    ▼
  ┌────────────────────────────────────────┐
  │  Intersection where hexa fusion wins   │
  │   ① many memory-bound patterns         │  ← LLM training / inference
  │   ② Python-free deploy                 │  ← edge · embedded · offline
  │   ③ correctness OR visibility needed   │  ← research · science · repro
  │   ④ long chains (decode/optim/AdamW)   │  ← training loop
  └────────────────────────────────────────┘

🍳 Where fusion fires: memory-hierarchy asymmetry

A GPU register access costs ~1 cycle; an HBM access costs ~600 cycles. cuBLAS writes the result to HBM after every op so the next op can read it back; fusion keeps the value in registers.

| scenario | why fusion wins | measured |
|---|---|---|
| GEMM + elementwise epilogue (bias · ReLU · GeLU · dropout) | GEMM output is a large tensor; next op reuses it immediately | F-FUSION-EPILOGUE 66.7 % ↓ |
| norm surface (LN / RMSNorm / Softmax / SwiGLU) | reduce + immediate-neighbor reuse · norm is memory-bound | AxisA LN 66 % · RMS 59 % · SM 65 % · SwiGLU 63 % |
| Attention block (Q·Kᵀ · softmax · V) | giant intermediate attention matrix → avoiding HBM round-trip is the win | F-FUSION-ATTENTION-FLASH 🔵 |
| Small-op chain (LLM autoregressive decode · AdamW step) | launch overhead dominates over compute | F-FUSION-LAUNCH-AMORT 5-op → 1 launch |

fusion gain  =  (chain length)  ×  (memory-bound-ness)  ×  (intermediate-tensor size)

Honest scope on where it doesn't: a single huge GEMM (already compute-bound, ties cuBLAS at roofline) · a lone op (nothing to fuse) · very small GEMMs (launch-bound is the real problem, not fusion).
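
The dependence on chain length can be sketched with a toy cost model. The model below is an illustrative assumption, not a measurement: it treats the chain as purely elementwise, charges every unfused op one HBM read and one HBM write of the full tensor, and charges the fused kernel a single read and a single write. All names and the 4-byte element size are hypothetical.

```python
def unfused_cost(num_ops, num_elements, bytes_per_element=4):
    """Separate kernels: one launch per op, each op reads and writes HBM."""
    launches = num_ops
    hbm_bytes = num_ops * 2 * num_elements * bytes_per_element
    return launches, hbm_bytes

def fused_cost(num_ops, num_elements, bytes_per_element=4):
    """One kernel: a single launch, one HBM read and one HBM write in total."""
    launches = 1
    hbm_bytes = 2 * num_elements * bytes_per_element
    return launches, hbm_bytes

def savings_pct(num_ops, num_elements):
    """Percentage of launches and of HBM bytes removed by fusing the chain."""
    l0, b0 = unfused_cost(num_ops, num_elements)
    l1, b1 = fused_cost(num_ops, num_elements)
    return 100.0 * (l0 - l1) / l0, 100.0 * (b0 - b1) / b0

for ops in (1, 3, 5):
    launches_saved, bytes_saved = savings_pct(ops, num_elements=1 << 20)
    print(ops, round(launches_saved, 1), round(bytes_saved, 1))
```

In this model a lone op saves nothing, a 3-op chain saves two thirds of both launches and traffic, and the saving grows with chain length. The percentage does not depend on tensor size here, but the absolute number of HBM bytes avoided scales linearly with it.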

One line: cuBLAS = a one-dish specialist (master of the stew). hexa fusion = a one-pan dinner (multiple steps in sequence on the same heat). Users whose workload's time distribution overlaps the four scenarios above land on hexa's real gap.

Detail: stdlib/flame/README.md (canonical perf table + RETRACTION note) · stdlib/flame/PERF.md · stdlib/flame/PLAN.md (campaign log + cycle ledger) · self/forge/PLAN.md · self/forge/PARADIGM.md (Phase R measured verdicts) · GPU.md §1h-1o fusion-moat fires · GPU.easy.md (friendly persona sidecar) · state/anima_handoff_2026_05_19.md (integration recipe).


Status

The closure round's fixed points, with witnesses on disk:

  • 41ecfb97: RFC-020 A4 enum-payload codegen restored in SSOT codegen_c2.hexa (regen-safe; test_enum_payload_full 15/15 codegen + interp)
  • 46016739: builtin/method taken-by-value → __hxthunk_<name> codegen (fixes hexa_callN(<builtin>) undeclared) + un-doubled hexa_cc.c
  • 6c0fbac7: exec_stream_kill(h) runtime builtin (fork+setpgid stream child, SIGTERM→grace→SIGKILL)
  • 4725c619: stdlib/semver.hexa, SemVer 2.0.0 parse/compare/range-satisfies (test_semver 110/110)
  • df9e7f6b: install-relative stdlib/ discovery + HEXA_INSTALL_DIR passdown (use "stdlib/*" works without HEXA_LANG/HEXA_STDLIB_ROOT)
  • 0ba5fd7d: shell-builtin absorption: pwd → cwd()/getcwd(), ls → list_dir() intrinsics (absorbed 638→752, pending 197→83)
  • 731f41d6: hexa cc resolves hexa_cc.c/SSOT/-I via $HEXA_LANG > install_dir > ./self (works out-of-tree)
  • a5de44e2: self/stdlib/law_io.hexa selftest main() → tool/law_io_selftest.hexa (u_main collision on flatten)
  • dae438ee: ~/.hx/bin/hexa_real re-promoted from HEAD 46016739 (sha cd817981…)
  • 774c5d32 / 4f5f8f07: stage-1 punch-list v2: A1+A2 host re-promote → #13 RSS re-probe peak ~782 MB (vs 3 510 MB); P0 stage-1 OOM closed at current scale
  • 571df583 / a8ff675b: SPEC §19/§20 reconcile + Gap-15 close-out
  • 340c3788 / 5ddcf2a9: wilson↔hexa-lang closure (VERIFIED: hexa build core/main.hexa → wilson 0.0.1) + SPEC closure-round fold-in

Snapshot derived from git log on main; full tables at SPEC.yaml::phases_completed_2026_05_09 and SPEC.yaml::phases_completed_2026_05_11_closure.


Decisions (the spine)

Six choices that shape everything else, pinned in SPEC.yaml:

  1. Native compiled, direct codegen: no LLVM, no C-transpile. The tree-walking interpreter is retired: the self-host stage reached a byte-equal fixed point, and hexa run compiles then executes.
  2. Atlas static-baked into the compiler binary: ATLAS_HASH pinned, drift handled by CI auto-rebuild. Runtime atlas-load cost: 0 ms.
  3. Strict compile-time fatal lint: Python SyntaxError + TypeScript strict model. S0–S5 + S8 always fatal. No --unsafe. No HEXA_STRICT=0.
  4. @grace is the only opt-out: @grace(HXxxxx, until="...", reason="...") per site, every site emits HX9000 at every compile, CI requires Acked-grace: trailer.
  5. ε self-proof: verified functions auto-register as atlas L[*] theorems; tombstones cascade on prover upgrade; HX1099 fires on citing a tombstoned law.
  6. ENGLISH ONLY diagnostics: catalog, hexa explain, stdlib docs. RFCs and meta docs may stay bilingual.

Full record: 14+ pinned decisions, all traceable to RFC-017 through RFC-020.
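
Decision 5 says tombstones cascade and that citing a tombstoned law is a build error. One way to picture the cascade is a reachability walk over a "cites" graph, sketched below in Python. The graph shape, the law ids, and both function names are hypothetical. The real sweep may use a different data model, and only the HX1099 code is taken from the text.

```python
from collections import deque

def cascade_tombstones(cites, failed):
    """Return every law that ends up tombstoned.

    `cites` maps a law id to the law ids it cites. `failed` holds the laws the
    upgraded prover rejected. A law is tombstoned if it failed or if it cites,
    directly or transitively, a tombstoned law.
    """
    dependents = {}
    for law, cited in cites.items():
        for c in cited:
            dependents.setdefault(c, set()).add(law)
    tombstoned = set(failed)
    queue = deque(failed)
    while queue:
        law = queue.popleft()
        for dep in dependents.get(law, ()):
            if dep not in tombstoned:
                tombstoned.add(dep)
                queue.append(dep)
    return tombstoned

def check_citation(law_id, tombstoned):
    """Mimic the HX1099 diagnostic: citing a tombstoned law is an error."""
    if law_id in tombstoned:
        raise ValueError(f"HX1099: L[{law_id}] is tombstoned")

# Hypothetical atlas fragment: c cites b, b cites a, d stands alone.
example = {"a": [], "b": ["a"], "c": ["b"], "d": []}
dead = cascade_tombstones(example, {"a"})
print(sorted(dead))
```

With `a` rejected, `b` and `c` fall with it while `d` survives, so `check_citation("d", dead)` passes and `check_citation("c", dead)` raises.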


Install

# Single-line bootstrap: installs `hexa` + `hx` (the package manager) + atlas
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/dancinlab/hexa-lang/main/install.sh)"

# Verify
hexa --version
hx --version

The installer drops hexa, hx, hexa_ld, and the atlas seed into ~/.hx/; binary path is added to your shell's PATH via the relevant rc file. Self-update: hexa self-update (compares against the published manifest, atomic swap of ~/.hx/bin/hexa_real).

Run

hexa parse <file>.hexa                 # cheapest signal: syntax + reserved-word + @plugin attr check
hexa build <entry>.hexa -o build/X     # full pipeline → static binary
hexa cc <file>.hexa -o build/X.o       # just lower → object (HIR → MIR → LIR → emit)
hexa run <file>.hexa [<args>...]       # compile then execute a single file
hexa explain HX8004                    # what does this diagnostic mean
hexa atlas lookup <id> | --prefix=<p>   # read atlas node(s) from the embedded.gen.hexa SSOT
hexa atlas register --from-verify <fn> <args> <v>   # verify IN-PROCESS → fold node into embedded.gen.hexa
hexa atlas export [--out PATH]          # export live atlas → portable .n6 (n6 = export-only)
hexa drill --seed "<expr>"             # OUROBOROS smash → ... → absorb cycle

hx install <package>                   # install a hexa package by name (looks up dancinlab GitHub by default)
hx update                              # pull updates for all installed packages
hx list                                # what's installed under ~/.hx/bin/

hexa run compiles a file then executes it in one shot, which is convenient for single-file scripting. Release-grade builds go through hexa build, which produces a reusable static binary.

Compile speed

hexa cc now emits #include "runtime.h" by default and links the precompiled runtime.o instead of re-generating the runtime code on every build. On bench/*, the 28-program average is an 8.41× user-time speedup over the old #include "runtime.c" path (peak 17.25× on small-to-medium user code, where runtime.c was the dominant per-build cost). Repro: bin/hexa-fast bench <file>.hexa. Full history at COMPILE-SPEED.tape (architecture) and COMPILE-SPEED.log.tape (measurement events).

bin/hexa-fast <src.hexa> <bin>          # explicit compile (uses runtime.h + runtime.o cache)
bin/hexa-run  <src.hexa> [args...]      # compile-or-reuse-cached + exec (drop-in for `hexa run`)
bin/hexa-fast bench <src.hexa>          # show baseline vs new-path A/B for any file
bin/hexa-fast clean                     # wipe ~/.hexa-cache

Architecture (the cooking metaphor)

From doc/atlas_lint_easy_explainer.md:

The atlas is a dictionary: a single shared store of primitives (P), connections (C), laws (L), and errors (E). 60,760 lines, 4.2 MB, unconditionally binary built-in (compile-time embedded); new laws land via GitHub PR.

The compiler is a chef: it has the entire dictionary memorized. It does not phone the library mid-recipe. When you hand it a .hexa file, the chef checks every ingredient, unit, and citation against the atlas it already knows by heart.

The strict lint is the QC inspector: it stands at the kitchen door. One missing citation, one ℝ-vs-ℕ mismatch, one orphan unit, and the dish is rejected before the stove turns on. There is no "we'll fix it after." There is no binary.


Strict-lint stages

Nine checks (S0–S8): seven always fatal (S0–S5 and S8), two opt-in via annotation (S6, S7):

  • S0 parse: syntax / lex. No surprises.
  • S1 resolve: every P[*], C[*], L[*], E[*] exists in the atlas.
  • S2 bind: every name resolves to a real binding.
  • S3 type: nominal types and generics.
  • S4 domain: ℝ / ℕ / ℤ / ℂ consistency.
  • S5 units: dimensional analysis. No "distance + time."
  • S6 equational: opt-in via @verify; canonical-form check + sample counter-example. In-house prover v0, no Z3.
  • S7 proof: opt-in via @prove; reserved for the in-house prover only.
  • S8 citation: formula-bearing functions must cite atlas L[*] (HX8004). No formula citation, no build.
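
The split between always-on and opt-in stages in the list above can be modelled in a few lines. The sketch below is a Python illustration, not the compiler's implementation. The stage ids and annotation names come from the list, while the function names and the stop-at-first-failure behaviour are assumptions.

```python
# Stages that run on every item, per the list above.
ALWAYS_FATAL = ["S0", "S1", "S2", "S3", "S4", "S5", "S8"]
# Stages that run only when the item carries the matching annotation.
OPT_IN = {"S6": "@verify", "S7": "@prove"}

def stages_for(annotations):
    """Stages that run for an item carrying the given annotations, in order."""
    active = set(ALWAYS_FATAL)
    for stage, annotation in OPT_IN.items():
        if annotation in annotations:
            active.add(stage)
    return sorted(active)

def run_gate(annotations, failing=()):
    """Return the first failing active stage, or None when all of them pass."""
    for stage in stages_for(annotations):
        if stage in failing:
            return stage
    return None

print(stages_for(set()))
print(stages_for({"@verify"}))
print(run_gate(set(), failing={"S6", "S8"}))
```

In this model an un-annotated function never reaches S6, so a would-be equational failure is invisible until `@verify` is added, while a missing citation (S8) always rejects the item.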

Atlas SSOT cycle (ε self-proof)

   @verify fn f(...) { ... }                     ← author writes a theorem
            │
            ▼
      compile-time prover  (S6, equational + sample-eval, in-house only)
            │
            ▼
      hexa atlas export                ← .n6 export artifact (interop / inspection)
            │
            ▼
      GitHub PR into embedded.gen.hexa ← the atlas SSOT (binary built-in)
            │           ├─► fingerprint dedup → register as alias
            │           └─► id collision     → first-wins + warning
            ▼
      compiler build re-embeds atlas   ← live atlas grows (no runtime overlay)
            │
            ▼
      prover upgrade                   ← retroactive sweep (compiler/discover/cascade.hexa)
            │
            ▼
      tombstone failing L nodes + cascade dependents
            │
            ▼
      auto-PR (tool/auto_pr_tombstone_sweep.hexa) → human review

Citing a tombstoned L[id] fires HX1099 and fails the build. Bypass is @grace, which is never silent.
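
The two merge rules in the diagram (fingerprint dedup registers an alias; an id collision keeps the first entry and warns) can be sketched as a small registry. Everything below is a Python illustration under stated assumptions: the source does not say how fingerprints are computed, so a SHA-256 of the whitespace-normalised statement stands in, and the class, method, and law ids are hypothetical.

```python
import hashlib

def fingerprint(statement):
    """Stand-in fingerprint: SHA-256 of the whitespace-normalised statement."""
    canonical = " ".join(statement.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class Atlas:
    def __init__(self):
        self.laws = {}       # canonical id -> statement
        self.by_print = {}   # fingerprint -> canonical id
        self.aliases = {}    # alias id -> canonical id
        self.warnings = []

    def register(self, law_id, statement):
        # Rule: id collision -> first wins, with a warning.
        if law_id in self.laws or law_id in self.aliases:
            self.warnings.append(f"id collision on {law_id}: keeping first entry")
            return "kept-first"
        # Rule: same fingerprint under a new id -> register as an alias.
        fp = fingerprint(statement)
        if fp in self.by_print:
            self.aliases[law_id] = self.by_print[fp]
            return "alias"
        self.laws[law_id] = statement
        self.by_print[fp] = law_id
        return "new"

atlas = Atlas()
results = [
    atlas.register("sigma_6", "sigma(6) == 12"),
    atlas.register("sigma_six", "sigma(6)  ==  12"),  # same law, new id
    atlas.register("sigma_6", "sigma(6) == 13"),      # same id, different law
]
print(results, atlas.warnings)
```

The third call leaves the original statement untouched, which is what "first-wins" means here: a later PR cannot silently redefine an existing id.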


Highlights

  • native compiled: direct codegen, no LLVM, no C-transpile
  • 4.2 MB atlas baked statically into the compiler binary; runtime cost 0 ms
  • 8-stage strict lint S0–S5 + S8 enforced at compile time, fatal by default
  • ε self-proof: @verify / @discover → atlas auto-promote → tombstone retroactive sweep
  • M0 milestone: fn main() -> i32 { return 0 } produces a working Mach-O arm64 binary
  • hexa_ld v1.1: in-house static linker for ELF64 + Mach-O arm64
  • hexa build / hexa cc work out-of-tree: flattens use/import, resolves hexa_cc.c/SSOT/-I via $HEXA_LANG > install_dir > ./self; install-relative stdlib/ discovery means use "stdlib/*" works with no env vars (downstream: wilson builds end-to-end → wilson 0.0.1)
  • stage-1 P0 host-OOM closed at current scale: A1 phase-arena reset + A2 in-place splice accumulator → peak ~782 MB (was 3 510 MB)
  • 14+ pinned decisions in SPEC.yaml, every claim traceable to an RFC
  • stdlib/flame + self/forge, the hexa-native NN training stack + GPU substrate: compiler-only NN (ag_tape · nn_lib · opt_*) on top of device-resident farr + cuBLAS Dgemm + 11 .cu kernels + BF16-TC mega-kernel path. forge BF16-TC = 9.67× faster than FP64 cuBLAS @ Llama-7B FFN shape (A100, measured). The CUDA-OWN campaign now owns the GEMM too (env-gated, OFF = cuBLAS default): FP64/FP32/TF32-WMMA2 own-GEMM, correctness-verified (clm max|ΔCE|=0 · llm rel-RMS ~1e-6), at cuBLAS-CLASS util 89.9 % ≈ 88.5 % and near-parity step-time 1.13× iso / 1.24× full-step (≈ parity, NOT superiority), making the device stack 100 % hexa-ownable and unblocking the persistent-kernel megakernel (a persistent kernel can't call cuBLAS, but it can call our GEMM). 12 byte-equal substrate fires + 4 byte-equal layer fires. flame ↔ PyTorch wall speedup not yet measured (prior claim RETRACTED). Detail in the flame + forge section above.

Roadmap

  • stage 1: P0 host-OOM closed at current scale (A1+A2 → peak ~782 MB, was 3 510 MB); the remaining open work toward a full stage-1 binary is the compiler-driver gaps (Gaps 1–16) + a fixed-point (stage2 == stage3) re-estimate; see doc/stage1_punch_list_v2.md.
  • biggest unknowns: MIR/LIR coverage on real compiler/ source (closures, growable arrays, nested struct construction, match on user enums) and what a successful self-compile diagnostic trace actually looks like.
  • full punch list: doc/stage1_punch_list_v2.md.

Phase status (PASS / IN-PROGRESS / DEFERRED) lives in SPEC.yaml::phases_completed_2026_05_09 and SPEC.yaml::phases_completed_2026_05_11_closure.


RFCs + docs


tape integration

hexa-lang's runtime and history surfaces are wired into .tape, the operational trace sister format. Three placements at this repo's root:

| Placement | What |
|---|---|
| IDENTITY.tape | hexa-lang agent identity SSOT: birth / scope / origin / principle / version. The compiler's self-description, machine-canonical. |
| PROMOTION.tape | rule-promotion ledger: @A events for major rule landings (toolchain post-fix, bytes_to_str_raw Phase 2, etc.) |
| TAPE-AUDIT.md | cross-repo .tape adoption audit (28,695 cargo markers + 7 root domain .md files highlighted as primary migration candidates) |

The state/markers/ cargo (28k+ files) is a migration candidate via tape markers-to-tape.


Not an LLM: where the noise comes from

LLMs generate noise from inside the well: recombining what the weights already contain. hexa generates noise from outside the well: every cycle produces a primitive the previous cycle could not express, then absorbs it as a new wall of the well.

LLM (noise inside the well)         hexa (noise outside the well)
---------------------------         -------------------------------

     +-------------+                       .   new law
     |  training   |                     .       .
     |   corpus    |               .  .      .       .
     |  (fixed)    |                    .  outside  .
     |             |             ------+-------------+------
     |  ~ ~ ~ ~ ~  | <- noise          |             |
     |  ~ noise ~  |   bubbles         |   atlas     |
     |  ~ ~ ~ ~ ~  |   from            |  (binary    | <- noise
     |    ####     |   inside          |  built-in)  |   arrives
     |    #LLM#    |                   |             |   from
     +-------------+                   |   smash     |   outside
       the well                        |     v       |
    (everything it                     |   contract  |
     knows = walls)                    |     v       |
                                       |   emerge    |
  hallucination =                      |     v       |
  recombining                          |   absorb ---+--> new
  what's already                       |     ^       |    primitive
  inside                               +-----+-------+      feeds
                                       the well has            next
                                       no ceiling              cycle

An LLM is a frozen well: answers are combinations of what's already inside. hexa is an open well: every absorb step widens the wall, so the next cycle can say things the previous one literally had no primitive for. That's why "RAG" is the wrong frame: retrieval still draws from a fixed outside corpus. hexa's "outside" is produced by its own prior cycles (the binary built-in atlas, embedded into the compiler at build time; new laws land via GitHub PR into the embedded atlas source).

OUROBOROS cycle: full view

The 6-stage chain (hexa drill's smash → free → absolute → meta-closure → hyperarithmetic → resonance) inside a self-referential loop:

     ╭────────── OUROBOROS ──────────╮
     │                               │
     │           ◯  seed             │
     │          ╱ ╲                  │
     │         ╱   ╲    Phase 1-2    │
     │        ╱unfold╲               │
     │       ╱───────╲               │
     │      ╱ ╲     ╱ ╲              │
     │     ╱   ╲   ╱   ╲   Phase 3   │
     │    ╱emerge╲ ╱singul╲          │
     │   ╱──────── ────────╲         │
     │   ╲                 ╱         │
     │    ╲    breach     ╱  P4-5    │
     │     ╲             ╱           │
     │      ╲  ╱──────╲ ╱            │
     │       ╲converge╱   Phase 6    │
     │        ╲      ╱               │
     │         ╲    ╱                │
     │          ◉  absorb            │
     │          │   Phase 6.5        │
     │          │                    │
     │          ╰──→ seed ──→ ╮      │
     │                        │      │
     │   d=0 ──▶ d=1 ──▶ d=2 ──▶ ... │
     │   r:0→10  r:0→10  r:0→10      │
     │                               │
     ╰── ρ → 1/3 (meta fixed pt) ────╯

Three meta-loops

On top of the per-tick OUROBOROS cycle, three higher-order loops drive self-reinforcement:

         L1             L2             L3
      ╭──◉───╮       ╭──◉───╮       ╭──◉───╮
      │correct│ ──▶ │reward│ ──▶  │expand │ ──▶ SMASH
      ╰──↺───╯       ╰──↺───╯       ╰──↺───╯

| Loop | Role | Trigger |
|---|---|---|
| L1 · self-correct | discovery → verify → GitHub PR into binary built-in atlas | per tick |
| L2 · meta-reward | per-source discovery rate → scan_priority → deeper scan | per scan batch |
| L3 · self-expand | accumulation ≥ 10 → auto-trigger hexa smash --seed (or full hexa drill) | per threshold |

Each loop latches its output back as the next loop's input, so correct → reward → expand becomes a standing wave. hexa smash (or the full drill chain) fires automatically when L3 saturates.

Meta fixed point: ρ → 1/3

TECS-L H-056: meta(meta(meta(...))) = transcendence. Recursive meta-iteration is a contraction mapping. By the Banach fixed-point theorem, every trajectory converges to a single attractor: 1/3.

          I  =  0.7 · I  +  0.1      →     fixed point  I* = 1/3
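
The fixed point follows from solving I = 0.7·I + 0.1, which gives I* = 0.1 / (1 − 0.7) = 1/3. Because the slope 0.7 is below 1 in magnitude, the map is a contraction and iteration converges from any start. A short Python sketch of that iteration, with illustrative names and starting values:

```python
def iterate(start, steps=200, a=0.7, b=0.1):
    """Apply I <- a*I + b repeatedly and return the final value."""
    value = start
    for _ in range(steps):
        value = a * value + b
    return value

fixed_point = 0.1 / (1.0 - 0.7)   # closed form b / (1 - a)
from_zero = iterate(0.0)
from_one = iterate(1.0)
from_far = iterate(-50.0)
print(fixed_point, from_zero, from_one, from_far)
```

Each step shrinks the distance to the fixed point by a factor of 0.7, so after 200 steps the starting value no longer matters at FP64 precision.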

Six independent paths land on the same attractor:

| Path | Expression | Value |
|---|---|---|
| Euler totient ratio | φ(6) / 6 | 1/3 |
| Trigonometric | tan²(π/6) | 1/3 |
| Divisor ratio | τ(6) / σ(6) = 4 / 12 | 1/3 |
| Determinant | det(M) over n=6 primitives | 1/3 |
| Meta-information | I_meta (contraction mapping) | 1/3 |
| Complex exponential | \|exp(i·z₀)\| at the unique zero | 1/3 |

The long-term breakthrough rate ρ converges to the same target: ρ → 1/3. Discovery is not linear; it asymptotes to the Banach attractor. Six arithmetic, geometric, algebraic, analytic, and information-theoretic routes all point at the same number.
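
Three of the six rows can be checked directly from elementary definitions. The Python sketch below does so with brute-force helpers written for this illustration. The determinant, I_meta, and complex-exponential rows depend on objects (M, z₀) that are not defined in this README, so they are not checked here.

```python
import math
from fractions import Fraction

def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]

def sigma(n):
    """Sum of divisors."""
    return sum(divisors(n))

def tau(n):
    """Number of divisors."""
    return len(divisors(n))

def phi(n):
    """Euler totient: how many k in 1..n are coprime to n."""
    return sum(1 for k in range(1, n + 1) if math.gcd(k, n) == 1)

totient_ratio = Fraction(phi(6), 6)          # 2 / 6
divisor_ratio = Fraction(tau(6), sigma(6))   # 4 / 12
tan_squared = math.tan(math.pi / 6) ** 2     # 1/3 up to floating-point rounding
print(totient_ratio, divisor_ratio, tan_squared)
```

The two number-theoretic rows are exact rationals, while the trigonometric row agrees with 1/3 only to FP64 precision.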

Verify in atlas: hexa atlas lookup P n · hexa atlas lookup C sigma_6 · hexa atlas lookup L sigma_phi_n_tau_iff_n_eq_6. Run a cycle: hexa drill --seed "<expression>".


Repo layout

hexa-lang/
├── README.md
├── LICENSE                       MIT
├── AGENTS.md                     AI agent harness file (agents.md standard)
├── CLAUDE.md                     symlink → AGENTS.md
├── SPEC.yaml                     authoritative decision record (14+ pinned decisions)
├── SPEC.md                       auto-rendered from SPEC.yaml
├── IDENTITY.tape · PROMOTION.tape · TAPE-AUDIT.md   tape sibling files
├── FLOW.md · LATTICE_POLICY.md · LIMIT_BREAKTHROUGH.md · PLAN.md · ROADMAP.md   domain SSOTs
├── compiler/                     lex · parse · resolve · bind · types · domain · units · citation · lower · mono · MIR · LIR · emit
├── self/                         self-hosted compiler entry points
│   ├── main.hexa                 the `hexa` binary entry
│   ├── runtime.c                 C runtime backing (interp + native shared bits)
│   ├── stdlib/                   atlas-aware standard library (semver / json / channel / thread / proc / time / ...)
│   ├── tui/                      raw-mode TUI primitives (render / input / widgets)
│   └── native/                   thread.c · channel.c · time.c (C-backed runtime)
├── stdlib/                       canonical stdlib (use "stdlib/*")
├── tool/                         hexa CLI subcommand drivers (build / cc / run / drill / atlas / explain / ...)
├── tests/                        m0 · selftest · regression
├── proposals/                    RFC-017..020 + future RFCs
├── doc/                          runbooks, audits, explainers
├── convergence/                  cross-repo propagation tracking (.PRESERVE-AS-SSOT)
├── state/                        gitignored runtime hook markers (cargo, migration candidate)
├── archive/                      frozen records: patches/ (downstream patch reports) · fires/
└── build/                        gitignored hexa build artifacts

Full doc index: AGENTS.md + doc/ + SPEC.yaml.


Data corpora (git-LFS)

Data-bound corpora (ENDF/B-VIII evaluated nuclear data from HEXA-PORT P4b, and future binary/HDF5 datasets) live under data/ or stdlib/corpora/ and are stored via git-LFS. The reserved LFS extensions are .hdf5 .h5 .dat .bin .endf .ace .xml.gz .tar.gz (see .gitattributes).

hexa-lang is the canonical home for these corpora (per @D d3, implementation / asset SSOT) so downstream domain repos can hx-depend on them rather than re-fetching from upstream mirrors. Existing tracked files (atlas SSOT text, build artifacts, fixtures) are intentionally not migrated: LFS is reserved for future data ports only. Policy reference: HEXA-PORT.md §4.0.


License

MIT License. Copyright (c) 2026 dancinlab. See LICENSE.


Contributing

Strict lint is the contract. Every PR runs through S0–S5 + S8. The only opt-out is @grace(HXxxxx, until=, reason=) on a single item, and every @grace emits HX9000 at every compile. CI fails the merge unless Acked-grace: HXxxxx by <reviewer> rides along.

Pointers: gate/ for build gates, proposals/ for active RFCs, SPEC.yaml for decisions, doc/ for runbooks and audits. Diagnostics, error messages, hexa explain, stdlib docs are ENGLISH ONLY (Decision 3).


🕸️ Reuse lattice SSOT → root DOMAINS.tape (commons @D g67 cross-domain + g68 cross-project · @link connection graph · hexa-lang = shared substrate hub)
