transformerless_lm: substrate-native compressed forward path

claude · claude · commit 1a892e4af989 · 2026-05-21T01:21:29.000Z
Per the user's observation that materializing W and then running a
matmul wastes the substrate structure of the compressed weights -- "if
data is compressed in a form then the reverse must be true to
uncompress" -- FibGenLinear now defaults to a compressed forward that
computes y = W*x WITHOUT ever building the [out,in] tensor:

  x_cos = x @ cos_j         # [B, T, K]  -- project x into Fibonacci basis
  x_sin = x @ sin_j         # [B, T, K]
  y_cos = (a * x_cos) + (c * x_sin)    # or matmul for cross mode
  y_sin = (b * x_cos) + (d * x_sin)
  y     = y_cos @ cos_i.T + y_sin @ sin_i.T   # project back

Cost: O(B*T*K*(in+out)) per layer, plus O(B*T*K*K) for the cross-mode
mixing step. The materialize-then-matmul cost is
O(B*T*in*out + K*in*out). At d=4096 / K=32 the compressed path is ~64x
cheaper; at d=128 / K=32 it's ~2x cheaper in theory but SLOWER in
practice. The likely reason is that the compressed path issues a chain
of small matmuls whose per-kernel launch overhead dominates at small d,
while the materialized path runs one large matmul that PyTorch's
optimized kernels amortize well.
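
The theoretical ratios quoted above can be reproduced with a few lines
of arithmetic. This is only a sketch of the big-O expressions with
constants dropped; the function names are hypothetical and are not part
of models_fibgen.py.

```python
# Cost model for the two forward paths, constants dropped as in the big-O text.
def compressed_cost(bt, d_in, d_out, k):
    """O(B*T*K*(in+out)): project x into K dims, then project back."""
    return bt * k * (d_in + d_out)


def materialize_cost(bt, d_in, d_out, k):
    """O(B*T*in*out + K*in*out): dense matmul plus rebuilding W from the seed."""
    return bt * d_in * d_out + k * d_in * d_out


def theoretical_speedup(d, k=32, batch=32, seq=128):
    """Ratio materialize/compressed for a square layer (in = out = d)."""
    bt = batch * seq
    return materialize_cost(bt, d, d, k) / compressed_cost(bt, d, d, k)


if __name__ == "__main__":
    for d in (128, 256, 512, 1024, 4096):
        print(f"d={d:5d}  materialize/compressed = {theoretical_speedup(d):.1f}x")
```

Note that the compressed path runs the cos and sin branches as separate
matmuls (two projections in, two back). Counting those doubles the
compressed cost, so the real theoretical edge at d=128 is closer to 1x
than 2x.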

Wall-clock measurement (b=32, seq=128, K=32 cross):
   d=128:   compressed 2.72 ms,  materialize 0.77 ms  (3.5x slower)
   d=256:   compressed 4.75 ms,  materialize 1.68 ms  (2.8x slower)
   d=512:   compressed 9.48 ms,  materialize 4.95 ms  (1.9x slower)
   d=1024:  compressed 19.85 ms, materialize 19.65 ms (1.01x slower, break-even)

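For LLM-scale deployment (d > 1024) the compressed forward should beat
materialize-then-matmul, extrapolating from the trend above; d=1024
itself is break-even. At training scale (d <= 512) materialize is
faster. The cached_W path (deployment-time precompute) still exists and
remains the fastest option at the sizes measured above, since it is the
same fp32 matmul as the materialize path without the seed recompute.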

CRITICAL CAVEAT: the two paths are numerically equivalent (max diff
1e-7, i.e. fp32 rounding noise), so quality / extrapolation is unchanged.
The substrate-native compute is a DEPLOYMENT efficiency win at large
d_model, not a quality win at small d_model. To assess "is FibGen
output usable" we still need scale, not just a different forward path.
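
The equivalence of the two paths in separable mode can be checked on a
toy layer without PyTorch. The sketch below is pure Python and runs in
float64, so its max diff is far below the fp32 figure above. fib_basis
is a hypothetical stand-in for the real cos_i / sin_i / cos_j / sin_j
tables, whose construction is not shown in this commit; the identity
holds for any basis.

```python
import math
import random


def fib_basis(n, k):
    """Hypothetical [n, k] cos/sin tables at Fibonacci frequencies (an assumption)."""
    fibs = [1, 2]
    while len(fibs) < k:
        fibs.append(fibs[-1] + fibs[-2])
    cos = [[math.cos(2 * math.pi * fibs[q] * i / n) for q in range(k)] for i in range(n)]
    sin = [[math.sin(2 * math.pi * fibs[q] * i / n) for q in range(k)] for i in range(n)]
    return cos, sin


def dense_forward(x, cos_i, sin_i, cos_j, sin_j, seed):
    """Materialize W [out, in] from the separable seed, then compute y = W @ x."""
    n_out, n_in = len(cos_i), len(cos_j)
    W = [[sum(a * cos_i[i][q] * cos_j[j][q] + b * sin_i[i][q] * cos_j[j][q]
              + c * cos_i[i][q] * sin_j[j][q] + d * sin_i[i][q] * sin_j[j][q]
              for q, (a, b, c, d) in enumerate(seed))
          for j in range(n_in)] for i in range(n_out)]
    return [sum(W[i][j] * x[j] for j in range(n_in)) for i in range(n_out)]


def compressed_forward(x, cos_i, sin_i, cos_j, sin_j, seed):
    """Same y, but only K-dim intermediates: project, mix, project back."""
    k, n_in = len(seed), len(x)
    x_cos = [sum(x[j] * cos_j[j][q] for j in range(n_in)) for q in range(k)]
    x_sin = [sum(x[j] * sin_j[j][q] for j in range(n_in)) for q in range(k)]
    y_cos = [seed[q][0] * x_cos[q] + seed[q][2] * x_sin[q] for q in range(k)]
    y_sin = [seed[q][1] * x_cos[q] + seed[q][3] * x_sin[q] for q in range(k)]
    return [sum(cos_i[i][q] * y_cos[q] + sin_i[i][q] * y_sin[q] for q in range(k))
            for i in range(len(cos_i))]


random.seed(0)
K, N_IN, N_OUT = 4, 8, 6
cos_i, sin_i = fib_basis(N_OUT, K)
cos_j, sin_j = fib_basis(N_IN, K)
seed = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(K)]  # columns: a, b, c, d
x = [random.uniform(-1, 1) for _ in range(N_IN)]

y_dense = dense_forward(x, cos_i, sin_i, cos_j, sin_j, seed)
y_comp = compressed_forward(x, cos_i, sin_i, cos_j, sin_j, seed)
max_diff = max(abs(p - q) for p, q in zip(y_dense, y_comp))
print(f"max |y_dense - y_compressed| = {max_diff:.2e}")
```

The coefficient order (a: cos/cos, b: sin/cos, c: cos/sin, d: sin/sin,
output side first) follows the comments in _forward_compressed below.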
diff --git a/experiments/transformerless_lm/models_fibgen.py b/experiments/transformerless_lm/models_fibgen.py
@@ -151,9 +151,62 @@ def generate_W(self) -> torch.Tensor:
             return cached
         return self._compute_W()
 
+    def _forward_compressed(self, x: torch.Tensor) -> torch.Tensor:
+        """Substrate-native forward: compute y = W·x WITHOUT materializing W.
+
+        For the SEPARABLE basis,
+            W = Σ_k a_k cos_i[:,k] cos_j[:,k]^T + ... (4 cos/sin combos)
+        and y = W @ x decomposes as
+            y_i = Σ_k cos_i[i,k] · ( a_k · (cos_j[:,k]^T · x) )
+                + ... three more terms
+        — a K-step "Fourier-in-the-Fibonacci-basis" pass with no [out,in]
+        tensor materialized. Cost: O(B·T·K·(in+out)) instead of O(B·T·in·out).
+
+        For the CROSS basis the inner term is a K×K matmul on the
+        K-dim projected x, then projected back.
+        """
+        # x: [B, T, in_features]
+        if self.mode == "separable":
+            a, b, c, d = self.seed[:, 0], self.seed[:, 1], self.seed[:, 2], self.seed[:, 3]
+            # Project x into Fibonacci-basis along input axis: [B, T, K]
+            x_cos = x @ self.cos_j                        # [B, T, K]
+            x_sin = x @ self.sin_j                        # [B, T, K]
+            # Inner separable mixing (Hadamard product with coefficients)
+            #   cc term contributes cos_i[i,k] · a_k · x_cos[k]
+            #   sc term contributes sin_i[i,k] · b_k · x_cos[k]
+            #   cs term contributes cos_i[i,k] · c_k · x_sin[k]
+            #   ss term contributes sin_i[i,k] · d_k · x_sin[k]
+            y_cos = (a * x_cos) + (c * x_sin)              # [B, T, K]
+            y_sin = (b * x_cos) + (d * x_sin)
+            # Project K-dim mixed signal back to output axis
+            y = y_cos @ self.cos_i.t() + y_sin @ self.sin_i.t()   # [B, T, out]
+            if self.bias is not None:
+                y = y + self.bias
+            return y
+        # cross mode: seed [K, K, 4] mixing matrix
+        K = self.K
+        seed = self.seed.view(K, K, 4)
+        a, b, c, d = seed[..., 0], seed[..., 1], seed[..., 2], seed[..., 3]
+        x_cos = x @ self.cos_j                            # [B, T, K]
+        x_sin = x @ self.sin_j
+        # K×K mixing in seed space:
+        #   y_cos = a · x_cos + c · x_sin   (cos-side mixing)
+        #   y_sin = b · x_cos + d · x_sin   (sin-side mixing)
+        y_cos = x_cos @ a.t() + x_sin @ c.t()             # [B, T, K]
+        y_sin = x_cos @ b.t() + x_sin @ d.t()
+        y = y_cos @ self.cos_i.t() + y_sin @ self.sin_i.t()
+        if self.bias is not None:
+            y = y + self.bias
+        return y
+
     def forward(self, x: torch.Tensor) -> torch.Tensor:
-        W = self.generate_W()
-        return F.linear(x, W, self.bias)
+        # If we cached the dense W (deployment mode), use the materialized
+        # matmul. Otherwise compute in the Fibonacci basis directly — no
+        # W materialization — which is the substrate-native compute path.
+        cached = getattr(self, "_cached_W", None)
+        if cached is not None:
+            return F.linear(x, cached, self.bias)
+        return self._forward_compressed(x)
 
     @property
     def n_stored_params(self) -> int: