Commit 1a892e4

transformerless_lm: substrate-native compressed forward path
Per the user's observation that materializing W and then running a matmul wastes the substrate structure of the compressed weights ("if data is compressed in a form then the reverse must be true to uncompress"), FibGenLinear now defaults to a compressed forward that computes y = W*x WITHOUT ever building the [out,in] tensor:

    x_cos = x @ cos_j                          # [B, T, K], project x into Fibonacci basis
    x_sin = x @ sin_j                          # [B, T, K]
    y_cos = (a * x_cos) + (c * x_sin)          # or matmul for cross mode
    y_sin = (b * x_cos) + (d * x_sin)
    y     = y_cos @ cos_i.T + y_sin @ sin_i.T  # project back

Cost: O(B*T*K*(in+out)) per layer. The materialize-then-matmul cost is O(B*T*in*out + K*in*out). At d=4096 / K=32 the compressed path is ~64x cheaper; at d=128 / K=32 it's ~2x cheaper in theory but SLOWER in practice because PyTorch's optimized matmul kernels amortize their kernel-launch overhead better than my multi-matmul chain.

Wall-clock measurement (b=32, seq=128, K=32 cross):

    d=128:   compressed  2.72 ms, materialize  0.77 ms (3.5x slower)
    d=256:   compressed  4.75 ms, materialize  1.68 ms (2.8x slower)
    d=512:   compressed  9.48 ms, materialize  4.95 ms (1.9x slower)
    d=1024:  compressed 19.85 ms, materialize 19.65 ms (1.01x, roughly break-even)

d=1024 is the break-even point in these measurements. For LLM-scale deployment above that size the compressed forward should win, going by the cost model; no wall-clock numbers beyond d=1024 are listed here. At training scale (d<=512) materialize is faster. The cached_W path (deployment-time precompute) still exists and is the fastest at any scale because it uses fp32 matmul without seed recompute.

CRITICAL CAVEAT: numerically the two paths produce the same y (max diff 1e-7, i.e. fp32 noise), so quality / extrapolation is unchanged. The substrate-native compute is a DEPLOYMENT efficiency win at large d_model, not a quality win at small d_model. To assess "is FibGen output usable" we still need scale, not just a different forward path.
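
The theoretical ratios quoted in the message can be reproduced with a small cost model. This is only a sketch: it plugs the two big-O expressions from the message into plain arithmetic, ignores constant factors and the K×K mixing step of cross mode, and the function names (`compressed_cost`, `materialize_cost`, `speedup`) are hypothetical helpers, not part of the repository.

```python
def compressed_cost(bt, d_in, d_out, k):
    """Multiply-add count for the compressed path: O(B*T*K*(in+out))."""
    return bt * k * (d_in + d_out)


def materialize_cost(bt, d_in, d_out, k):
    """Multiply-add count for build-W-then-matmul: O(B*T*in*out + K*in*out)."""
    return bt * d_in * d_out + k * d_in * d_out


def speedup(d, k=32, batch=32, seq=128):
    """Theoretical materialize/compressed ratio for a square d x d layer."""
    bt = batch * seq  # number of tokens per forward pass
    return materialize_cost(bt, d, d, k) / compressed_cost(bt, d, d, k)


if __name__ == "__main__":
    # Same batch, sequence length and K as the wall-clock table.
    for d in (128, 256, 512, 1024, 4096):
        print(f"d={d}: theoretical ratio {speedup(d):.2f}x")
```

The ratio simplifies to roughly d / (2K), which is where the ~2x (d=128) and ~64x (d=4096) figures come from. The wall-clock table shows that this model says nothing about kernel-launch overhead, so it should be read as an upper bound on the achievable gain, not a prediction.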
1 parent 3df3aac commit 1a892e4

1 file changed

Lines changed: 55 additions & 2 deletions

experiments/transformerless_lm/models_fibgen.py

@@ -151,9 +151,62 @@ def generate_W(self) -> torch.Tensor:
             return cached
         return self._compute_W()
 
+    def _forward_compressed(self, x: torch.Tensor) -> torch.Tensor:
+        """Substrate-native forward: compute y = W·x WITHOUT materializing W.
+
+        For the SEPARABLE basis,
+            W = Σ_k a_k cos_i[:,k] cos_j[:,k]^T + ... (4 sign combos)
+        and y = W @ x decomposes as
+            y_i = Σ_k cos_i[i,k] · ( a_k · (cos_j[:,k]^T · x) )
+                  + ... three more terms
+        -- a K-step "Fourier-in-the-Fibonacci-basis" pass with no [out,in]
+        tensor materialized. Cost: O(B·T·K·(in+out)) instead of O(B·T·in·out).
+
+        For the CROSS basis the inner term is a K×K matmul on the
+        K-dim projected x, then projected back.
+        """
+        # x: [B, T, in_features]
+        if self.mode == "separable":
+            a, b, c, d = self.seed[:, 0], self.seed[:, 1], self.seed[:, 2], self.seed[:, 3]
+            # Project x into Fibonacci-basis along input axis: [B, T, K]
+            x_cos = x @ self.cos_j  # [B, T, K]
+            x_sin = x @ self.sin_j  # [B, T, K]
+            # Inner separable mixing (Hadamard product with coefficients)
+            # cc term contributes cos_i[i,k] · a_k · x_cos[k]
+            # sc term contributes sin_i[i,k] · b_k · x_cos[k]
+            # cs term contributes cos_i[i,k] · c_k · x_sin[k]
+            # ss term contributes sin_i[i,k] · d_k · x_sin[k]
+            y_cos = (a * x_cos) + (c * x_sin)  # [B, T, K]
+            y_sin = (b * x_cos) + (d * x_sin)
+            # Project K-dim mixed signal back to output axis
+            y = y_cos @ self.cos_i.t() + y_sin @ self.sin_i.t()  # [B, T, out]
+            if self.bias is not None:
+                y = y + self.bias
+            return y
+        # cross mode: seed [K, K, 4] mixing matrix
+        K = self.K
+        seed = self.seed.view(K, K, 4)
+        a, b, c, d = seed[..., 0], seed[..., 1], seed[..., 2], seed[..., 3]
+        x_cos = x @ self.cos_j  # [B, T, K]
+        x_sin = x @ self.sin_j
+        # K×K mixing in seed space:
+        # y_cos = a · x_cos + c · x_sin (cos-side mixing)
+        # y_sin = b · x_cos + d · x_sin (sin-side mixing)
+        y_cos = x_cos @ a.t() + x_sin @ c.t()  # [B, T, K]
+        y_sin = x_cos @ b.t() + x_sin @ d.t()
+        y = y_cos @ self.cos_i.t() + y_sin @ self.sin_i.t()
+        if self.bias is not None:
+            y = y + self.bias
+        return y
+
     def forward(self, x: torch.Tensor) -> torch.Tensor:
-        W = self.generate_W()
-        return F.linear(x, W, self.bias)
+        # If we cached the dense W (deployment mode), use the materialized
+        # matmul. Otherwise compute in the Fibonacci basis directly, with no
+        # W materialization, which is the substrate-native compute path.
+        cached = getattr(self, "_cached_W", None)
+        if cached is not None:
+            return F.linear(x, cached, self.bias)
+        return self._forward_compressed(x)
 
     @property
     def n_stored_params(self) -> int:
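
The separable-mode identity behind `_forward_compressed` can be checked without PyTorch. The sketch below is a dependency-free illustration using plain Python lists, under two loudly stated assumptions: the dense weight has the form W = cos_i·diag(a)·cos_jᵀ + sin_i·diag(b)·cos_jᵀ + cos_i·diag(c)·sin_jᵀ + sin_i·diag(d)·sin_jᵀ (inferred from the docstring and the cc/sc/cs/ss comments, not from `_compute_W` itself), and random matrices stand in for the real Fibonacci basis, since the identity holds for any basis. All helper names are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(A, B):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale_cols(A, coeffs):
    """Multiply column k of A by coeffs[k] (the Hadamard step in seed space)."""
    return [[v * c for v, c in zip(row, coeffs)] for row in A]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dense_W(cos_i, sin_i, cos_j, sin_j, a, b, c, d):
    """Assumed dense form of W, shape [out, in] (see the lead-in)."""
    cos_jT, sin_jT = transpose(cos_j), transpose(sin_j)
    W = matmul(scale_cols(cos_i, a), cos_jT)          # cc term
    W = add(W, matmul(scale_cols(sin_i, b), cos_jT))  # sc term
    W = add(W, matmul(scale_cols(cos_i, c), sin_jT))  # cs term
    W = add(W, matmul(scale_cols(sin_i, d), sin_jT))  # ss term
    return W


def forward_compressed(x, cos_i, sin_i, cos_j, sin_j, a, b, c, d):
    """Same steps as the separable branch of the diff, without bias."""
    x_cos = matmul(x, cos_j)  # [N, K]
    x_sin = matmul(x, sin_j)  # [N, K]
    y_cos = add(scale_cols(x_cos, a), scale_cols(x_sin, c))
    y_sin = add(scale_cols(x_cos, b), scale_cols(x_sin, d))
    return add(matmul(y_cos, transpose(cos_i)), matmul(y_sin, transpose(sin_i)))


def max_abs_diff(n_tokens=5, d_in=7, d_out=6, k=3, seed=0):
    """Largest elementwise gap between the dense and compressed outputs."""
    rng = random.Random(seed)
    cos_i, sin_i = rand_matrix(d_out, k, rng), rand_matrix(d_out, k, rng)
    cos_j, sin_j = rand_matrix(d_in, k, rng), rand_matrix(d_in, k, rng)
    a, b, c, d = (rand_matrix(1, k, rng)[0] for _ in range(4))
    x = rand_matrix(n_tokens, d_in, rng)  # batch and time flattened into N rows
    W = dense_W(cos_i, sin_i, cos_j, sin_j, a, b, c, d)
    y_dense = matmul(x, transpose(W))  # what F.linear(x, W) computes without bias
    y_comp = forward_compressed(x, cos_i, sin_i, cos_j, sin_j, a, b, c, d)
    return max(abs(u - v) for ru, rv in zip(y_dense, y_comp) for u, v in zip(ru, rv))


if __name__ == "__main__":
    print("max |y_dense - y_compressed| =", max_abs_diff())
```

Python floats are double precision, so the gap here sits far below the 1e-7 fp32 figure in the commit message. The cross-mode branch would need the K×K coefficient matrices in place of the diagonal scaling and is not covered by this sketch.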
