Commit 1a892e4
transformerless_lm: substrate-native compressed forward path
Per the user's observation that materializing W then matmul wastes
the substrate structure of the compressed weights -- "if data is
compressed in a form then the reverse must be true to uncompress" --
FibGenLinear now defaults to a compressed forward that computes
y = W*x WITHOUT ever building the [out,in] tensor:
    x_cos = x @ cos_j                       # [B, T, K]  project x into Fibonacci basis
    x_sin = x @ sin_j                       # [B, T, K]
    y_cos = (a * x_cos) + (c * x_sin)       # or matmul for cross mode
    y_sin = (b * x_cos) + (d * x_sin)
    y = y_cos @ cos_i.T + y_sin @ sin_i.T   # project back
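A minimal NumPy sketch of why the two paths agree, for the diagonal (elementwise) mode only. It assumes the materialized weight is W = cos_i diag(a) cos_j^T + cos_i diag(c) sin_j^T + sin_i diag(b) cos_j^T + sin_i diag(d) sin_j^T, which is what the five lines above imply algebraically. The random bases, the shapes, and the function names `forward_materialize` / `forward_compressed` are placeholders for illustration, not the actual FibGenLinear code, which derives its bases from a seed.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d_in, d_out, K = 2, 5, 16, 12, 4

# Placeholder bases (assumption): random matrices stand in for the
# seed-derived Fibonacci bases, only to check the algebra.
cos_j = rng.standard_normal((d_in, K))
sin_j = rng.standard_normal((d_in, K))
cos_i = rng.standard_normal((d_out, K))
sin_i = rng.standard_normal((d_out, K))
a, b, c, d = rng.standard_normal((4, K))  # diagonal-mode coefficients, each [K]
x = rng.standard_normal((B, T, d_in))


def forward_materialize(x):
    """Build the full [d_out, d_in] weight, then do one matmul."""
    W = ((cos_i * a) @ cos_j.T + (cos_i * c) @ sin_j.T
         + (sin_i * b) @ cos_j.T + (sin_i * d) @ sin_j.T)
    return x @ W.T


def forward_compressed(x):
    """Same result without ever forming the [d_out, d_in] tensor."""
    x_cos = x @ cos_j                  # [B, T, K]
    x_sin = x @ sin_j                  # [B, T, K]
    y_cos = a * x_cos + c * x_sin      # mix inside the K-dim basis
    y_sin = b * x_cos + d * x_sin
    return y_cos @ cos_i.T + y_sin @ sin_i.T   # [B, T, d_out]


y_mat = forward_materialize(x)
y_cmp = forward_compressed(x)
max_diff = float(np.abs(y_mat - y_cmp).max())
print(y_cmp.shape, max_diff)
```

The difference comes only from floating-point summation order. In cross mode the elementwise products would presumably become [K, K] matmuls, which this sketch does not cover.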
Cost: O(B*T*K*(in+out)) per layer. The materialize-then-matmul cost
is O(B*T*in*out + K*in*out). At d=4096 / K=32 the compressed path
is ~64x cheaper; at d=128 / K=32 it's ~2x cheaper in theory but
SLOWER in practice because PyTorch's optimized matmul kernels
amortize their kernel-launch overhead better than my multi-matmul
chain.
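The theoretical ratios quoted above can be reproduced directly from the two cost expressions. The sketch below assumes in = out = d and B*T = 32*128 (the batch shape used in the wall-clock runs); the function names are made up for illustration, and constant factors are ignored, as in the O() expressions.

```python
def materialize_cost(bt, d_in, d_out, k):
    # One [bt, d_in] x [d_in, d_out] matmul plus building W from K basis terms.
    return bt * d_in * d_out + k * d_in * d_out


def compressed_cost(bt, d_in, d_out, k):
    # Project into the K-dim basis and back out: K * (d_in + d_out) per token.
    return bt * k * (d_in + d_out)


def theoretical_speedup(d, k=32, bt=32 * 128):
    # Assumes a square layer (in = out = d).
    return materialize_cost(bt, d, d, k) / compressed_cost(bt, d, d, k)


for d in (128, 256, 512, 1024, 4096):
    print(f"d={d:5d}: compressed is {theoretical_speedup(d):.1f}x cheaper in theory")
```

With these assumptions the ratio is d / (2K) times a small correction (1 + K / (B*T)), so it grows linearly in d.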
Wall-clock measurement (b=32, seq=128, K=32 cross):
    d=128:  compressed  2.72 ms, materialize  0.77 ms (3.5x slower)
    d=256:  compressed  4.75 ms, materialize  1.68 ms (2.8x slower)
    d=512:  compressed  9.48 ms, materialize  4.95 ms (1.9x slower)
    d=1024: compressed 19.85 ms, materialize 19.65 ms (~1.0x, break-even)
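As a quick check on the slowdown factors, the ratios can be recomputed from the timings listed above. The dictionary simply transcribes those numbers; the variable names are illustrative.

```python
# (compressed_ms, materialize_ms) per d_model, copied from the measurements.
measurements_ms = {
    128: (2.72, 0.77),
    256: (4.75, 1.68),
    512: (9.48, 4.95),
    1024: (19.85, 19.65),
}

# Ratio > 1 means the compressed path is slower than materialize.
slowdown = {d: comp / mat for d, (comp, mat) in measurements_ms.items()}

for d, ratio in slowdown.items():
    print(f"d={d:5d}: compressed / materialize = {ratio:.2f}")
```

The ratio shrinks at every doubling of d, which is consistent with the compressed path overtaking materialize somewhere past d=1024, though that range was not measured here.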
Break-even is at d=1024, so for LLM-scale deployment (d > 1024) the
compressed forward should win; this is extrapolated from the trend
above and not yet measured. At training scale (d<=512) materialize is
faster. The cached_W path (deployment-time precompute) still exists
and is the fastest at any scale because it is a single fp32 matmul
with no seed recompute.
CRITICAL CAVEAT: numerically the two paths produce identical y (max
diff 1e-7 = fp32 noise), so quality / extrapolation is unchanged.
The substrate-native compute is a DEPLOYMENT efficiency win at large
d_model, not a quality win at small d_model. To assess "is FibGen
output usable" we still need scale, not just a different forward path.
1 parent 3df3aac commit 1a892e4
1 file changed
Lines changed: 55 additions & 2 deletions
Diff body not preserved. Hunk @@ -151,9 +151,62 @@: new lines 154-201 added, original lines 155-156 removed, new lines 203-209 added.