A quick reference for the core mathematical operations used in LLM inference. Written for systems programmers — think of these as the "library functions" that get called thousands of times per token.
Multiply corresponding elements and sum:
dot(a, b) = a[0]*b[0] + a[1]*b[1] + ... + a[n-1]*b[n-1]
= sum_i(a[i] * b[i])
Usage: Core of attention scores (Q · K), computing similarity between vectors.
Performance: O(n), bandwidth-bound (reads 2n values, writes 1 scalar).
Each output element is a dot product of a matrix row with the input vector:
y[i] = sum_j(W[i][j] * x[j])
Example (3×4 matrix × 4-element vector):
[w00 w01 w02 w03] [x0] [w00*x0 + w01*x1 + w02*x2 + w03*x3]
[w10 w11 w12 w13] × [x1] = [w10*x0 + w11*x1 + w12*x2 + w13*x3]
[w20 w21 w22 w23] [x2] [w20*x0 + w21*x1 + w22*x2 + w23*x3]
[x3]
Usage: Every linear projection in the model (Q/K/V projections, FFN layers, output logits).
Performance: O(n×k) multiply-accumulates. For decode (generating one token at a time), this is ~95% of inference time. Memory-bandwidth bound — reading the weight matrix is the bottleneck.
Forms a matrix from two vectors (column × row):
A[i][j] = a[i] * b[j]
Example (3-element × 4-element):
[a0] [a0*b0 a0*b1 a0*b2 a0*b3]
[a1] × [b0 b1 b2 b3] = [a1*b0 a1*b1 a1*b2 a1*b3]
[a2] [a2*b0 a2*b1 a2*b2 a2*b3]
Usage: DeltaNet state updates (S += k ⊗ v where ⊗ is outer product).
Performance: O(n×m), produces an n×m matrix from two vectors.
What they are: Three different linear transformations (matrix-vector multiplies) of the same input hidden state, using three different learned weight matrices.
Given input hidden state x (e.g., 2048-dimensional vector):
Q = W_q @ x (Query projection)
K = W_k @ x (Key projection)
V = W_v @ x (Value projection)
Each weight matrix transforms x into a different representation:
- W_q: [n_heads × head_dim, hidden_dim] → produces Query
- W_k: [n_kv_heads × head_dim, hidden_dim] → produces Key
- W_v: [n_kv_heads × head_dim, hidden_dim] → produces Value
Example (simplified, single head, hidden_dim=4, head_dim=3):
x = [1.0, 2.0, 3.0, 4.0]
W_q = [[0.1, 0.2, 0.3, 0.4], Q = W_q @ x = [3.0,
[0.5, 0.6, 0.7, 0.8], → 7.0,
[0.9, 1.0, 1.1, 1.2]] 11.0]
W_k = [[0.2, 0.3, 0.4, 0.5], K = W_k @ x = [4.0,
[0.6, 0.7, 0.8, 0.9], → 8.0,
[1.0, 1.1, 1.2, 1.3]] 12.0]
W_v = [[0.3, 0.4, 0.5, 0.6], V = W_v @ x = [5.0,
[0.7, 0.8, 0.9, 1.0], → 9.0,
[1.1, 1.2, 1.3, 1.4]] 13.0]
Why three different projections?
- Query (Q): Represents "what this token is looking for" in other tokens
- Key (K): Represents "what this token offers" to be matched against queries
- Value (V): Represents "the actual information this token carries"
The Q and K projections are used to compute attention scores (how much each token should attend to each other token). The V projection contains the actual information that gets mixed based on those scores.
Once we have Q and K, we compute similarity scores via dot products:
For token i attending to token j:
score[i][j] = (Q[i] · K[j]) / sqrt(head_dim)
Example (continuing from above, head_dim=3):
Q[0] = [3.0, 7.0, 11.0]
K[0] = [4.0, 8.0, 12.0]
score = (3.0×4.0 + 7.0×8.0 + 11.0×12.0) / sqrt(3)
= (12 + 56 + 132) / 1.732
= 200 / 1.732
≈ 115.5
The division by sqrt(head_dim) (called scaled dot-product attention) prevents scores from growing too large as head_dim increases, which would make softmax too peaked.
Full attention mechanism:
1. Compute Q, K, V for all tokens
2. Compute scores: S[i][j] = (Q[i] · K[j]) / sqrt(head_dim)
3. Apply softmax per row: weights[i][j] = softmax(S[i])
4. Weighted sum of values: output[i] = sum_j(weights[i][j] × V[j])
Multi-head attention: Repeat this process with different W_q, W_k, W_v matrices for each head, concatenate outputs.
Sliding window that combines nearby values using learned weights. "Causal" means it only looks backward (at past inputs):
y[t] = w[0]*x[t] + w[1]*x[t-1] + w[2]*x[t-2] + ... + w[k-1]*x[t-k+1]
= sum_i(w[i] * x[t-i]) for i in 0..kernel_size
Example (kernel_size=3, weights=[0.5, 0.3, 0.2]):
x = [a, b, c, d, e]
y[0] = 0.5*a
y[1] = 0.5*b + 0.3*a
y[2] = 0.5*c + 0.3*b + 0.2*a
y[3] = 0.5*d + 0.3*c + 0.2*b
y[4] = 0.5*e + 0.3*d + 0.2*c
Usage: DeltaNet and Mamba-2 preprocessing — mixes information from recent time steps before the recurrence.
Implementation: Ring buffer stores last k-1 inputs to avoid shifting arrays.
Converts raw scores into probabilities that sum to 1.0:
softmax(x)[i] = exp(x[i]) / sum_j(exp(x[j]))
Example:
Input: [2.0, 1.0, 0.1]
Exp: [7.39, 2.72, 1.11] (sum = 11.22)
Output: [0.66, 0.24, 0.10] (sum = 1.00)
Usage: Attention weights (convert attention scores to probabilities), MoE routing (select experts), sampling (convert logits to token probabilities).
Numerical stability trick: Subtract max before exp to prevent overflow:
x_shifted = x - max(x)
softmax(x) = exp(x_shifted) / sum(exp(x_shifted))
Scale vector to unit RMS (Root Mean Square), then apply learned weights:
rms = sqrt(mean(x²) + eps)
rmsNorm(x, w) = (x / rms) * w
Example (eps=1e-6, weight=[1.0, 1.0, 1.0]):
x = [2.0, 4.0, 4.0]
mean(x²) = (4 + 16 + 16) / 3 = 12
rms = sqrt(12 + 1e-6) ≈ 3.464
output = [2.0/3.464, 4.0/3.464, 4.0/3.464] * w
≈ [0.577, 1.155, 1.155]
Usage: Applied before every attention and FFN sublayer. Stabilizes training and inference by preventing activation magnitudes from exploding.
Why RMS not mean: Simpler (no mean subtraction), empirically just as effective as LayerNorm.
Scale vector to unit length (magnitude = 1), no learnable weights:
magnitude = sqrt(sum(x[i]²) + eps)
l2Norm(x) = x / magnitude
Usage: DeltaNet Q/K normalization before recurrence (prevents numerical instability in the state update).
Non-linear transformations applied element-wise (independently to each value).
Smooth activation with gating property:
silu(x) = x * sigmoid(x)
= x / (1 + exp(-x))
Shape: Smooth S-curve that's negative for x<0, close to linear for x>3.
Usage: Most FFN layers (SwiGLU), causal convolution, SSM gating.
Smoother than ReLU, approximates Gaussian CDF:
gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715*x³)))
Usage: Gemma3 FFN (instead of SiLU).
Maps any value to (0, 1):
sigmoid(x) = 1 / (1 + exp(-x))
Output range: (0, 1) — never exactly 0 or 1.
Usage: Gating (how much signal to let through), MoE routing (GLM-4), attention gate (Qwen3.5).
Smooth approximation of ReLU, always positive:
softplus(x) = log(1 + exp(x))
Approximation: Linear for x > 20, ≈ exp(x) for x < -20.
Usage: SSM dt (timestep) computation — ensures positive decay factors.
Maps any value to (-1, 1):
tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
= 2*sigmoid(2*x) - 1
Usage: Logit softcapping in Gemma3 (tanh(x/cap) * cap clamps to ±cap smoothly).
Find the index of the maximum value:
argmax(x) = index i where x[i] is largest
Example:
x = [0.1, 0.8, 0.3, 0.5]
argmax(x) = 1 (x[1] = 0.8 is largest)
Usage: Greedy decoding (temperature=0) — always pick the highest-scoring token.
Implementation: Single-pass linear scan, O(n).
Scale logits to control randomness before softmax:
adjusted_logits = logits / temperature
temperature → 0: peaked distribution (greedy)
temperature = 1: unchanged
temperature → ∞: uniform distribution
Effect: Lower temp → more deterministic (top token dominates). Higher temp → more random (flatter probabilities).
Keep only the K highest-scoring tokens, set rest to -∞:
1. Sort tokens by score descending
2. Keep top K, mask rest
3. Renormalize (softmax over remaining K)
Usage: Prevent sampling extremely unlikely tokens at high temperatures.
Keep smallest set of tokens whose cumulative probability ≥ P:
1. Sort tokens by probability descending
2. Accumulate probabilities until sum ≥ P
3. Keep those tokens, mask rest
4. Renormalize
Adaptive: When model is confident (one token = 90%), keeps 1-2 tokens. When uncertain (many similar scores), keeps dozens.
Sum: sum(x) = x[0] + x[1] + ... + x[n-1]
Mean: mean(x) = sum(x) / n
Max: max(x) = largest element
Usage: Building blocks for softmax (sum/max), normalization (mean), reductions in GPU kernels.
GPU implementation: Parallel reduction — each thread reduces a chunk, then combine results in shared memory using tree reduction.
Applied independently to each element:
add(a, b)[i] = a[i] + b[i]
mul(a, b)[i] = a[i] * b[i]
Usage: Residual connections (x = x + f(x)), gating (output = data * gate).
Performance: Memory-bandwidth bound (2 reads + 1 write per element), trivially parallel.
Matrix-vector multiply is ~95% of decode time. Every linear layer (Linear(in, out)) is a GEMV:
- Q/K/V projections: 3 GEMVs per layer
- FFN: 3 GEMVs per layer (gate, up, down)
- Output logits: 1 GEMV (largest — vocab_size rows)
A 28-layer model with vocab_size=128K does ~210 GEMVs per token.
Bandwidth-bound (GEMV, normalization, activations): Time spent waiting for memory reads/writes dominates. Arithmetic is trivial. Quantization helps enormously (4× less data to read).
Compute-bound (attention for long sequences, matrix-matrix multiply during prefill): Arithmetic dominates. GPU compute power matters more than memory speed.
For single-token decode, everything is bandwidth-bound.
In-place (modifies input): rope(x) rotates x directly. Zero allocations.
Allocating (creates output): softmax(x) produces new array (can't normalize in-place — need all values to compute sum first).
Inference hot path is allocation-free — all buffers pre-allocated, operations reuse scratch space.
See also:
- Chapter 2 (attention, RoPE, RMSNorm)
- Chapter 3 (activation functions, MoE)
- Chapter 4 (GEMV, quantization)
- Chapter 6 (convolution, outer product, SSM recurrence)
- Chapter 7 (sampling operations)
Next: Appendix: Compile-Time Optimization → | Back: Chapter 17: Speculative Decoding ←