Appendix: Mathematical Operations Reference

A quick reference for the core mathematical operations used in LLM inference. Written for systems programmers — think of these as the "library functions" that get called thousands of times per token.

Vector and Matrix Operations

Dot Product

Multiply corresponding elements and sum:

dot(a, b) = a[0]*b[0] + a[1]*b[1] + ... + a[n-1]*b[n-1]
          = sum_i(a[i] * b[i])

Usage: Core of attention scores (Q · K), computing similarity between vectors.

Performance: O(n), bandwidth-bound (reads 2n values, writes 1 scalar).

Matrix-Vector Multiply (GEMV)

Each output element is a dot product of a matrix row with the input vector:

y[i] = sum_j(W[i][j] * x[j])

Example (3×4 matrix × 4-element vector):
[w00 w01 w02 w03]   [x0]   [w00*x0 + w01*x1 + w02*x2 + w03*x3]
[w10 w11 w12 w13] × [x1] = [w10*x0 + w11*x1 + w12*x2 + w13*x3]
[w20 w21 w22 w23]   [x2]   [w20*x0 + w21*x1 + w22*x2 + w23*x3]
                    [x3]

Usage: Every linear projection in the model (Q/K/V projections, FFN layers, output logits).

Performance: O(n×k) multiply-accumulates. For decode (generating one token at a time), this is ~95% of inference time. Memory-bandwidth bound — reading the weight matrix is the bottleneck.

Outer Product

Forms a matrix from two vectors (column × row):

A[i][j] = a[i] * b[j]

Example (3-element × 4-element):
[a0]              [a0*b0  a0*b1  a0*b2  a0*b3]
[a1] × [b0 b1 b2 b3] = [a1*b0  a1*b1  a1*b2  a1*b3]
[a2]              [a2*b0  a2*b1  a2*b2  a2*b3]

Usage: DeltaNet state updates (S += k ⊗ v where ⊗ is outer product).

Performance: O(n×m), produces an n×m matrix from two vectors.

Attention-Specific Operations

Q/K/V Projections

What they are: Three different linear transformations (matrix-vector multiplies) of the same input hidden state, using three different learned weight matrices.

Given input hidden state x (e.g., 2048-dimensional vector):

Q = W_q @ x    (Query projection)
K = W_k @ x    (Key projection)
V = W_v @ x    (Value projection)

Each weight matrix transforms x into a different representation:
- W_q: [n_heads × head_dim, hidden_dim] → produces Query
- W_k: [n_kv_heads × head_dim, hidden_dim] → produces Key
- W_v: [n_kv_heads × head_dim, hidden_dim] → produces Value

Example (simplified, single head, hidden_dim=4, head_dim=3):

x = [1.0, 2.0, 3.0, 4.0]

W_q = [[0.1, 0.2, 0.3, 0.4],      Q = W_q @ x = [3.0,
       [0.5, 0.6, 0.7, 0.8],  →                  7.0,
       [0.9, 1.0, 1.1, 1.2]]                     11.0]

W_k = [[0.2, 0.3, 0.4, 0.5],      K = W_k @ x = [4.0,
       [0.6, 0.7, 0.8, 0.9],  →                  8.0,
       [1.0, 1.1, 1.2, 1.3]]                     12.0]

W_v = [[0.3, 0.4, 0.5, 0.6],      V = W_v @ x = [5.0,
       [0.7, 0.8, 0.9, 1.0],  →                  9.0,
       [1.1, 1.2, 1.3, 1.4]]                     13.0]

Why three different projections?

Query (Q): Represents "what this token is looking for" in other tokens
Key (K): Represents "what this token offers" to be matched against queries
Value (V): Represents "the actual information this token carries"

The Q and K projections are used to compute attention scores (how much each token should attend to each other token). The V projection contains the actual information that gets mixed based on those scores.

Attention Score Computation

Once we have Q and K, we compute similarity scores via dot products:

For token i attending to token j:
score[i][j] = (Q[i] · K[j]) / sqrt(head_dim)

Example (continuing from above, head_dim=3):
Q[0] = [3.0, 7.0, 11.0]
K[0] = [4.0, 8.0, 12.0]

score = (3.0×4.0 + 7.0×8.0 + 11.0×12.0) / sqrt(3)
      = (12 + 56 + 132) / 1.732
      = 200 / 1.732
      ≈ 115.5

The division by sqrt(head_dim) (called scaled dot-product attention) prevents scores from growing too large as head_dim increases, which would make softmax too peaked.

Full attention mechanism:

1. Compute Q, K, V for all tokens
2. Compute scores: S[i][j] = (Q[i] · K[j]) / sqrt(head_dim)
3. Apply softmax per row: weights[i][j] = softmax(S[i])
4. Weighted sum of values: output[i] = sum_j(weights[i][j] × V[j])

Multi-head attention: Repeat this process with different W_q, W_k, W_v matrices for each head, concatenate outputs.

Convolution (1D Causal)

Sliding window that combines nearby values using learned weights. "Causal" means it only looks backward (at past inputs):

y[t] = w[0]*x[t] + w[1]*x[t-1] + w[2]*x[t-2] + ... + w[k-1]*x[t-k+1]
     = sum_i(w[i] * x[t-i])   for i in 0..kernel_size

Example (kernel_size=3, weights=[0.5, 0.3, 0.2]):

x = [a, b, c, d, e]
y[0] = 0.5*a
y[1] = 0.5*b + 0.3*a
y[2] = 0.5*c + 0.3*b + 0.2*a
y[3] = 0.5*d + 0.3*c + 0.2*b
y[4] = 0.5*e + 0.3*d + 0.2*c

Usage: DeltaNet and Mamba-2 preprocessing — mixes information from recent time steps before the recurrence.

Implementation: Ring buffer stores last k-1 inputs to avoid shifting arrays.

Normalization Operations

Softmax

Converts raw scores into probabilities that sum to 1.0:

softmax(x)[i] = exp(x[i]) / sum_j(exp(x[j]))

Example:

Input:  [2.0, 1.0, 0.1]
Exp:    [7.39, 2.72, 1.11]    (sum = 11.22)
Output: [0.66, 0.24, 0.10]    (sum = 1.00)

Usage: Attention weights (convert attention scores to probabilities), MoE routing (select experts), sampling (convert logits to token probabilities).

Numerical stability trick: Subtract max before exp to prevent overflow:

x_shifted = x - max(x)
softmax(x) = exp(x_shifted) / sum(exp(x_shifted))

RMS Normalization (RMSNorm)

Scale vector to unit RMS (Root Mean Square), then apply learned weights:

rms = sqrt(mean(x²) + eps)
rmsNorm(x, w) = (x / rms) * w

Example (eps=1e-6, weight=[1.0, 1.0, 1.0]):

x = [2.0, 4.0, 4.0]
mean(x²) = (4 + 16 + 16) / 3 = 12
rms = sqrt(12 + 1e-6) ≈ 3.464
output = [2.0/3.464, 4.0/3.464, 4.0/3.464] * w
       ≈ [0.577, 1.155, 1.155]

Usage: Applied before every attention and FFN sublayer. Stabilizes training and inference by preventing activation magnitudes from exploding.

Why RMS not mean: Simpler (no mean subtraction), empirically just as effective as LayerNorm.

L2 Normalization

Scale vector to unit length (magnitude = 1), no learnable weights:

magnitude = sqrt(sum(x[i]²) + eps)
l2Norm(x) = x / magnitude

Usage: DeltaNet Q/K normalization before recurrence (prevents numerical instability in the state update).

Activation Functions

Non-linear transformations applied element-wise (independently to each value).

SiLU (Swish)

Smooth activation with gating property:

silu(x) = x * sigmoid(x)
        = x / (1 + exp(-x))

Shape: Smooth S-curve that's negative for x<0, close to linear for x>3.

Usage: Most FFN layers (SwiGLU), causal convolution, SSM gating.

GELU (Gaussian Error Linear Unit)

Smoother than ReLU, approximates Gaussian CDF:

gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715*x³)))

Usage: Gemma3 FFN (instead of SiLU).

Sigmoid

Maps any value to (0, 1):

sigmoid(x) = 1 / (1 + exp(-x))

Output range: (0, 1) — never exactly 0 or 1.

Usage: Gating (how much signal to let through), MoE routing (GLM-4), attention gate (Qwen3.5).

Softplus

Smooth approximation of ReLU, always positive:

softplus(x) = log(1 + exp(x))

Approximation: Linear for x > 20, ≈ exp(x) for x < -20.

Usage: SSM dt (timestep) computation — ensures positive decay factors.

Tanh (Hyperbolic Tangent)

Maps any value to (-1, 1):

tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
        = 2*sigmoid(2*x) - 1

Usage: Logit softcapping in Gemma3 (tanh(x/cap) * cap clamps to ±cap smoothly).

Sampling Operations

Argmax

Find the index of the maximum value:

argmax(x) = index i where x[i] is largest

Example:
x = [0.1, 0.8, 0.3, 0.5]
argmax(x) = 1    (x[1] = 0.8 is largest)

Usage: Greedy decoding (temperature=0) — always pick the highest-scoring token.

Implementation: Single-pass linear scan, O(n).

Temperature Scaling

Scale logits to control randomness before softmax:

adjusted_logits = logits / temperature

temperature → 0:   peaked distribution (greedy)
temperature = 1:   unchanged
temperature → ∞:   uniform distribution

Effect: Lower temp → more deterministic (top token dominates). Higher temp → more random (flatter probabilities).

Top-K Selection

Keep only the K highest-scoring tokens, set rest to -∞:

1. Sort tokens by score descending
2. Keep top K, mask rest
3. Renormalize (softmax over remaining K)

Usage: Prevent sampling extremely unlikely tokens at high temperatures.

Top-P (Nucleus Sampling)

Keep smallest set of tokens whose cumulative probability ≥ P:

1. Sort tokens by probability descending
2. Accumulate probabilities until sum ≥ P
3. Keep those tokens, mask rest
4. Renormalize

Adaptive: When model is confident (one token = 90%), keeps 1-2 tokens. When uncertain (many similar scores), keeps dozens.

Special Operations

Reduction Operations

Sum: sum(x) = x[0] + x[1] + ... + x[n-1]

Mean: mean(x) = sum(x) / n

Max: max(x) = largest element

Usage: Building blocks for softmax (sum/max), normalization (mean), reductions in GPU kernels.

GPU implementation: Parallel reduction — each thread reduces a chunk, then combine results in shared memory using tree reduction.

Element-wise Operations

Applied independently to each element:

add(a, b)[i] = a[i] + b[i]
mul(a, b)[i] = a[i] * b[i]

Usage: Residual connections (x = x + f(x)), gating (output = data * gate).

Performance: Memory-bandwidth bound (2 reads + 1 write per element), trivially parallel.

Common Patterns

GEMV dominates inference

Matrix-vector multiply is ~95% of decode time. Every linear layer (Linear(in, out)) is a GEMV:

Q/K/V projections: 3 GEMVs per layer
FFN: 3 GEMVs per layer (gate, up, down)
Output logits: 1 GEMV (largest — vocab_size rows)

A 28-layer model with vocab_size=128K does ~210 GEMVs per token.

Bandwidth vs Compute Bound

Bandwidth-bound (GEMV, normalization, activations): Time spent waiting for memory reads/writes dominates. Arithmetic is trivial. Quantization helps enormously (4× less data to read).

Compute-bound (attention for long sequences, matrix-matrix multiply during prefill): Arithmetic dominates. GPU compute power matters more than memory speed.

For single-token decode, everything is bandwidth-bound.

In-place vs Allocating

In-place (modifies input): rope(x) rotates x directly. Zero allocations.

Allocating (creates output): softmax(x) produces new array (can't normalize in-place — need all values to compute sum first).

Inference hot path is allocation-free — all buffers pre-allocated, operations reuse scratch space.

See also:

Chapter 2 (attention, RoPE, RMSNorm)
Chapter 3 (activation functions, MoE)
Chapter 4 (GEMV, quantization)
Chapter 6 (convolution, outer product, SSM recurrence)
Chapter 7 (sampling operations)

Next: Appendix: Compile-Time Optimization → | Back: Chapter 17: Speculative Decoding ←

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Appendix: Mathematical Operations Reference

Vector and Matrix Operations

Dot Product

Matrix-Vector Multiply (GEMV)

Outer Product

Attention-Specific Operations

Q/K/V Projections

Attention Score Computation

Convolution (1D Causal)

Normalization Operations

Softmax

RMS Normalization (RMSNorm)

L2 Normalization

Activation Functions

SiLU (Swish)

GELU (Gaussian Error Linear Unit)

Sigmoid

Softplus

Tanh (Hyperbolic Tangent)

Sampling Operations

Argmax

Temperature Scaling

Top-K Selection

Top-P (Nucleus Sampling)

Special Operations

Reduction Operations

Element-wise Operations

Common Patterns

GEMV dominates inference

Bandwidth vs Compute Bound

In-place vs Allocating

FilesExpand file tree

appendix-math.md

Latest commit

History

appendix-math.md

File metadata and controls

Appendix: Mathematical Operations Reference

Vector and Matrix Operations

Dot Product

Matrix-Vector Multiply (GEMV)

Outer Product

Attention-Specific Operations

Q/K/V Projections

Attention Score Computation

Convolution (1D Causal)

Normalization Operations

Softmax

RMS Normalization (RMSNorm)

L2 Normalization

Activation Functions

SiLU (Swish)

GELU (Gaussian Error Linear Unit)

Sigmoid

Softplus

Tanh (Hyperbolic Tangent)

Sampling Operations

Argmax

Temperature Scaling

Top-K Selection

Top-P (Nucleus Sampling)

Special Operations

Reduction Operations

Element-wise Operations

Common Patterns

GEMV dominates inference

Bandwidth vs Compute Bound

In-place vs Allocating