LLM Inference From Scratch

A tutorial series that builds understanding of LLM inference from the ground up. Written for systems programmers who want to understand how LLMs work under the hood — no AI/ML background required.

Each chapter introduces one major concept at a time, explaining both the why (the problem being solved) and the how (the algorithm and implementation). All AI/ML terminology is explained inline when first mentioned.

What You'll Learn

By the end of this tutorial, you'll understand:

How text becomes numbers: Tokenization, embeddings, and vocabulary projection
The transformer architecture: Attention mechanisms, position encoding, residual connections, and normalization
Feed-forward networks: Activation functions, gated linear units, and Mixture of Experts (MoE)
Quantization: How to compress 32-bit weights down to 4 bits (or fewer) while preserving most of the model's output quality
Memory management: KV caching strategies (flat, paged, radix tree) and why they matter for performance
State space models: Linear-time alternatives to quadratic attention (DeltaNet, Mamba-2)
Sampling strategies: How temperature, top-k, top-p, and repeat penalty control randomness
Compute backends: How CPU and GPU (CUDA, Metal, Vulkan, ROCm, WebGPU) backends execute kernels and manage memory
Speculative decoding: Draft models, DDTree tree construction, self-speculative layer skipping
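
To make the quantization item above concrete, here is a minimal sketch of symmetric 4-bit block quantization. It is written in plain Python for brevity, not the Zig used in the codebase, and it is not the scheme any particular chapter implements. The function names (`quantize_block`, `dequantize_block`, `pack_nibbles`) and the one-scale-per-block layout are assumptions made for illustration.

```python
def quantize_block(weights):
    """Symmetric 4-bit quantization of one block (hypothetical helper).

    Returns (scale, codes), where each code is an integer in [-8, 7].
    """
    max_abs = max(abs(w) for w in weights)
    # Map [-max_abs, max_abs] onto the integer range [-7, 7].
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights: one multiply per stored code."""
    return [scale * c for c in codes]


def pack_nibbles(codes):
    """Pack pairs of signed 4-bit codes into bytes, low nibble first."""
    assert len(codes) % 2 == 0
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


scale, codes = quantize_block([1.75, -0.8, 0.3, 0.0])
restored = dequantize_block(scale, codes)   # close to the input, not equal
packed = pack_nibbles(codes)                # two weights per byte
```

Each weight shrinks from 32 bits to 4, plus a share of the per-block scale. Assuming a block of 32 weights and a 16-bit scale, that works out to (32 × 4 + 16) / 32 = 4.5 bits per weight. The rounding error per weight is at most half the scale, which is why smaller blocks (and therefore tighter scales) trade storage for accuracy.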

Prerequisites

Systems programming knowledge: Comfortable reading code that manages memory, writes tight loops, and thinks about cache locality
Basic linear algebra: If you've forgotten (or never learned) matrix-vector multiplication, dot products, and similar operations, see the Mathematical Operations Reference appendix, which explains everything you need
No ML background needed: We explain transformers, attention, embeddings, etc. from first principles

If you can read Zig, C, or Rust code and understand concepts like "cache line" and "SIMD", you're ready.
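
As a rough self-check on the linear algebra prerequisite, the sketch below shows the two operations named above. It uses plain Python for brevity rather than the Zig used in the codebase, and the function names `dot` and `gemv` are chosen here for illustration only.

```python
def dot(a, b):
    """Dot product: multiply element-wise, then sum the products."""
    assert len(a) == len(b)
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def gemv(matrix, vector):
    """Matrix-vector multiply: one dot product per matrix row."""
    return [dot(row, vector) for row in matrix]


# A 2x2 matrix times a 2-element vector yields a 2-element vector.
result = gemv([[1.0, 2.0], [3.0, 4.0]], [10.0, 1.0])  # [12.0, 34.0]
```

If this reads naturally, the math in the early chapters should too. In an optimized kernel the inner loop of `dot` is the part that gets vectorized, and the row-by-row memory layout is what makes cache locality matter.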

Reading Paths

Different readers have different goals. Here are recommended paths through the tutorials:

🎓 ML Beginners (Systems Programmers New to ML)

Start from the beginning and read sequentially. Chapters 1-8 build understanding from first principles; Chapters 9-16 then cover implementation patterns:

Chapters 1-4 — Core concepts (tokens, transformers, quantization)
Chapters 5-8 — Advanced concepts (caching, SSMs, sampling, backends)
Chapters 9-16 — Implementation patterns (SIMD, memory safety, backend internals)

🔧 Implementation-Focused (Experienced ML Engineers)

You already know transformers and attention — jump straight to implementation:

Chapter 9: CPU SIMD — @Vector patterns, multi-row batching
Chapter 11: Metal Backend — GPU optimization on Apple Silicon
Chapter 13: Batched Dispatch — Kernel fusion, dispatch reduction
Appendix: Profiling — Performance debugging techniques

⚡ Performance Optimization

Focus on chapters that explain speedup techniques:

Chapter 4: Quantization — MLX factored dequantization (30-40% speedup)
Chapter 9: CPU SIMD — Multi-row GEMV batching (2-4× speedup)
Chapter 13: Batched Dispatch — Qwen3.5 optimization journey (15% speedup)
Appendix: Compile-Time — Lookup tables (20-30× speedup for FP8 dequantization)

🦀 Zig-Specific Patterns (Rust/C Programmers)

Learn Zig idioms used throughout the codebase:

Chapter 9: CPU SIMD — @Vector, @reduce, @mulAdd, @splat
Chapter 10: Memory Safety — defer, errdefer, leak detection
Chapter 12: CPU Parallelism — Futex-based thread pool, atomic operations
Appendix: Compile-Time — comptime, @embedFile, inline else dispatch
Appendix: Atomic Operations — Memory ordering, lock-free patterns

📐 Architecture & Design Patterns

Understand how the codebase is structured:

Chapter 8: Backends — Tagged union dispatch pattern
Chapter 14: Format Conventions — GGUF vs SafeTensors differences
Chapter 15: Chat Templates — Data-driven configuration
Chapter 16: Recipe System — Per-model/hardware defaults

🛠️ Adding a New Model

Everything you need to add a new architecture to Agave:

Chapter 14: Format Conventions — Tensor naming, dimension order, format detection
Chapter 15: Chat Templates — Prompt formatting and EOG tokens
Chapter 16: Recipe System — Per-model defaults
Chapter 8: Backends — Dispatcher pattern and kernel interface
Chapter 13: Batched Dispatch — Tier 3 composed megakernels (auto-generated from ModelDesc)

Reading Order

| # | Chapter | What You'll Learn | Approx. reading time |
| --- | --- | --- | --- |
| 1 | Tokens and Text | How text becomes numbers the model can process | 7 min |
| 2 | The Transformer | The core architecture: attention, position encoding, normalization | 12 min |
| 3 | Feed-Forward Networks | Activation functions, SwiGLU, MoE, megakernel fusion | 5 min |
| 4 | Quantization | Compressing weights from 32 bits to 4 bits; MLX, TurboQuant, PlanarQuant | 15 min |
| 5 | Memory and Caching | KV cache, PagedAttention, paged SDPA, RadixAttention | 10 min |
| 6 | State Space Models | Linear-time alternatives to attention: DeltaNet and Mamba-2 | 7 min |
| 7 | Sampling | Temperature, top-k, top-p, min-p, penalties, grammar constraints | 5 min |
| 8 | Backends | CPU, CUDA, Metal, Vulkan, ROCm, WebGPU; dispatchers and paged SDPA | 8 min |
| 9 | CPU SIMD Optimization | @Vector, @reduce, @mulAdd, multi-row batching, quantized GEMV | 12 min |
| 10 | Memory Safety | defer, errdefer, guaranteed cleanup, leak detection | 8 min |
| 11 | Metal Backend Internals | UMA, buffer caching, command buffers, batch mode, threadgroup limits | 12 min |
| 12 | CPU Parallelism | Futex-based thread pool, work-stealing, atomic counters | 10 min |
| 13 | Batched Dispatch and Fusion | gemvMulti, fused ops, megakernel system (3 tiers) | 16 min |
| 14 | Format Conventions | GGUF vs SafeTensors differences, tensor layout, metadata mapping | 12 min |
| 15 | Chat Templates | Data-driven role markers, EOG tokens, multi-turn formatting | 12 min |
| 16 | Recipe System | Proven defaults per model+hardware, user override semantics | 10 min |
| 17 | Speculative Decoding & DDTree | Draft models, tree-structured verification, self-speculative layer skip | 8 min |

Appendices:

Mathematical Operations Reference — Quick reference for all math operations (dot product, softmax, GEMV, convolution, etc.)
Compile-Time Optimization — comptime keyword, @embedFile, lookup tables, feature detection, type specialization
Profiling and Debugging — --profile flag, dispatch counters, missing kernel policy, regression detection
Atomic Operations and Memory Ordering — std.atomic.Value, memory ordering semantics, lock-free patterns

How This Relates to the Code

Each chapter references the Agave source files that implement the concepts discussed. The code follows the same layered structure as these tutorials — understanding the concepts makes the code straightforward to read.

For product documentation (project structure, module reference, supported models), see:

Architecture — project structure and module reference
Models — supported models and performance benchmarks
Kernel Status — per-backend kernel implementation status
