A tutorial series that builds understanding of LLM inference from the ground up. Written for systems programmers who want to understand how LLMs work under the hood — no AI/ML background required.
Each chapter introduces one major concept at a time, explaining both the why (the problem being solved) and the how (the algorithm and implementation). All AI/ML terminology is explained inline when first mentioned.
By the end of this series, you'll understand:
- How text becomes numbers: Tokenization, embeddings, and vocabulary projection
- The transformer architecture: Attention mechanisms, position encoding, residual connections, and normalization
- Feed-forward networks: Activation functions, gated linear units, and Mixture of Experts (MoE)
- Quantization: How to compress 32-bit weights down to 4 bits (or fewer) while maintaining quality
- Memory management: KV caching strategies (flat, paged, radix tree) and why they matter for performance
- State space models: Linear-time alternatives to quadratic attention (DeltaNet, Mamba-2)
- Sampling strategies: How temperature, top-k, top-p, and repeat penalty control randomness
- Compute backends: How the CPU and GPU (CUDA, Metal, Vulkan, ROCm, WebGPU) backends execute kernels and manage memory
- Speculative decoding: Draft models, DDTree tree construction, self-speculative layer skipping
Prerequisites:
- Systems programming knowledge: Comfortable reading code that manages memory, writes tight loops, and thinks about cache locality
- Basic linear algebra: If you've forgotten (or never learned) matrix-vector multiplication, dot products, and the like, see the Mathematical Operations Reference appendix, which explains everything you need
- No ML background needed: We explain transformers, attention, embeddings, etc. from first principles
If you can read Zig, C, or Rust code and understand concepts like "cache line" and "SIMD", you're ready.
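As a rough self-check, the sketch below shows the kind of code the prerequisites describe: a dot product and a matrix-vector multiply (GEMV) over a flat, row-major buffer. It is an illustration only, written in Rust rather than taken from the Agave source, and the names `dot` and `gemv` are hypothetical.

```rust
/// Dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have the same length");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Computes y = W * x, where `w` holds a `rows x cols` matrix in
/// row-major order as one flat slice.
fn gemv(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert!(cols > 0, "matrix must have at least one column");
    assert_eq!(w.len(), rows * cols, "matrix size must match rows * cols");
    assert_eq!(x.len(), cols, "input vector must have `cols` elements");
    // Each row is contiguous in memory, so the inner dot product reads `w`
    // sequentially, which is the cache-friendly access pattern.
    w.chunks_exact(cols).map(|row| dot(row, x)).collect()
}
```

Calling `gemv(&w, &x, 2, 3)` on a 2x3 matrix returns a vector of two outputs, one per row. If this code reads naturally, the implementation chapters should be approachable.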
Different readers have different goals. Here are recommended paths through the tutorials:
Start from the beginning and read sequentially. Chapters 1-8 build understanding from first principles, and Chapters 9-16 apply it to implementation:
- Chapters 1-4 — Core concepts (tokens, transformers, quantization)
- Chapters 5-8 — Advanced concepts (caching, SSMs, sampling, backends)
- Chapters 9-16 — Implementation patterns (SIMD, memory safety, backend internals)
You already know transformers and attention — jump straight to implementation:
- Chapter 9: CPU SIMD — @Vector patterns, multi-row batching
- Chapter 11: Metal Backend — GPU optimization on Apple Silicon
- Chapter 13: Batched Dispatch — Kernel fusion, dispatch reduction
- Appendix: Profiling — Performance debugging techniques
Focus on chapters that explain speedup techniques:
- Chapter 4: Quantization — MLX factored dequantization (30-40% speedup)
- Chapter 9: CPU SIMD — Multi-row GEMV batching (2-4× speedup)
- Chapter 13: Batched Dispatch — Qwen3.5 optimization journey (15% speedup)
- Appendix: Compile-Time — Lookup tables (20-30× for FP8 dequant)
Learn Zig idioms used throughout the codebase:
- Chapter 9: CPU SIMD — @Vector, @reduce, @mulAdd, @splat
- Chapter 10: Memory Safety — defer, errdefer, leak detection
- Chapter 12: CPU Parallelism — Futex-based thread pool, atomic operations
- Appendix: Compile-Time — comptime, @embedFile, inline else dispatch
- Appendix: Atomic Operations — Memory ordering, lock-free patterns
Understand how the codebase is structured:
- Chapter 8: Backends — Tagged union dispatch pattern
- Chapter 14: Format Conventions — GGUF vs SafeTensors differences
- Chapter 15: Chat Templates — Data-driven configuration
- Chapter 16: Recipe System — Per-model/hardware defaults
Everything you need to add a new architecture to Agave:
- Chapter 14: Format Conventions — Tensor naming, dimension order, format detection
- Chapter 15: Chat Templates — Prompt formatting and EOG tokens
- Chapter 16: Recipe System — Per-model defaults
- Chapter 8: Backends — Dispatcher pattern and kernel interface
- Chapter 13: Batched Dispatch — Tier 3 composed megakernels (auto-generated from ModelDesc)
| # | Chapter | What You'll Learn | ~Time |
|---|---|---|---|
| 1 | Tokens and Text | How text becomes numbers the model can process | 7 min |
| 2 | The Transformer | The core architecture: attention, position encoding, normalization | 12 min |
| 3 | Feed-Forward Networks | Activation functions, SwiGLU, MoE, megakernel fusion | 5 min |
| 4 | Quantization | Compressing weights from 32 bits to 4 bits; MLX, TurboQuant, PlanarQuant | 15 min |
| 5 | Memory and Caching | KV cache, PagedAttention, paged SDPA, RadixAttention | 10 min |
| 6 | State Space Models | Linear-time alternatives to attention: DeltaNet and Mamba-2 | 7 min |
| 7 | Sampling | Temperature, top-k, top-p, min-p, penalties, grammar constraints | 5 min |
| 8 | Backends | CPU, CUDA, Metal, Vulkan, ROCm, WebGPU — dispatchers and paged SDPA | 8 min |
| 9 | CPU SIMD Optimization | @Vector, @reduce, @mulAdd, multi-row batching, quantized GEMV | 12 min |
| 10 | Memory Safety | defer, errdefer, guaranteed cleanup, leak detection | 8 min |
| 11 | Metal Backend Internals | UMA, buffer caching, command buffers, batch mode, threadgroup limits | 12 min |
| 12 | CPU Parallelism | Futex-based thread pool, work-stealing, atomic counters | 10 min |
| 13 | Batched Dispatch and Fusion | gemvMulti, fused ops, megakernel system (3 tiers) | 16 min |
| 14 | Format Conventions | GGUF vs SafeTensors differences, tensor layout, metadata mapping | 12 min |
| 15 | Chat Templates | Data-driven role markers, EOG tokens, multi-turn formatting | 12 min |
| 16 | Recipe System | Proven defaults per model+hardware, user override semantics | 10 min |
| 17 | Speculative Decoding & DDTree | Draft models, tree-structured verification, self-speculative layer skip | 8 min |
Appendices:
- Mathematical Operations Reference — Quick reference for all math operations (dot product, softmax, GEMV, convolution, etc.)
- Compile-Time Optimization — comptime keyword, @embedFile, lookup tables, feature detection, type specialization
- Profiling and Debugging — --profile flag, dispatch counters, missing kernel policy, regression detection
- Atomic Operations and Memory Ordering — std.atomic.Value, memory ordering semantics, lock-free patterns
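To give a flavor of the operations the math reference covers, here is a minimal sketch of a numerically stable softmax with a temperature parameter, the step that turns raw model scores (logits) into probabilities before sampling. It is an assumption-laden illustration in Rust, not code from the Agave source, and `softmax_with_temperature` is a hypothetical name.

```rust
/// Converts logits into probabilities that sum to 1.
/// Lower `temperature` sharpens the distribution; higher flattens it.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtracting the maximum keeps every exponent <= 0, so `exp` cannot
    // overflow even for very large logits.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

With a temperature of 1.0 this is the standard softmax; values below 1.0 concentrate probability on the highest-scoring token, which is why temperature acts as a randomness control.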
Each chapter references the Agave source files that implement the concepts discussed. The code follows the same layered structure as these tutorials — understanding the concepts makes the code straightforward to read.
For product documentation (project structure, module reference, supported models), see:
- Architecture — project structure and module reference
- Models — supported models and performance benchmarks
- Kernel Status — per-backend kernel implementation status