LLM Inference From Scratch

A tutorial series that explains LLM inference from the ground up, written for systems programmers who want to understand how LLMs work under the hood. No AI/ML background is required.

Each chapter introduces one major concept, explaining both the why (the problem being solved) and the how (the algorithm and implementation). AI/ML terminology is explained inline where it first appears.

What You'll Learn

By the end of this tutorial, you'll understand:

  • How text becomes numbers: Tokenization, embeddings, and vocabulary projection
  • The transformer architecture: Attention mechanisms, position encoding, residual connections, and normalization
  • Feed-forward networks: Activation functions, gated linear units, and Mixture of Experts (MoE)
  • Quantization: How to compress 32-bit weights down to 4 bits (or fewer) while maintaining quality
  • Memory management: KV caching strategies (flat, paged, radix tree) and why they matter for performance
  • State space models: Linear-time alternatives to quadratic attention (DeltaNet, Mamba-2)
  • Sampling strategies: How temperature, top-k, top-p, and repeat penalty control randomness
  • Compute backends: How CPU and GPU (CUDA, Metal, Vulkan, ROCm, WebGPU) backends execute kernels and manage memory
  • Speculative decoding: Draft models, DDTree tree construction, self-speculative layer skipping
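As a small taste of the sampling item in the list above, here is a minimal sketch of how temperature, top-k, and top-p can shape the choice of the next token. It is written in Python for brevity rather than in the codebase's own language. The function names and the example scores are hypothetical, chosen only for illustration, and this is one common formulation, not necessarily the one the tutorials implement.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities.
    Temperatures below 1 sharpen the distribution; above 1 flatten it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the shortest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

# Hypothetical scores for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=0.7)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.85)

# Draw one token id from the surviving candidates (seeded for repeatability).
rng = random.Random(0)
token = rng.choices(list(candidates), weights=list(candidates.values()))[0]
```

Lowering `temperature` or `top_p` narrows the candidate set and makes output more predictable; raising them admits less likely tokens.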

Prerequisites

  • Systems programming knowledge: Comfortable reading code that manages memory, writes tight loops, and thinks about cache locality
  • Basic linear algebra: If you've forgotten (or never learned) matrix-vector multiply, dot products, etc., see the Math Reference — we explain everything you need
  • No ML background needed: We explain transformers, attention, embeddings, etc. from first principles

If you can read Zig, C, or Rust code and understand concepts like "cache line" and "SIMD", you're ready.
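For a quick self-check on the linear algebra mentioned above, the sketch below spells out a dot product and a matrix-vector multiply as plain loops. It uses Python only to keep the example short. The names `dot` and `matvec` and the sample values are illustrative assumptions, not identifiers from the codebase.

```python
def dot(a, b):
    """Dot product: multiply matching elements and sum the results."""
    assert len(a) == len(b)
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def matvec(matrix, vector):
    """Matrix-vector multiply: one dot product per matrix row."""
    return [dot(row, vector) for row in matrix]

# A 3x2 weight matrix times a 2-element vector yields a 3-element vector.
weights = [[1.0, 2.0],
           [3.0, 4.0],
           [5.0, 6.0]]
x = [1.0, -1.0]
y = matvec(weights, x)
```

If the inner loop reads as "walk a row of memory, multiply, accumulate", you already have the intuition needed for why cache locality and SIMD matter here.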

Reading Paths

Different readers have different goals. Here are recommended paths through the tutorials:

🎓 ML Beginners (Systems Programmers New to ML)

Start from the beginning and read sequentially. Chapters 1-8 build understanding from first principles:

  1. Chapters 1-4: Core concepts (tokens, transformers, quantization)
  2. Chapters 5-8: Advanced concepts (caching, SSMs, sampling, backends)
  3. Chapters 9-17: Implementation patterns (SIMD, memory safety, backend internals) and speculative decoding

🔧 Implementation-Focused (Experienced ML Engineers)

You already know transformers and attention — jump straight to implementation:

Performance Optimization

Focus on chapters that explain speedup techniques:

🦀 Zig-Specific Patterns (Rust/C Programmers)

Learn Zig idioms used throughout the codebase:

📐 Architecture & Design Patterns

Understand how the codebase is structured:

🛠️ Adding a New Model

Everything you need to add a new architecture to Agave:

Reading Order

| # | Chapter | What You'll Learn | ~Time |
|---|---------|-------------------|-------|
| 1 | Tokens and Text | How text becomes numbers the model can process | 7 min |
| 2 | The Transformer | The core architecture: attention, position encoding, normalization | 12 min |
| 3 | Feed-Forward Networks | Activation functions, SwiGLU, MoE, megakernel fusion | 5 min |
| 4 | Quantization | Compressing weights from 32 bits to 4 bits; MLX, TurboQuant, PlanarQuant | 15 min |
| 5 | Memory and Caching | KV cache, PagedAttention, paged SDPA, RadixAttention | 10 min |
| 6 | State Space Models | Linear-time alternatives to attention: DeltaNet and Mamba-2 | 7 min |
| 7 | Sampling | Temperature, top-k, top-p, min-p, penalties, grammar constraints | 5 min |
| 8 | Backends | CPU, CUDA, Metal, Vulkan, ROCm, WebGPU: dispatchers and paged SDPA | 8 min |
| 9 | CPU SIMD Optimization | @Vector, @reduce, @mulAdd, multi-row batching, quantized GEMV | 12 min |
| 10 | Memory Safety | defer, errdefer, guaranteed cleanup, leak detection | 8 min |
| 11 | Metal Backend Internals | UMA, buffer caching, command buffers, batch mode, threadgroup limits | 12 min |
| 12 | CPU Parallelism | Futex-based thread pool, work-stealing, atomic counters | 10 min |
| 13 | Batched Dispatch and Fusion | gemvMulti, fused ops, megakernel system (3 tiers) | 16 min |
| 14 | Format Conventions | GGUF vs SafeTensors differences, tensor layout, metadata mapping | 12 min |
| 15 | Chat Templates | Data-driven role markers, EOG tokens, multi-turn formatting | 12 min |
| 16 | Recipe System | Proven defaults per model+hardware, user override semantics | 10 min |
| 17 | Speculative Decoding & DDTree | Draft models, tree-structured verification, self-speculative layer skip | 8 min |
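To make the quantization chapter's premise concrete, here is a minimal sketch of compressing a block of floats to signed 4-bit codes with one shared scale. It is a generic symmetric scheme written in Python for brevity. It is an assumption for illustration and does not reproduce the layout of any specific format named in the chapter list, and the function names and sample values are hypothetical.

```python
def quantize_block(values):
    """Map floats to signed 4-bit integers in [-8, 7] that share one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # avoid dividing by zero
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats: each code times the shared scale."""
    return [scale * c for c in codes]

# Hypothetical block of eight weights.
block = [0.12, -0.70, 0.33, 0.06, -0.41, 0.68, 0.00, -0.09]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

Each weight now costs 4 bits instead of 32, plus one scale amortized over the block. The price is the rounding error measured by `max_error`.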

Appendices:

How This Relates to the Code

Each chapter references the Agave source files that implement the concepts discussed. The code follows the same layered structure as these tutorials — understanding the concepts makes the code straightforward to read.

For product documentation (project structure, module reference, supported models), see:

  • Architecture — project structure and module reference
  • Models — supported models and performance benchmarks
  • Kernel Status — per-backend kernel implementation status