Skip to content

maci0/agave

Repository files navigation

Agave

A high-performance LLM inference engine written in Zig.
Zero external ML libraries — all kernels, quantization, and model logic from scratch.

Quick StartFeaturesContributingDocs


Features

  • 7 Model Architectures: Gemma 3, Gemma 4, Qwen 3.5, GPT-OSS, Nemotron-H, Nemotron Nano, GLM-4
  • 6 Backends: CPU (SIMD-optimized), Metal GPU (Apple Silicon), Vulkan, CUDA, ROCm, WebGPU — individually toggleable at build time
  • Compile-Time Model Selection: Disable unused model architectures to reduce binary size (1.8 MB → 0.75 MB with all models stripped)
  • 2 Formats: GGUF, SafeTensors (multi-shard, MLX quantized, NVFP4)
  • 20+ Quantization Types: F32, F16, BF16, Q2_K, Q3_K, Q4_0, Q4_1, Q4_K, Q5_0, Q5_K, Q6_K, Q8_0, TQ1_0, IQ4_XS, IQ4_NL, FP8 E4M3, FP8 E5M2, NVFP4, MXFP4, MLX 4/6/8-bit, PlanarQuant, IsoQuant, RotorQuant
  • 18 KV Cache Quantization Types: F32, F16, Q8_0, INT8, FP8, NVFP4, TurboQuant 2/3/4-bit, PlanarQuant 2/3/4-bit, IsoQuant 2/3/4-bit, RotorQuant 2/3/4-bit — with asymmetric K/V support and paged SDPA
  • Tiered KV Cache: VRAM + RAM + SSD offloading with async prefetch (--kv-tiers vram+ram+ssd)
  • Chat Templates: Data-driven per-architecture prompt formatting (ChatML, Gemma, Gemma 4, Qwen 3.5, GLM-4, GPT-OSS)
  • Recipes: Optional proven-default configs per model/hardware/quant combo
  • Model Download: agave pull <org/repo> — download GGUF models from HuggingFace Hub with auto quant selection
  • Interactive REPL: Multi-turn chat with /help, /clear, /stats, /model, /quit
  • HTTP Server: OpenAI + Anthropic API compatible, built-in chat UI, Prometheus metrics, rate limiting
  • Multimodal Vision: Image understanding via Gemma 4 SigLIP-2 and Gemma 3 SigLIP vision encoders — image upload via CLI (--image) and HTTP API
  • Structured Output: GBNF grammar (--grammar-string, --grammar), JSON schema (--json-schema), JSON mode (--json-output), server response_format: json_object/json_schema
  • Full Sampling: temperature, top-k, top-p, min-p, repeat/frequency/presence penalties, seed, stop sequences
  • Batched Prefill: Chunked GEMM + fused FlashAttention-2 for fast prompt processing
  • Distributed Inference: Tensor parallelism (TP), pipeline parallelism (PP), disaggregated prefill/decode. Same-node multi-GPU via POSIX shm (zero-copy IPC), cross-node via TCP. Heterogeneous: mix CUDA + Vulkan + CPU across x86_64 + aarch64
  • Speculative Decoding: Draft model, self-speculative (layer skip), DDTree with configurable tree budget
  • Fused Megakernels: Composable GPU megakernels — gate+up+SiLU fused into single dispatch (3→1)
  • ~183 tok/s on Qwen3.5 0.8B Q8_0 (Metal, Apple Silicon M4 Pro), 1.2-1.7x faster than llama.cpp on Q8_0 (Q4_K performance is a known gap — active optimization target)

Quick Start

# Build (produces both ReleaseFast and Debug binaries)
zig build

# Download a model from HuggingFace
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF

# Interactive REPL
./zig-out/bin/agave model.gguf

# Single prompt
./zig-out/bin/agave model.gguf "What is the capital of France?"

# HTTP server
./zig-out/bin/agave model.gguf --serve

# Quiet mode (pipe-friendly, no banner/stats)
./zig-out/bin/agave model.gguf -q "Hello" > output.txt

# Force CPU backend
./zig-out/bin/agave model.gguf --backend cpu

# SafeTensors directory (MLX models)
./zig-out/bin/agave models/mlx-community/gemma-3-4b-it-qat-4bit

# TurboQuant KV cache (2/3/4-bit quantization for longer contexts)
./zig-out/bin/agave model.gguf --kv-type turbo4

# KV cache eviction (extend context past --ctx-size limit)
./zig-out/bin/agave model.gguf --kv-eviction norm --kv-budget 2048
./zig-out/bin/agave model.gguf --kv-eviction tri   # requires .cal file

# Generate TriAttention calibration data
./zig-out/bin/agave calibrate model.gguf

# Vision: describe an image (requires mmproj or built-in vision encoder)
./zig-out/bin/agave model.gguf --image photo.png "Describe this image"

# Override recipe defaults (user flags always win)
./zig-out/bin/agave model.gguf -t 0.9 --top-p 0.95 "Tell me a story"

# Structured output: force JSON
./zig-out/bin/agave model.gguf --json-output "Generate a user profile with name and age"

# Grammar-constrained decoding (GBNF format)
./zig-out/bin/agave model.gguf --grammar-string 'root ::= "yes" | "no"' "Is the sky blue?"

# JSON schema → structured output
./zig-out/bin/agave model.gguf --json-schema '{"type":"object","properties":{"name":{"type":"string"}}}' "User info"

# Sampling parameters
./zig-out/bin/agave model.gguf -t 0.7 --top-p 0.9 --min-p 0.05 "Tell me a story"

# GPU device selection
./zig-out/bin/agave model.gguf --list-devices                      # Show available GPUs
./zig-out/bin/agave model.gguf --backend vulkan --device 1          # Use second GPU

# Speculative decoding
./zig-out/bin/agave target.gguf --draft-model draft.gguf "prompt"   # Separate draft model
./zig-out/bin/agave model.gguf --spec-mode self --draft-layers 9    # Self-speculative
./zig-out/bin/agave model.gguf --spec-mode ddtree "prompt"          # DDTree self-draft

# Fused megakernel (3→1 GPU dispatch for FFN)
./zig-out/bin/agave model.gguf --megakernel "prompt"

Distributed Inference

Split models across multiple GPUs or machines via tensor parallelism (TP) and pipeline parallelism (PP).

# Same-node multi-GPU (shared memory IPC, zero-copy)
# Terminal 1: rank 0 on GPU 0
./zig-out/bin/agave model.gguf --backend vulkan --device 0 --pp 2 --rank 0 --peers localhost "prompt"
# Terminal 2: rank 1 on GPU 1
./zig-out/bin/agave model.gguf --backend vulkan --device 1 --pp 2 --rank 1 --peers localhost "prompt"

# Cross-node pipeline parallelism (TCP transport)
# Machine A (first half of layers):
./zig-out/bin/agave model.gguf --backend cuda --pp 2 --rank 0 --peers 192.168.0.2 "prompt"
# Machine B (second half + logits):
./zig-out/bin/agave model.gguf --backend cpu --pp 2 --rank 1 --peers 192.168.0.1 "prompt"

# Distributed tensor parallelism (weight sharding + all-reduce)
# Machine A:
./zig-out/bin/agave model.gguf --tp 2 --rank 0 --peers 192.168.0.2 "prompt"
# Machine B:
./zig-out/bin/agave model.gguf --tp 2 --rank 1 --peers 192.168.0.1 "prompt"

Supports heterogeneous setups: different backends (CUDA + Vulkan + CPU), architectures (aarch64 + x86_64), and GPU vendors (NVIDIA + AMD) in the same cluster. When --peers is localhost or 127.0.0.1, POSIX shared memory is used instead of TCP for zero-copy IPC.

Supported Models

Model Sizes Status Quant Types Notes
Gemma 3 1B, 4B, 12B, 27B Working BF16, Q8_0, Q4_0, Q4_K, Q5_K, Q6_K, MLX 4-bit SPM tokenizer, GELU activation, batched prefill
Gemma 4 E2B, E4B, 26B-A4B Working Q8_0, Q4_K, MLX 4-bit MoE (top-8), channel-based chat template, multimodal vision (SigLIP-2)
Qwen 3.5 0.8B, 9B, 27B Working Q4_0, Q4_K_M, Q8_0, BF16, MLX 4-bit Hybrid DeltaNet SSM + attention
GPT-OSS 20B Partial Q4_0 MoE, sliding window, attention sinks (poor output quality)
Nemotron-H Partial Q5_0 Mamba-2 + attention hybrid, GGUF (poor output quality)
Nemotron Nano 30B Partial MLX 4-bit, NVFP4 SSM + MoE + attention hybrid, SafeTensors (poor output quality)
GLM-4 MoE Lite 4.7B Partial MLX 4/6/8-bit MLA + MoE (GGUF compatibility issue, poor output quality)

Model Download

Download GGUF models from HuggingFace Hub with automatic quantization selection:

# Download best available quantization (prefers Q4_K_M)
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF

# Request specific quantization
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF --quant Q8_0

# List available GGUF files without downloading
./zig-out/bin/agave pull Qwen/Qwen3.5-0.8B-GGUF --list

# Private repos
HF_TOKEN=hf_xxxxx ./zig-out/bin/agave pull org/private-model

Downloads are stored in the standard HuggingFace cache layout with an agave convenience symlink. Supports resume on interrupted downloads.

Calibration

Generate TriAttention calibration data for frequency-domain KV eviction:

# Run calibration (produces model.cal alongside model.gguf)
./zig-out/bin/agave calibrate model.gguf

The calibration pass records per-head Q/K frequency statistics used by the --kv-eviction tri policy. See docs/ARCHITECTURE.md for details.

HTTP Server

Start with --serve. Supports both synchronous JSON and SSE streaming.

./zig-out/bin/agave model.gguf --serve --api-key sk-mykey

API Endpoints:

Endpoint Method Description
/v1/chat/completions POST OpenAI chat completion API
/v1/completions POST OpenAI text completion API
/v1/messages POST Anthropic Messages API
/v1/responses POST OpenAI Responses API
/v1/models GET List loaded models
/v1/embeddings POST Embedding generation (stub — returns 501)
/v1/chat POST Built-in web chat UI
/v1/chat/regenerate POST Regenerate last assistant response
/v1/conversations GET, POST Conversation management
/health GET Health check
/ready GET Readiness check
/metrics GET Prometheus metrics

Server features: up to 64 concurrent connections, request scheduler (batch up to 8, 120s timeout), 30s connection read timeout, rate limiting, Bearer token auth, CORS support.

Interactive REPL

Launch without a prompt argument for multi-turn chat:

./zig-out/bin/agave model.gguf

Commands:

Command Description
/clear, /reset Clear conversation history and KV cache
/context, /ctx Show context window usage (tokens used / max)
/system <text> Set system prompt (clears conversation)
/system Show current system prompt
/stats Toggle generation statistics display
/verbose Toggle technical details (params, EOG tokens)
/debug Toggle debug logging (token IDs, layer timing)
/model Show model information
/help Show REPL help
/quit, /exit, /q Exit

Keyboard shortcuts: Ctrl+C cancel, Ctrl+D quit, Ctrl+L clear screen, Ctrl+R reverse search.

Benchmarks

Measured on Apple M4 Pro (48 GB unified memory). See docs/BENCHMARKS.md for full methodology.

Model Quant Backend Decode (tok/s) vs llama.cpp
Qwen3.5 0.8B Q8_0 Metal 183.3 1.31x
Qwen3.5 9B Q8_0 Metal 41.7 1.67x
Gemma 3 4B MLX-Q4 Metal 78.1
Gemma 3 12B Q8_0 Metal 22.3 1.19x
Gemma 4 E2B Metal 9.7
Gemma 4 E4B Metal 8.5
Gemma 4 26B-A4B Metal 5.0
Gemma 3 27B QAT 4-bit Metal 11.6
Qwen3.5 0.8B MLX-4bit Metal 12.7

Multi-Backend (Qwen3.5 0.8B Q8_0)

Backend Hardware Decode (tok/s)
Metal Apple M4 Pro 129
ROCm AMD RX 7900 XTX 50.8
CPU Ryzen 9 9950X (32T) 44
CUDA NVIDIA GB10 (aarch64) 35
Vulkan AMD RX 7900 XTX 2.7

Distributed Inference (dual NVIDIA GB10 over RoCE RDMA)

Model Config Transport Decode (tok/s)
9B Q8_0 Single GPU 9.1
9B Q8_0 PP=2 NCCL RoCE 8.5
9B Q8_0 TP=2 NCCL RoCE 5.1
9B Q8_0 TP=2 TCP RoCE 4.9
27B Q4_K_M Single GPU 2.2
27B Q4_K_M PP=2 NCCL RoCE 2.2
27B Q4_K_M TP=2 NCCL RoCE 1.7

All quant formats supported on all backends: Q8_0 (GPU), Q4_0/Q4_K/Q5_K/Q6_K (GPU or CPU fallback on UMA). See docs/KERNELS.md for details.

Prerequisites

  • Zig 0.16.0
  • macOS (Metal backend) / Linux (Vulkan, CUDA, ROCm) / any platform (CPU, WebGPU backends)
  • GPU backends load drivers at runtime via dlopen — no SDK needed at build time

CLI Options

agave [OPTIONS] <model> [prompt]

  -h, --help               Show help
  -v, --version            Print version
  -q, --quiet              Suppress banner and stats
  -s, --serve              Start HTTP server
  -p, --port <PORT>        Server port [default: 49453]
  -n, --max-tokens <N>     Max tokens to generate [default: 512]
  -t, --temperature <T>    Sampling temperature, 0 = greedy [default: 0]
      --top-p <P>          Nucleus sampling threshold [default: 1.0]
      --top-k <K>          Top-k sampling, 0 = disabled [default: 0]
      --min-p <P>          Min-p sampling threshold [default: 0]
      --repeat-penalty <R> Repetition penalty [default: 1.0]
      --dry-multiplier <M> DRY n-gram repetition penalty [default: 0]
      --xtc-probability <P> XTC diversity sampling [default: 0]
      --mirostat-mode <N>  Mirostat target-entropy sampling: 0=off, 2=on [default: 0]
      --system <TEXT>      System prompt for chat formatting
      --backend <BE>       auto, cpu, metal, vulkan, cuda, rocm, webgpu [default: auto]
      --ctx-size <N|auto>  Context window size [default: min(model, 4096), 0 = model max, auto = fit to memory]
      --seed <N>           Random seed for sampling [default: random]
      --grammar <FILE>     GBNF grammar file for constrained decoding
      --grammar-string <G> Inline GBNF grammar string
      --json-schema <S>    JSON schema for structured output
      --json-output        Force valid JSON object output
      --kv-type <TYPE>     KV cache quantization: f32, f16, q8_0/q8, int8/i8, fp8/fp8_e4m3, nvfp4/fp4, turbo2/tq2, turbo3/tq3, turbo4/tq4, planar2-4/pq2-4, iso2-4/iq2-4, rotor2-4/rq2-4 [default: f16]
      --kv-tiers <TIERS>   Enable tiered KV cache: vram+ram, vram+ram+ssd [default: off]
      --kv-ram-budget <GB> RAM tier budget in GB, requires --kv-tiers [default: 50% of free RAM]
      --kv-ssd-path <PATH> SSD tier file path, requires --kv-tiers with ssd
      --kv-ssd-budget <GB> SSD tier budget in GB, requires --kv-tiers with ssd [default: 10]
      --host <ADDR>        Server bind address [default: 127.0.0.1]
      --api-key <KEY>      API key for server authentication (Bearer token)
      --prefill-batch-size <N> Prefill chunk size in tokens [default: 512]
      --no-color           Disable colored output (same as --color=never)
      --color <MODE>       Color mode: auto, always, never [default: auto]
      --kv-type-k <TYPE>   KV key quantization (overrides --kv-type)
      --kv-type-v <TYPE>   KV value quantization (overrides --kv-type)
  -V, --verbose            Show technical details (params, load times, EOG)
      --allow-cpu-fallback Allow GPU backends to fall back to CPU
  -d, --debug              Enable debug logging (token IDs, layer timing)
      --json               Output results as JSON (implies --quiet)
      --model-info         Print model metadata and exit (combine with --json)
      --profile            Profile per-op timing (halves throughput)
      --benchmark          Run decode benchmark with built-in prompt
      --mmproj <PATH>      Path to vision projector GGUF (mmproj file)
      --image <PATH>       Path to image file for multimodal inference (PNG or PPM)
      --kv-eviction <MODE> KV cache eviction policy: none, norm, tri [default: none]
      --kv-budget <N>      Max KV entries to retain after eviction [default: 80% of ctx-size]
      --mmap               Use lazy mmap instead of preloading weights into RAM
      --megakernel         Enable fused FFN megakernels (3→1 dispatch per layer)
      --draft-model <PATH> Draft model GGUF for speculative decoding
      --spec-mode <MODE>   Speculative mode: standard, ddtree, self, ngram [default: ddtree]
                           ngram uses output history (no draft model needed)
  -K, --spec-tokens <N>    Draft tokens per speculation round [default: 5]
      --tree-budget <N>    DDTree node budget [default: 64]
      --draft-layers <N>   Layers for self-speculative draft [default: auto]
      --list-devices       List available compute devices and exit
      --device <N>         GPU device index for CUDA/ROCm/Vulkan [default: 0]
      --tp <N>             Tensor parallelism degree [default: 1]
      --pp <N>             Pipeline parallelism stages [default: 1]
      --peers <ADDR>       Peer address for distributed inference
      --rank <N>           This node's rank [default: 0]
      --transport <TYPE>   IPC transport: auto, tcp, shm, nccl [default: auto]
      --disagg             Disaggregated prefill/decode

Build Options

All backends and models are enabled by default. Disable individually to reduce binary size or avoid unwanted dependencies.

# Disable specific backends
zig build -Denable-vulkan=false
zig build -Denable-cuda=false -Denable-rocm=false

# CPU-only build (no GPU backends)
zig build -Denable-metal=false -Denable-vulkan=false -Denable-cuda=false -Denable-rocm=false -Denable-webgpu=false

# GPU-only (disable CPU fallback — compile error if GPU init fails)
zig build -Denable-cpu=false

# Disable specific model architectures
zig build -Denable-glm4=false

# Minimal build: single model (Gemma 3) + single backend (Metal)
zig build -Denable-gemma4=false -Denable-qwen35=false -Denable-gpt-oss=false \
  -Denable-nemotron-h=false -Denable-nemotron-nano=false -Denable-glm4=false \
  -Denable-vulkan=false -Denable-cuda=false -Denable-rocm=false -Denable-webgpu=false

# Override GPU architecture targets
zig build -Dcuda-sm=sm_120        # Blackwell
zig build -Drocm-arch=gfx942      # MI300X

# Cross-compile
zig build -Dtarget=aarch64-linux-gnu -Denable-metal=false

Backend Options:

Option Type Default Purpose
enable-cpu bool true CPU backend
enable-metal bool true Metal backend (macOS only)
enable-vulkan bool true Vulkan backend (runtime dlopen)
enable-cuda bool true CUDA backend (runtime dlopen)
enable-rocm bool true ROCm backend (runtime dlopen)
enable-webgpu bool true WebGPU backend (WGSL shaders)
cuda-sm enum sm_90 CUDA SM target (sm_50..sm_120)
rocm-arch enum gfx1100 ROCm GFX target (gfx90a..gfx1151)

Model Options:

Option Type Default Purpose
enable-gemma3 bool true Gemma 3 model support
enable-gemma4 bool true Gemma 4 model support
enable-qwen35 bool true Qwen 3.5 model support
enable-gpt-oss bool true GPT-OSS model support
enable-nemotron-h bool true Nemotron-H model support
enable-nemotron-nano bool true Nemotron Nano model support
enable-glm4 bool true GLM-4 model support

Recipes

Recipes are optional preset configurations matched by architecture + backend + quantization. They provide proven defaults (temperature, top-p, context size, etc.) while allowing full user override via CLI flags.

# Recipe auto-applied, shown in banner:
🌵 agave Qwen3.5-0.8B Q4_0 Metal 32L/4096E/16H (45ms)
recipe: Qwen3.5 Q4 Metal

# User flags always take priority over recipe defaults:
./zig-out/bin/agave model.gguf -t 0  # overrides recipe temperature

Current presets: Qwen3.5 Q4 Metal, Gemma Q4 Metal, GPT-OSS Metal, GLM-4 generic, CPU generic. Add new recipes in src/recipe.zig.

Project Structure

src/
├── main.zig           # CLI, format detection, model init, REPL, recipe application
├── arch.zig           # Architecture enum, detection, chat template mapping
├── pull.zig           # Model download from HuggingFace Hub (agave pull)
├── server/            # HTTP server
│   ├── server.zig     #   HTTP server (OpenAI + Anthropic API + chat UI)
│   ├── json.zig       #   Streaming JSON parser for API requests
│   ├── scheduler.zig  #   Continuous batching request scheduler
│   ├── metrics.zig    #   Prometheus metrics collector
│   └── rate_limiter.zig #  Token bucket rate limiter
├── display.zig        # Rich CLI output (banner, stats, progress)
├── chat_template.zig  # Data-driven chat prompt templates (ChatML, Gemma, Gemma4, Qwen3.5, GLM-4, GPT-OSS)
├── recipe.zig         # Optional preset configs per model/hardware/quant combo
├── thread_pool.zig    # Futex-based work-stealing thread pool
├── image.zig          # PNG/PPM image decoder and resize for multimodal inference
├── perf.zig           # Performance timer utilities
├── readline.zig       # Line editor for interactive REPL
├── micro_bench.zig    # Standalone micro-benchmark binary
├── format/            # Weight file loaders
│   ├── format.zig     #   Format interface (getTensor, getMetaStr, ...)
│   ├── gguf.zig       #   GGUF v2/v3 with mmap
│   └── safetensors.zig#   Multi-shard SafeTensors + config.json
├── models/            # Model architectures
│   ├── model.zig      #   Model interface + shared helpers (expertWeightStride, etc.)
│   ├── gemma3.zig     #   Gemma 3 (GQA, GELU, post-norms)
│   ├── gemma4.zig     #   Gemma 4 (MoE, dual attention, PLE)
│   ├── qwen35.zig     #   Qwen 3.5 (DeltaNet SSM hybrid)
│   ├── gpt_oss.zig    #   GPT-OSS (MoE, sliding window)
│   ├── nemotron_h.zig #   Nemotron-H (Mamba-2 hybrid, GGUF)
│   ├── nemotron_nano.zig # Nemotron Nano (SSM+MoE+attn, SafeTensors NVFP4)
│   ├── glm4.zig       #   GLM-4 MoE Lite (MLA, MoE)
│   └── vision.zig     #   SigLIP-2 vision encoder for multimodal models
├── ops/               # Shared compute kernels
│   ├── attention.zig  #   SDPA with SIMD + sliding window + backend dispatch
│   ├── math.zig       #   argmax, softplus, sigmoid, GELU, sampleToken
│   ├── ssm.zig        #   SSM ops: causal conv1d, Mamba-2 recurrence, group norm+gate
│   ├── quant.zig      #   Quantization helpers (bf16, mxfp4, fp8, iq4nl, nvfp4_st)
│   ├── kv_quant.zig   #   KV cache quantization (f32/f16/q8_0/int8/fp8/nvfp4)
│   ├── mlx.zig        #   MLX 4/6/8-bit affine dequant
│   └── split_attention.zig # Split-attention: async CPU-GPU KV offloading
├── backend/           # Hardware backends (all individually toggleable)
│   ├── backend.zig    #   Tagged union dispatcher + NullBackend stub
│   ├── cpu.zig        #   CPU with SIMD (AVX2, NEON, SVE)
│   ├── metal.zig      #   Metal GPU (Apple Silicon)
│   ├── vulkan.zig     #   Vulkan GPU (runtime dlopen)
│   ├── cuda.zig       #   CUDA GPU (runtime dlopen, Zig PTX kernels)
│   ├── rocm.zig       #   ROCm GPU (runtime dlopen)
│   ├── webgpu.zig     #   WebGPU (WGSL shaders, browser + native)
│   ├── objc.zig       #   Objective-C runtime bridge for Metal
│   └── kernels/       #   GPU shader/kernel sources
│       ├── metal/     #     MSL compute shaders
│       ├── vulkan/    #     SPIR-V compute shaders
│       ├── cuda/      #     Zig CUDA kernels (compiled to PTX)
│       ├── rocm/      #     AMDGCN kernels (compiled to HSACO)
│       └── webgpu/    #     WGSL compute shaders
├── kvcache/
│   ├── manager.zig    #   KV cache alloc/free, PagedKvCache, RadixTree
│   ├── block_allocator.zig # Block allocation for paged KV cache
│   ├── tiered.zig     #   Tiered KV cache (VRAM + RAM + SSD)
│   └── prefetch.zig   #   Async block prefetching for tiered cache
└── tokenizer/
    ├── tokenizer.zig  #   Tokenizer interface
    └── bpe.zig        #   BPE + SPM tokenizer

research/kernels/          # Kernel research (not part of main build)
├── reference.py           #   PyTorch reference implementations
├── generate_golden.py     #   Golden test data generator
├── autotune.py            #   Benchmarking and optimization orchestrator
├── registry.py            #   Kernel registry and search spaces
└── golden/                #   Generated .bin files for Zig @embedFile

Docker

Build multi-platform images (x86_64 + aarch64) using docker buildx:

# Build for both platforms (all GPU backends enabled, glibc)
docker buildx build --platform linux/amd64,linux/arm64 -t agave .

# Build and load for current platform only
docker buildx build --load -t agave .

# CPU-only build (static musl binary, smaller image)
docker buildx build --load -t agave \
  --build-arg ENABLE_VULKAN=false \
  --build-arg ENABLE_CUDA=false \
  --build-arg ENABLE_ROCM=false .

# Minimal build: single model + CPU only
docker buildx build --load -t agave \
  --build-arg ENABLE_VULKAN=false \
  --build-arg ENABLE_CUDA=false \
  --build-arg ENABLE_ROCM=false \
  --build-arg ENABLE_QWEN35=false \
  --build-arg ENABLE_GPT_OSS=false \
  --build-arg ENABLE_NEMOTRON_H=false \
  --build-arg ENABLE_NEMOTRON_NANO=false \
  --build-arg ENABLE_GLM4=false .

# Run inference
docker run --rm -v /path/to/models:/models agave /models/model.gguf "Hello"

# Run HTTP server
docker run --rm -p 49453:49453 -v /path/to/models:/models agave /models/model.gguf --serve

# Override Zig version at build time
docker buildx build --build-arg ZIG_VERSION=0.16.0 -t agave .

GPU backends (CUDA, Vulkan, ROCm) load their drivers at runtime via dlopen, which requires glibc. When all three GPU backends are disabled, the build automatically switches to musl for a fully static binary. Zig cross-compiles natively — no QEMU emulation needed during build.

Static musl builds

For environments where a fully static, dependency-free binary is needed (Alpine containers, embedded systems, minimal distros), disable all dlopen backends:

# Static musl binary — CPU backend only
zig build -Dtarget=x86_64-linux-musl \
  -Denable-metal=false -Denable-vulkan=false \
  -Denable-cuda=false -Denable-rocm=false

# Cross-compile static ARM64 binary
zig build -Dtarget=aarch64-linux-musl \
  -Denable-metal=false -Denable-vulkan=false \
  -Denable-cuda=false -Denable-rocm=false

Note: Static musl builds only work with the CPU backend. GPU backends (CUDA, Vulkan, ROCm) depend on dlopen to load vendor drivers at runtime, which requires glibc. Attempting to dlopen a glibc-linked .so from a musl binary will segfault.

Documentation

License

GNU General Public License v3.0

About

A high-performance LLM inference engine written in Zig.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors