Performance regressions are silent — the model still runs, but slower. Profiling makes performance visible. Agave has built-in instrumentation for dispatch counts, barriers, syncs, and per-operation timing.
Enable profiling: Add --profile to any inference command.
./zig-out/bin/agave model.gguf --profile "Test prompt"Output per token:
Token 15: "world" (151ms)
embedLookup: 0.2ms
layer 0: 8.1ms (rmsNorm: 0.3ms, gemv×3: 6.2ms, rope: 0.1ms, sdpa: 1.3ms, ...)
layer 1: 8.0ms
...
layer 31: 8.1ms
final_norm: 0.3ms
lm_head_gemv: 12.1ms
Metal counters:
Dispatches: 994
Barriers: 690
Syncs: 1
What profiling adds:
- Per-operation timing: Each gemv, rmsNorm, sdpa, etc. timed individually
- Backend counters: Dispatch/barrier/sync counts (Metal, CUDA, ROCm)
- Total time per layer: Aggregated time for each transformer layer
Cost of profiling: ~50% throughput loss due to additional GPU syncs (timing requires flushing command buffers).
When to use:
- ✅ Debugging performance regressions
- ✅ Identifying bottlenecks (which op is slow?)
- ✅ Verifying optimizations (did gemvMulti reduce dispatches?)
- ❌ Production inference (too slow)
// src/perf.zig
pub const Op = enum {
emb_lookup, rms_norm, gemv_qkv, gemv_out, gemv_ffn,
deinterleave, rope, sdpa, sigmoid_mul, silu_mul,
gelu_mul, add, deltanet, total_layer,
};
pub const PerfCounters = struct {
counts: [n_ops]u64 = [_]u64{0} ** n_ops,
times_us: [n_ops]u64 = [_]u64{0} ** n_ops,
n_tokens: u64 = 0,
enabled: bool = false,
pub inline fn start(self: *PerfCounters) i128 {
if (!self.enabled) return 0;
return std.time.nanoTimestamp();
}
pub inline fn end(self: *PerfCounters, op: Op, t0: i128) void {
if (!self.enabled) return;
const elapsed: u64 = @intCast(@divFloor(std.time.nanoTimestamp() - t0, 1000));
const idx = @intFromEnum(op);
self.times_us[idx] += elapsed;
self.counts[idx] += 1;
}
};// In model forward(), e.g. src/models/qwen35.zig
var t = self.perf.start();
self.be.gemv(x, w, y, n, k);
self.be.sync(); // Flush GPU work (ensures timing is accurate)
self.perf.end(.gemv_qkv, t);Key: GPU work is deferred. Without sync(), you'd measure only the CPU dispatch time (~5 µs), not the actual GPU execution time.
Trade-off: sync() per operation serializes execution → 50% throughput loss. This is why profiling is only enabled by the --profile flag.
Each operation in the forward pass is wrapped with start()/end():
// src/models/qwen35.zig — attention layer
var t = self.perf.start();
self.be.rmsNorm(x, w, eps);
self.perf.end(.rms_norm, t);
t = self.perf.start();
self.be.gemvMulti(qkv_ops);
self.perf.end(.gemv_qkv, t);
t = self.perf.start();
self.be.rope(q, k, freqs, pos);
self.perf.end(.rope, t);
t = self.perf.start();
self.be.sdpa(q, k, v, out, ...);
self.perf.end(.sdpa, t);After generation completes, perf.report() prints a table with call counts, total time, average time, and percentage breakdown per operation.
pub const MetalBackend = struct {
dispatch_count: u32 = 0,
barrier_count: u32 = 0,
sync_count: u32 = 0,
profile_counters: bool = false,
// ...
};
fn encode(...) void {
// ... dispatch kernel ...
if (self.profile_counters) self.dispatch_count += 1;
// ... insert barrier ...
if (!self.batch_mode) {
// ... barrier ...
if (self.profile_counters) self.barrier_count += 1;
}
}
fn flush() void {
// ... commit command buffer ...
if (self.profile_counters) self.sync_count += 1;
}Reset per token:
pub fn resetCounters(self: *MetalBackend) void {
if (self.profile_counters) {
self.dispatch_count = 0;
self.barrier_count = 0;
self.sync_count = 0;
}
}Print at end of token:
if (g_profile) {
const info = be.backendInfo();
std.log.info("Metal: {d} dispatches, {d} barriers, {d} syncs",
.{backend.metal.dispatch_count, backend.metal.barrier_count, backend.metal.sync_count});
}Dispatch count:
- High (>1000): Many small kernels, dispatch overhead may dominate
- Optimal (300-600): Batched/fused ops, minimal overhead
- Too low (<100): Likely missing parallelism opportunities
Barrier count:
- High (>1000): Serialized execution, GPU can't overlap work
- Optimal (300-700): Batching used where possible
- Too low (<100): Risky — may be missing necessary synchronization
Sync count:
- High (>10): Excessive CPU/GPU round-trips, throughput loss
- Optimal (1-3): Only at necessary points (argmax, embedding lookup)
- Zero: Suspicious — CPU likely reading stale GPU data
Example: Qwen3.5 optimization reduced syncs from 18 → 1 per token (+15% throughput).
Golden rule: GPU backends must never silently fall back to CPU. Missing kernels must @panic with a clear error message.
pub fn gemvMlxQ(self: *MetalBackend, x: [*]const f32, weight: [*]const u8, scales: [*]const u8, biases: [*]const u8, y: [*]f32, n: usize, k: usize, bits: u32) void {
const pipeline = switch (bits) {
4 => self.pipe_gemv_mlx_q4,
8 => self.pipe_gemv_mlx_q8,
6 => @panic("Metal MLX 6-bit GEMV not implemented — use --backend cpu or convert to 4-bit"),
else => @panic("Unsupported MLX bit width"),
};
// ... dispatch ...
}Error message requirements:
- What's missing: "Metal MLX 6-bit GEMV not implemented"
- Workaround: "use --backend cpu"
- Alternative: "or convert to 4-bit"
Alternative (silent fallback):
pub fn gemvMlxQ(...) void {
if (bits == 6) {
// Silently fall back to CPU
self.be.sync(); // Flush GPU
cpuGemvMlxQ(...); // Run on CPU
return;
}
// ... GPU path ...
}Problem: User expects GPU performance, gets CPU performance, doesn't realize until they profile. Silent regressions are the worst kind.
With @panic:
$ ./agave model-6bit-mlx.gguf "Hello"
thread 1 panic: Metal MLX 6-bit GEMV not implemented — use --backend cpu or convert to 4-bit
User immediately knows there's an issue and has clear next steps.
Only two cases allow CPU fallback:
pub fn embLookup(self: *MetalBackend, table: TensorData, token_id: u32, output: [*]f32, dim: usize) void {
// Fallback to CPU: single-row lookup is faster on CPU than GPU dispatch overhead
dequantToF32(table.data[token_id * rowBytes(dim, table.dtype) ..], output, dim);
}Why CPU is faster:
- GPU dispatch overhead: ~10 µs
- Single-row dequant on CPU: ~2 µs (SIMD)
- GPU would be faster for batch embedding lookup, but not single-token decode
const softmax_cpu_threshold: usize = 128;
pub fn softmax(self: *MetalBackend, data: [*]f32, n: usize) void {
if (n < softmax_cpu_threshold) {
// CPU fallback: dispatch overhead dominates for tiny softmax
cpuSoftmax(data, n);
return;
}
// ... GPU path ...
}Why threshold?
- GPU dispatch: ~10 µs base cost
- Softmax(128): ~2 µs on CPU SIMD
- Softmax(1024): ~15 µs on CPU, ~3 µs on GPU (worth the dispatch)
Both exceptions are documented with comments explaining the performance justification.
-
Establish baseline: Run with
--profileon main branchgit checkout main ./zig-out/bin/agave model.gguf --profile "Test" > baseline.txt
-
Test change: Run with
--profileon feature branchgit checkout feature ./zig-out/bin/agave model.gguf --profile "Test" > feature.txt
-
Compare:
diff baseline.txt feature.txt
Look for:
- Increased dispatch/barrier/sync counts
- Slower individual operations
- New operations (unexpected CPU fallbacks?)
-
Isolate: Comment out parts of the change to identify the culprit
-
Fix: Once identified, fix the regression
-
Verify: Re-run profile, confirm counters match baseline
Before (baseline):
Metal: 690 dispatches, 690 barriers, 1 sync
Token time: 71ms (14.1 tok/s)
After (regression):
Metal: 706 dispatches, 930 barriers, 17 syncs
Token time: 83ms (12.0 tok/s)
Analysis:
- +16 dispatches → something new is being dispatched
- +240 barriers → batching was removed somewhere
- +16 syncs → major red flag — CPU/GPU round-trips added
Investigation: 16 syncs = 16 DeltaNet layers. Check DeltaNet code.
Root cause: Q/gate split moved from GPU kernel to CPU memcpy:
// REGRESSION: CPU memcpy requires sync before and after
self.be.sync(); // Sync 1 (GPU → CPU)
for (0..nh) |h| {
@memcpy(...); // CPU memcpy
}
// Next GPU op needs data → sync 2 (CPU → GPU)Fix: Move split to GPU kernel (eliminates 16 syncs/token).
Agave doesn't currently use Tracy, but here's how you'd integrate it:
// build.zig
const tracy = b.dependency("tracy", .{});
exe.linkLibrary(tracy.artifact("tracy"));
exe.addCSourceFile(.{ .file = tracy.path("public/TracyClient.cpp"), .flags = &.{"-DTRACY_ENABLE"} });const tracy = @cImport(@cInclude("tracy/Tracy.hpp"));
pub fn gemv(...) void {
const zone = tracy.ZoneScoped();
defer tracy.ZoneEnd(zone);
// ... operation ...
}./tracy-profiler # GUI shows flamegraph, GPU timelines, memory allocationsBenefits:
- Visual timeline (see parallelism, gaps)
- GPU queue visualization
- Memory allocation tracking
Cost: ~5-10% overhead (lower than --profile because no forced syncs).
./agave model.gguf --profile "Test" 2>&1 | grep "ms" | sort -rn -k2Output (sorted by time):
lm_head_gemv: 12.1ms
layer 15: 8.2ms
layer 0: 8.1ms
gemv×3: 6.2ms
sdpa: 1.3ms
rmsNorm: 0.3ms
Interpretation: lm_head_gemv is the bottleneck (vocab projection, large matrix).
# In CI pipeline
./agave model.gguf --profile "Test" > current.txt
./agave-baseline model.gguf --profile "Test" > baseline.txt
# Extract sync count
current_syncs=$(grep "syncs" current.txt | awk '{print $NF}')
baseline_syncs=$(grep "syncs" baseline.txt | awk '{print $NF}')
if [ "$current_syncs" -gt "$baseline_syncs" ]; then
echo "Regression: sync count increased from $baseline_syncs to $current_syncs"
exit 1
fiPrevents: Silent performance regressions from merging.
Combining --profile with --megakernel shows the impact of kernel fusion on dispatch and barrier counts:
# Standard dispatch
./agave model.gguf --profile "Test"
# Metal: 994 dispatches, 690 barriers, 1 sync
# With fused FFN megakernel (Tier 1)
./agave model.gguf --profile --megakernel "Test"
# Metal: 946 dispatches, 642 barriers, 1 sync
# (48 fewer dispatches = 24 layers x 2 saved per FFN)
# With true megakernel (Tier 2, when available for model+quant)
# Metal: ~30 dispatches, ~30 barriers, 1 sync
# (entire layer runs as single dispatch)True megakernels show the most dramatic reduction -- dispatch count drops from hundreds to roughly n_layers + overhead. This is because each layer becomes a single dispatch with internal mega_grid_sync atomic barriers replacing Metal memory barriers.
# Compare two quantization formats
./agave model-q4.gguf --profile "Test" | grep "Token time"
./agave model-mlx.gguf --profile "Test" | grep "Token time"
# Compare two backends
./agave model.gguf --backend Metal --profile "Test" | grep "dispatches"
./agave model.gguf --backend CPU --profile "Test" | grep "layer 0"When investigating slow performance:
- Run with
--profileto get baseline numbers - Check sync count (should be ≤3 per token)
- Check dispatch count (should be 300-600 for typical model)
- Identify slowest operation (sort profiling output)
- Compare against expected performance (other quantization formats, backends)
- Check for unexpected CPU fallbacks (CPU time in GPU-expected ops)
- Verify batching is used (gemvMulti, rmsNormMulti, etc.)
- Check for missing fusion opportunities (sequential ops that could be fused)
- Profile before optimizing: Measure first, optimize second
- One change at a time: Isolate what caused the improvement/regression
- Keep baseline numbers: Document expected performance for each model+backend combo
- Benchmark on merge: Run performance suite on every PR
- Regression threshold: Fail CI if throughput drops >5%
- Track over time: Graph performance trends (detect gradual degradation)
- Never use --profile in production: 50% throughput loss
- Use metrics instead: Log tokens/sec, TTFT, latency percentiles
- A/B test optimizations: Roll out changes to subset of traffic first
In the code: src/perf.zig (profiling infrastructure), src/backend/metal.zig (dispatch counters), src/main.zig (--profile flag handling)
Related: Chapter 11: Metal Backend Internals, Chapter 13: Batched Dispatch and Fusion
Next: Appendix: Atomic Operations → | Back: Appendix: Compile-Time Optimization ←