Appendix: Profiling and Debugging

Performance regressions are silent — the model still runs, but slower. Profiling makes performance visible. Agave has built-in instrumentation for dispatch counts, barriers, syncs, and per-operation timing.

--profile Flag

Enable profiling: Add --profile to any inference command.

./zig-out/bin/agave model.gguf --profile "Test prompt"

Output per token:

Token 15: "world" (151ms)
  embedLookup: 0.2ms
  layer 0: 8.1ms (rmsNorm: 0.3ms, gemv×3: 6.2ms, rope: 0.1ms, sdpa: 1.3ms, ...)
  layer 1: 8.0ms
  ...
  layer 31: 8.1ms
  final_norm: 0.3ms
  lm_head_gemv: 12.1ms

Metal counters:
  Dispatches: 994
  Barriers: 690
  Syncs: 1

What profiling adds:

Per-operation timing: Each gemv, rmsNorm, sdpa, etc. timed individually
Backend counters: Dispatch/barrier/sync counts (Metal, CUDA, ROCm)
Total time per layer: Aggregated time for each transformer layer

Cost of profiling: ~50% throughput loss due to additional GPU syncs (timing requires flushing command buffers).

When to use:

✅ Debugging performance regressions
✅ Identifying bottlenecks (which op is slow?)
✅ Verifying optimizations (did gemvMulti reduce dispatches?)
❌ Production inference (too slow)

Profiling Implementation

Timing Individual Operations

// src/perf.zig
pub const Op = enum {
    emb_lookup, rms_norm, gemv_qkv, gemv_out, gemv_ffn,
    deinterleave, rope, sdpa, sigmoid_mul, silu_mul,
    gelu_mul, add, deltanet, total_layer,
};

pub const PerfCounters = struct {
    counts: [n_ops]u64 = [_]u64{0} ** n_ops,
    times_us: [n_ops]u64 = [_]u64{0} ** n_ops,
    n_tokens: u64 = 0,
    enabled: bool = false,

    pub inline fn start(self: *PerfCounters) i128 {
        if (!self.enabled) return 0;
        return std.time.nanoTimestamp();
    }

    pub inline fn end(self: *PerfCounters, op: Op, t0: i128) void {
        if (!self.enabled) return;
        const elapsed: u64 = @intCast(@divFloor(std.time.nanoTimestamp() - t0, 1000));
        const idx = @intFromEnum(op);
        self.times_us[idx] += elapsed;
        self.counts[idx] += 1;
    }
};

Instrumented Operation

// In model forward(), e.g. src/models/qwen35.zig
var t = self.perf.start();

self.be.gemv(x, w, y, n, k);
self.be.sync();  // Flush GPU work (ensures timing is accurate)

self.perf.end(.gemv_qkv, t);

Key: GPU work is deferred. Without sync(), you'd measure only the CPU dispatch time (~5 µs), not the actual GPU execution time.

Trade-off: sync() per operation serializes execution → 50% throughput loss. This is why profiling is only enabled by the --profile flag.

Per-Layer Usage

Each operation in the forward pass is wrapped with start()/end():

// src/models/qwen35.zig — attention layer
var t = self.perf.start();
self.be.rmsNorm(x, w, eps);
self.perf.end(.rms_norm, t);

t = self.perf.start();
self.be.gemvMulti(qkv_ops);
self.perf.end(.gemv_qkv, t);

t = self.perf.start();
self.be.rope(q, k, freqs, pos);
self.perf.end(.rope, t);

t = self.perf.start();
self.be.sdpa(q, k, v, out, ...);
self.perf.end(.sdpa, t);

After generation completes, perf.report() prints a table with call counts, total time, average time, and percentage breakdown per operation.

Backend Dispatch Counters

Metal Counters

pub const MetalBackend = struct {
    dispatch_count: u32 = 0,
    barrier_count: u32 = 0,
    sync_count: u32 = 0,
    profile_counters: bool = false,
    // ...
};

fn encode(...) void {
    // ... dispatch kernel ...
    if (self.profile_counters) self.dispatch_count += 1;

    // ... insert barrier ...
    if (!self.batch_mode) {
        // ... barrier ...
        if (self.profile_counters) self.barrier_count += 1;
    }
}

fn flush() void {
    // ... commit command buffer ...
    if (self.profile_counters) self.sync_count += 1;
}

Reset per token:

pub fn resetCounters(self: *MetalBackend) void {
    if (self.profile_counters) {
        self.dispatch_count = 0;
        self.barrier_count = 0;
        self.sync_count = 0;
    }
}

Print at end of token:

if (g_profile) {
    const info = be.backendInfo();
    std.log.info("Metal: {d} dispatches, {d} barriers, {d} syncs",
        .{backend.metal.dispatch_count, backend.metal.barrier_count, backend.metal.sync_count});
}

Interpreting Counts

Dispatch count:

High (>1000): Many small kernels, dispatch overhead may dominate
Optimal (300-600): Batched/fused ops, minimal overhead
Too low (<100): Likely missing parallelism opportunities

Barrier count:

High (>1000): Serialized execution, GPU can't overlap work
Optimal (300-700): Batching used where possible
Too low (<100): Risky — may be missing necessary synchronization

Sync count:

High (>10): Excessive CPU/GPU round-trips, throughput loss
Optimal (1-3): Only at necessary points (argmax, embedding lookup)
Zero: Suspicious — CPU likely reading stale GPU data

Example: Qwen3.5 optimization reduced syncs from 18 → 1 per token (+15% throughput).

Missing Kernel Policy

Golden rule: GPU backends must never silently fall back to CPU. Missing kernels must @panic with a clear error message.

Enforcement

pub fn gemvMlxQ(self: *MetalBackend, x: [*]const f32, weight: [*]const u8, scales: [*]const u8, biases: [*]const u8, y: [*]f32, n: usize, k: usize, bits: u32) void {
    const pipeline = switch (bits) {
        4 => self.pipe_gemv_mlx_q4,
        8 => self.pipe_gemv_mlx_q8,
        6 => @panic("Metal MLX 6-bit GEMV not implemented — use --backend cpu or convert to 4-bit"),
        else => @panic("Unsupported MLX bit width"),
    };
    // ... dispatch ...
}

Error message requirements:

What's missing: "Metal MLX 6-bit GEMV not implemented"
Workaround: "use --backend cpu"
Alternative: "or convert to 4-bit"

Why @panic?

Alternative (silent fallback):

pub fn gemvMlxQ(...) void {
    if (bits == 6) {
        // Silently fall back to CPU
        self.be.sync();  // Flush GPU
        cpuGemvMlxQ(...);  // Run on CPU
        return;
    }
    // ... GPU path ...
}

Problem: User expects GPU performance, gets CPU performance, doesn't realize until they profile. Silent regressions are the worst kind.

With @panic:

$ ./agave model-6bit-mlx.gguf "Hello"
thread 1 panic: Metal MLX 6-bit GEMV not implemented — use --backend cpu or convert to 4-bit

User immediately knows there's an issue and has clear next steps.

CPU Fallback Exceptions

Only two cases allow CPU fallback:

1. embLookup (Single-Row Read)

pub fn embLookup(self: *MetalBackend, table: TensorData, token_id: u32, output: [*]f32, dim: usize) void {
    // Fallback to CPU: single-row lookup is faster on CPU than GPU dispatch overhead
    dequantToF32(table.data[token_id * rowBytes(dim, table.dtype) ..], output, dim);
}

Why CPU is faster:

GPU dispatch overhead: ~10 µs
Single-row dequant on CPU: ~2 µs (SIMD)
GPU would be faster for batch embedding lookup, but not single-token decode

2. Tiny Softmax (Below Threshold)

const softmax_cpu_threshold: usize = 128;

pub fn softmax(self: *MetalBackend, data: [*]f32, n: usize) void {
    if (n < softmax_cpu_threshold) {
        // CPU fallback: dispatch overhead dominates for tiny softmax
        cpuSoftmax(data, n);
        return;
    }
    // ... GPU path ...
}

Why threshold?

GPU dispatch: ~10 µs base cost
Softmax(128): ~2 µs on CPU SIMD
Softmax(1024): ~15 µs on CPU, ~3 µs on GPU (worth the dispatch)

Both exceptions are documented with comments explaining the performance justification.

Debugging Performance Regressions

Workflow

Establish baseline: Run with --profile on main branch

git checkout main
./zig-out/bin/agave model.gguf --profile "Test" > baseline.txt

Test change: Run with --profile on feature branch

git checkout feature
./zig-out/bin/agave model.gguf --profile "Test" > feature.txt

Compare:
```
diff baseline.txt feature.txt
```
Look for:
- Increased dispatch/barrier/sync counts
- Slower individual operations
- New operations (unexpected CPU fallbacks?)
Isolate: Comment out parts of the change to identify the culprit
Fix: Once identified, fix the regression
Verify: Re-run profile, confirm counters match baseline

Example: Identifying a Regression

Before (baseline):

Metal: 690 dispatches, 690 barriers, 1 sync
Token time: 71ms (14.1 tok/s)

After (regression):

Metal: 706 dispatches, 930 barriers, 17 syncs
Token time: 83ms (12.0 tok/s)

Analysis:

+16 dispatches → something new is being dispatched
+240 barriers → batching was removed somewhere
+16 syncs → major red flag — CPU/GPU round-trips added

Investigation: 16 syncs = 16 DeltaNet layers. Check DeltaNet code.

Root cause: Q/gate split moved from GPU kernel to CPU memcpy:

// REGRESSION: CPU memcpy requires sync before and after
self.be.sync();  // Sync 1 (GPU → CPU)
for (0..nh) |h| {
    @memcpy(...);  // CPU memcpy
}
// Next GPU op needs data → sync 2 (CPU → GPU)

Fix: Move split to GPU kernel (eliminates 16 syncs/token).

Tracy Integration

Agave doesn't currently use Tracy, but here's how you'd integrate it:

Build with Tracy

// build.zig
const tracy = b.dependency("tracy", .{});
exe.linkLibrary(tracy.artifact("tracy"));
exe.addCSourceFile(.{ .file = tracy.path("public/TracyClient.cpp"), .flags = &.{"-DTRACY_ENABLE"} });

Instrument Code

const tracy = @cImport(@cInclude("tracy/Tracy.hpp"));

pub fn gemv(...) void {
    const zone = tracy.ZoneScoped();
    defer tracy.ZoneEnd(zone);

    // ... operation ...
}

View Results

./tracy-profiler  # GUI shows flamegraph, GPU timelines, memory allocations

Benefits:

Visual timeline (see parallelism, gaps)
GPU queue visualization
Memory allocation tracking

Cost: ~5-10% overhead (lower than --profile because no forced syncs).

Common Profiling Patterns

Bottleneck Identification

./agave model.gguf --profile "Test" 2>&1 | grep "ms" | sort -rn -k2

Output (sorted by time):

  lm_head_gemv: 12.1ms
  layer 15: 8.2ms
  layer 0: 8.1ms
  gemv×3: 6.2ms
  sdpa: 1.3ms
  rmsNorm: 0.3ms

Interpretation: lm_head_gemv is the bottleneck (vocab projection, large matrix).

Regression Detection (CI)

# In CI pipeline
./agave model.gguf --profile "Test" > current.txt
./agave-baseline model.gguf --profile "Test" > baseline.txt

# Extract sync count
current_syncs=$(grep "syncs" current.txt | awk '{print $NF}')
baseline_syncs=$(grep "syncs" baseline.txt | awk '{print $NF}')

if [ "$current_syncs" -gt "$baseline_syncs" ]; then
    echo "Regression: sync count increased from $baseline_syncs to $current_syncs"
    exit 1
fi

Prevents: Silent performance regressions from merging.

Megakernel Profiling

Combining --profile with --megakernel shows the impact of kernel fusion on dispatch and barrier counts:

# Standard dispatch
./agave model.gguf --profile "Test"
# Metal: 994 dispatches, 690 barriers, 1 sync

# With fused FFN megakernel (Tier 1)
./agave model.gguf --profile --megakernel "Test"
# Metal: 946 dispatches, 642 barriers, 1 sync
# (48 fewer dispatches = 24 layers x 2 saved per FFN)

# With true megakernel (Tier 2, when available for model+quant)
# Metal: ~30 dispatches, ~30 barriers, 1 sync
# (entire layer runs as single dispatch)

True megakernels show the most dramatic reduction -- dispatch count drops from hundreds to roughly n_layers + overhead. This is because each layer becomes a single dispatch with internal mega_grid_sync atomic barriers replacing Metal memory barriers.

Comparative Profiling

# Compare two quantization formats
./agave model-q4.gguf --profile "Test" | grep "Token time"
./agave model-mlx.gguf --profile "Test" | grep "Token time"

# Compare two backends
./agave model.gguf --backend Metal --profile "Test" | grep "dispatches"
./agave model.gguf --backend CPU --profile "Test" | grep "layer 0"

Performance Debugging Checklist

When investigating slow performance:

Run with --profile to get baseline numbers
Check sync count (should be ≤3 per token)
Check dispatch count (should be 300-600 for typical model)
Identify slowest operation (sort profiling output)
Compare against expected performance (other quantization formats, backends)
Check for unexpected CPU fallbacks (CPU time in GPU-expected ops)
Verify batching is used (gemvMulti, rmsNormMulti, etc.)
Check for missing fusion opportunities (sequential ops that could be fused)

Best Practices

Development

Profile before optimizing: Measure first, optimize second
One change at a time: Isolate what caused the improvement/regression
Keep baseline numbers: Document expected performance for each model+backend combo

CI/CD

Benchmark on merge: Run performance suite on every PR
Regression threshold: Fail CI if throughput drops >5%
Track over time: Graph performance trends (detect gradual degradation)

Production

Never use --profile in production: 50% throughput loss
Use metrics instead: Log tokens/sec, TTFT, latency percentiles
A/B test optimizations: Roll out changes to subset of traffic first

In the code: src/perf.zig (profiling infrastructure), src/backend/metal.zig (dispatch counters), src/main.zig (--profile flag handling)

Next: Appendix: Atomic Operations → | Back: Appendix: Compile-Time Optimization ←

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Appendix: Profiling and Debugging

--profile Flag

Profiling Implementation

Timing Individual Operations

Instrumented Operation

Per-Layer Usage

Backend Dispatch Counters

Metal Counters

Interpreting Counts

Missing Kernel Policy

Enforcement

Why @panic?

CPU Fallback Exceptions

1. embLookup (Single-Row Read)

2. Tiny Softmax (Below Threshold)

Debugging Performance Regressions

Workflow

Example: Identifying a Regression

Tracy Integration

Build with Tracy

Instrument Code

View Results

Common Profiling Patterns

Bottleneck Identification

Regression Detection (CI)

Megakernel Profiling

Comparative Profiling

Performance Debugging Checklist

Best Practices

Development

CI/CD

Production

FilesExpand file tree

appendix-profiling.md

Latest commit

History

appendix-profiling.md

File metadata and controls

Appendix: Profiling and Debugging

--profile Flag

Profiling Implementation

Timing Individual Operations

Instrumented Operation

Per-Layer Usage

Backend Dispatch Counters

Metal Counters

Interpreting Counts

Missing Kernel Policy

Enforcement

Why @panic?

CPU Fallback Exceptions

1. embLookup (Single-Row Read)

2. Tiny Softmax (Below Threshold)

Debugging Performance Regressions

Workflow

Example: Identifying a Regression

Tracy Integration

Build with Tracy

Instrument Code

View Results

Common Profiling Patterns

Bottleneck Identification

Regression Detection (CI)

Megakernel Profiling

Comparative Profiling

Performance Debugging Checklist

Best Practices

Development

CI/CD

Production