feat(inference): composable ModelSpec to IR to KernelPlan with PlanFingerprint zero-cost gate (ADR-059) #177

@ohdearquant

Description

Implements ADR-059 (Composable Layer Architecture) per the d4 design: ModelSpec → KernelGraph (IR) → fused KernelPlan tape. Dynamic traits exist only at construction and lowering time; the decode loop executes a pre-lowered backend tape with zero dyn dispatch. North star: architecture exploration without sacrificing hand-fused Metal speed.
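To make the "dyn only at lowering, none in the decode loop" split concrete, here is a minimal CPU-only sketch. Every name in it (`LayerSpec`, `TapeOp`, `lower_all`, `run_tape`) is hypothetical and stands in for the real ModelSpec, KernelPlan, and Metal dispatch types, which this issue does not define. The point is only the shape: trait objects are consulted once while building the tape, and the hot loop is a `match` over a closed enum.

```rust
/// Hypothetical tape op. A closed enum lets the hot loop use `match`
/// instead of a vtable call.
#[derive(Debug, Clone, PartialEq)]
pub enum TapeOp {
    Scale(f32),
    AddBias(f32),
    Relu,
}

/// Construction-time trait (assumed name). `dyn` is acceptable here
/// because lowering runs once, not once per decoded token.
pub trait LayerSpec {
    fn lower(&self, tape: &mut Vec<TapeOp>);
}

pub struct Affine {
    pub scale: f32,
    pub bias: f32,
}

pub struct Activation;

impl LayerSpec for Affine {
    fn lower(&self, tape: &mut Vec<TapeOp>) {
        tape.push(TapeOp::Scale(self.scale));
        tape.push(TapeOp::AddBias(self.bias));
    }
}

impl LayerSpec for Activation {
    fn lower(&self, tape: &mut Vec<TapeOp>) {
        tape.push(TapeOp::Relu);
    }
}

/// Lowering: the only place that touches trait objects.
pub fn lower_all(layers: &[Box<dyn LayerSpec>]) -> Vec<TapeOp> {
    let mut tape = Vec::new();
    for layer in layers {
        layer.lower(&mut tape);
    }
    tape
}

/// "Decode loop": walks the pre-lowered tape with static dispatch only.
pub fn run_tape(tape: &[TapeOp], mut x: f32) -> f32 {
    for op in tape {
        x = match *op {
            TapeOp::Scale(s) => x * s,
            TapeOp::AddBias(b) => x + b,
            TapeOp::Relu => x.max(0.0),
        };
    }
    x
}
```

In the real design the tape entries would presumably carry pipeline handles and buffer bindings rather than scalars, but the ownership split would be the same: the spec objects can be dropped once the tape exists.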

Tasks

  • ModelSpec/AttentionSpec (TOML/JSON) — express current Qwen3.5 hybrid as interleave{full_attention_interval=4}; validation must allow head_dim 256 (not just 64/80/96/128)
  • minimal kernel-graph IR (RmsNorm/Linear/RoPE/Attention/SwiGLU/ResidualAdd/QuantizeKV/SampleTopK) + epilogue-fusion pass (Linear→Bias?→Act?→Mul?→Residual?)
  • AttentionVariant trait (validate/emit_ir/metal_kernel_key/template): a new variant adds one trait impl + one MSL template, with no decode-loop edits
  • template-MSL + Metal function constants (avoid the 18K-source-variant explosion)
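The epilogue-fusion pattern from the IR task (Linear→Bias?→Act?→Mul?→Residual?) can be sketched as a greedy pass over a node list. This is an illustration under strong assumptions: the IR is treated as a linear chain rather than a graph with data edges, and the `Node` enum, the `FusedLinear` variant, and `fuse_epilogues` are hypothetical names, not the crate's API.

```rust
/// Hypothetical IR node kinds, reduced to what the fusion pattern needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    RmsNorm,
    Linear,
    Bias,
    Act,
    Mul,
    ResidualAdd,
    /// A Linear with whichever optional epilogue stages were absorbed.
    FusedLinear { bias: bool, act: bool, mul: bool, residual: bool },
}

/// Consumes the node at `*i` if it is `kind`, advancing the cursor.
fn take(nodes: &[Node], i: &mut usize, kind: Node) -> bool {
    if nodes.get(*i) == Some(&kind) {
        *i += 1;
        true
    } else {
        false
    }
}

/// Greedy pass: after each Linear, absorb Bias?, Act?, Mul?, Residual?
/// strictly in that order. Nodes that do not fit the pattern pass through.
pub fn fuse_epilogues(nodes: &[Node]) -> Vec<Node> {
    let mut out = Vec::with_capacity(nodes.len());
    let mut i = 0;
    while i < nodes.len() {
        if nodes[i] != Node::Linear {
            out.push(nodes[i].clone());
            i += 1;
            continue;
        }
        i += 1; // step past the Linear itself
        let bias = take(nodes, &mut i, Node::Bias);
        let act = take(nodes, &mut i, Node::Act);
        let mul = take(nodes, &mut i, Node::Mul);
        let residual = take(nodes, &mut i, Node::ResidualAdd);
        out.push(Node::FusedLinear { bias, act, mul, residual });
    }
    out
}
```

Because the stages are matched in a fixed order, a chain such as Linear→Mul→Bias fuses only the Mul and leaves the trailing Bias as a separate node. A CI check of the kind described in the acceptance criteria would then flag any leftover unfused epilogue.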

Acceptance — the zero-cost gate (the regression to prevent)

  • composed Qwen3.5 path lowers to the same PlanFingerprint as the hand path: identical kernel list, dispatch count, command-buffer count
  • decode tok/s ≥98.5% geo-mean (≥97% per ctx bucket); CI fails if plan contains GenericLinear/GenericAttention/HostCopy/UnfusedResidual
  • demo: swap RoPE→NoPE / GQA→GDN as config
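A rough sketch of how the gate above might be checked in CI follows. The three `PlanFingerprint` fields mirror the items named in the acceptance list (kernel list, dispatch count, command-buffer count), but the struct layout and the function names are assumptions. The throughput check takes per-context-bucket ratios of composed tok/s to baseline tok/s; which baseline applies (the hand path or the frozen ADR-064 numbers) is not specified here and is left to the caller.

```rust
/// Assumed shape of a plan fingerprint: the three properties the gate compares.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PlanFingerprint {
    pub kernels: Vec<String>,
    pub dispatch_count: usize,
    pub command_buffer_count: usize,
}

/// Kernel names that must never appear in a lowered plan (from the issue text).
pub const FORBIDDEN: [&str; 4] = [
    "GenericLinear",
    "GenericAttention",
    "HostCopy",
    "UnfusedResidual",
];

/// Returns every forbidden kernel found in the plan, in plan order.
pub fn forbidden_kernels(plan: &PlanFingerprint) -> Vec<&str> {
    plan.kernels
        .iter()
        .map(String::as_str)
        .filter(|k| FORBIDDEN.contains(k))
        .collect()
}

/// Throughput gate over per-bucket ratios (composed tok/s / baseline tok/s):
/// every bucket must reach 0.97 and the geometric mean must reach 0.985.
pub fn throughput_gate(ratios: &[f64]) -> bool {
    if ratios.is_empty() {
        return false; // no measurements cannot pass the gate
    }
    let per_bucket_ok = ratios.iter().all(|&r| r >= 0.97);
    let mean_ln = ratios.iter().map(|r| r.ln()).sum::<f64>() / ratios.len() as f64;
    per_bucket_ok && mean_ln.exp() >= 0.985
}

/// Structural gate: composed plan must equal the hand plan and contain
/// no forbidden kernels.
pub fn zero_cost_gate(hand: &PlanFingerprint, composed: &PlanFingerprint) -> Result<(), String> {
    let bad = forbidden_kernels(composed);
    if !bad.is_empty() {
        return Err(format!("forbidden kernels in plan: {:?}", bad));
    }
    if hand != composed {
        return Err("composed plan fingerprint differs from hand path".to_string());
    }
    Ok(())
}
```

Comparing whole fingerprints with `==` keeps the gate strict: a reordered kernel list or one extra dispatch fails even when throughput happens to look unchanged.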

Ref: d4 (whole), ADR-059. Can start in parallel after the ADR-064 baseline is frozen.

Metadata

    Labels

    enhancement: New feature or request
    lattice-inference: Affects the lattice-inference crate (transformer inference)
