Implements ADR-059 (Composable Layer Architecture) per the d4 design: ModelSpec → KernelGraph (IR) → fused KernelPlan tape. Dynamic traits exist only at construction/lowering; the decode loop executes a pre-lowered backend tape with zero dyn dispatch. North-star: architecture exploration without sacrificing hand-fused Metal speed.
Tasks
Acceptance — the zero-cost gate (the regression to prevent)
- composed Qwen3.5 path lowers to the same
PlanFingerprint as the hand path: identical kernel list, dispatch count, command-buffer count
- decode tok/s ≥98.5% geo-mean (≥97% per ctx bucket); CI fails if plan contains
GenericLinear/GenericAttention/HostCopy/UnfusedResidual
- demo: swap RoPE→NoPE / GQA→GDN as config
Ref: d4 (whole), ADR-059. Can start in parallel after the ADR-064 baseline is frozen.
Implements ADR-059 (Composable Layer Architecture) per the d4 design:
ModelSpec → KernelGraph (IR) → fused KernelPlan tape. Dynamic traits exist only at construction/lowering; the decode loop executes a pre-lowered backend tape with zerodyndispatch. North-star: architecture exploration without sacrificing hand-fused Metal speed.Tasks
ModelSpec/AttentionSpec(TOML/JSON) — express current Qwen3.5 hybrid asinterleave{full_attention_interval=4}; validation must allow head_dim 256 (not just 64/80/96/128)AttentionVarianttrait (validate/emit_ir/metal_kernel_key/template) — new variant adds 1 trait + 1 MSL template, no decode-loop editsAcceptance — the zero-cost gate (the regression to prevent)
PlanFingerprintas the hand path: identical kernel list, dispatch count, command-buffer countGenericLinear/GenericAttention/HostCopy/UnfusedResidualRef: d4 (whole), ADR-059. Can start in parallel after the ADR-064 baseline is frozen.