
feat(inference): GDN snapshot-per-window speculative rollback + n-gram/MTP verify #176

@ohdearquant

Description

Speculative decoding for the hybrid model. The hard part is GDN rollback: once a draft token is rejected, the d×d recurrent state S already contains the rejected updates and is invalid. Do NOT invert the recurrence. Instead, snapshot once per speculative window and, on rejection, restore the snapshot and replay the accepted prefix. The 0.8B checkpoint ships an MTP head (mtp_num_hidden_layers=1); verify that it is usable before training anything.
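
A minimal sketch of the snapshot / restore / replay idea on a toy state, to make the invariant concrete. Everything here is hypothetical: `ToyState`, `verify_window`, the gated update inside `step`, and the way key/value vectors are derived from a token id are stand-ins for illustration, not the crate's GDN kernel or API.

```rust
/// Toy stand-in for one GDN layer's d x d recurrent state (row-major).
/// Hypothetical type for illustration only.
#[derive(Clone, Debug, PartialEq)]
pub struct ToyState {
    d: usize,
    s: Vec<f32>,
}

impl ToyState {
    pub fn new(d: usize) -> Self {
        Self { d, s: vec![0.0; d * d] }
    }

    /// Illustrative gated update: S <- a*S + b*(v - S k) k^T.
    /// k and v are derived from the token id purely for the demo.
    pub fn step(&mut self, token: u32) {
        let d = self.d;
        let (a, b) = (0.9f32, 0.5f32);
        let k: Vec<f32> = (0..d).map(|j| ((token as usize + j) % 7) as f32 / 7.0).collect();
        let v: Vec<f32> = (0..d).map(|i| ((token as usize * 3 + i) % 5) as f32 / 5.0).collect();
        // pred = S k, computed from the state before this step's update.
        let pred: Vec<f32> = (0..d)
            .map(|i| (0..d).map(|j| self.s[i * d + j] * k[j]).sum::<f32>())
            .collect();
        for i in 0..d {
            let err = v[i] - pred[i];
            for j in 0..d {
                self.s[i * d + j] = a * self.s[i * d + j] + b * err * k[j];
            }
        }
    }
}

/// Advance over a speculative window of `draft` tokens, of which the first
/// `accepted` were accepted by the verifier.
pub fn verify_window(state: &mut ToyState, draft: &[u32], accepted: usize) {
    let accepted = accepted.min(draft.len());
    // One snapshot per window, taken before any draft token touches the state.
    let snapshot = state.clone();
    // The verify pass advances the state over every draft position.
    for &t in draft {
        state.step(t);
    }
    if accepted < draft.len() {
        // Rejection: restore the snapshot, then replay only the accepted prefix.
        *state = snapshot;
        for &t in &draft[..accepted] {
            state.step(t);
        }
    }
}
```

Because restore and replay run the same forward steps in the same order, the resulting state matches plain sequential decoding of the accepted tokens exactly, with no inverse update needed. In the real path the snapshot would also have to cover the conv state and the KV cursor markers listed in the tasks.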

Tasks

  • snapshot ring (2-3 slots): copy 18 GDN states (9 MiB f16, ~50-150µs) + KV cursor markers per window
  • verify-pass: GQA batched attention over K positions + GDN micro-chunk recurrence (one S read/write, K-step loop in registers — NOT K separate kernels)
  • reject → restore snapshot + reset_fast() KV + replay accepted prefix
  • n-gram speculator first (zero training); then verify MTP head usability
  • log acceptance distribution + replay length + snapshot cost
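
The n-gram draft, greedy verification, and logging tasks above can be sketched roughly as follows. This is an assumption-laden illustration: `ngram_draft`, `accepted_prefix_len`, and `SpecStats` are hypothetical names and are not the existing NgramSpeculator or MtpVerifier interfaces, and the matching rule (most recent earlier occurrence of the last `n` tokens) is only one plausible choice.

```rust
/// Propose up to `k` draft tokens by matching the last `n` context tokens
/// against their most recent earlier occurrence in the same context.
pub fn ngram_draft(context: &[u32], n: usize, k: usize) -> Vec<u32> {
    if n == 0 || context.len() <= n {
        return Vec::new();
    }
    let suffix = &context[context.len() - n..];
    // Scan start positions from newest to oldest, excluding the suffix itself.
    for start in (0..context.len() - n).rev() {
        if &context[start..start + n] == suffix {
            let from = start + n;
            let to = (from + k).min(context.len());
            return context[from..to].to_vec();
        }
    }
    Vec::new()
}

/// Length of the draft prefix that agrees with the target model's greedy tokens.
pub fn accepted_prefix_len(draft: &[u32], target_greedy: &[u32]) -> usize {
    draft
        .iter()
        .zip(target_greedy)
        .take_while(|(d, t)| d == t)
        .count()
}

/// Per-run counters for the acceptance distribution and replay length.
#[derive(Debug, Default)]
pub struct SpecStats {
    /// accept_hist[i] = number of windows in which exactly i draft tokens were accepted.
    pub accept_hist: Vec<u64>,
    /// Total tokens replayed after rejections.
    pub replayed_tokens: u64,
    pub windows: u64,
}

impl SpecStats {
    pub fn record(&mut self, k: usize, accepted: usize) {
        let accepted = accepted.min(k);
        if self.accept_hist.len() < k + 1 {
            self.accept_hist.resize(k + 1, 0);
        }
        self.accept_hist[accepted] += 1;
        // A partial accept triggers restore + replay of the accepted prefix.
        if accepted < k {
            self.replayed_tokens += accepted as u64;
        }
        self.windows += 1;
    }
}
```

For example, with context `[1, 2, 3, 4, 1, 2]`, `n = 2`, and `k = 4`, the draft is the continuation that followed the earlier `[1, 2]`. Snapshot cost would be timed separately around the state copy and is not modeled here.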

Acceptance (ADR-064 gates)

  • default-on only if effective acceptance ≥0.75 (K=4) after replay accounting — else opt-in for code/repetitive workloads
  • state after spec decode == non-spec decode under greedy (S, conv, KV cursor, logits)
  • p50 speedup ≥1.05, p10 ≥1.00 to enable
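
One possible reading of how these gates combine, as a sketch. The thresholds come from the list above; the `Rollout` enum, the `decide` function, the nearest-rank percentile, and the rule that a failed speedup gate disables the feature outright are assumptions, and the effective acceptance rate is taken as an input because its exact replay accounting is not defined here.

```rust
/// Hypothetical rollout outcome for the speculative path.
#[derive(Debug, PartialEq)]
pub enum Rollout {
    DefaultOn,
    OptIn,
    Disabled,
}

/// Nearest-rank percentile (p in 0..=100) over a sample of speedup ratios.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Apply the gates: p50 >= 1.05 and p10 >= 1.00 to enable at all,
/// and effective acceptance >= 0.75 (K=4) to be default-on.
pub fn decide(effective_acceptance: f64, speedups: &[f64]) -> Rollout {
    let (p50, p10) = match (percentile(speedups, 50.0), percentile(speedups, 10.0)) {
        (Some(p50), Some(p10)) => (p50, p10),
        // No measurements: never enable.
        _ => return Rollout::Disabled,
    };
    if p50 < 1.05 || p10 < 1.00 {
        return Rollout::Disabled;
    }
    if effective_acceptance >= 0.75 {
        Rollout::DefaultOn
    } else {
        Rollout::OptIn
    }
}
```

The state-equality gate (S, conv, KV cursor, logits under greedy) is a separate correctness check and would be asserted in tests rather than folded into this decision.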

Ref: d3 §5, §6, §7, §8. Builds on NgramSpeculator/MtpVerifier/reset_fast(). Highest-risk experiment; kill early if acceptance is <0.72.

Labels: enhancement, lattice-inference