feat(batch): heterogeneous scheduler — PagedKVAllocator + GdnStateSlabAllocator (ADR-063) #178

@ohdearquant

Description

Continuous batching for the hybrid model needs two allocators, which together implement part of the ADR-063 serving design: PagedKVAllocator for the 6 GQA layers (memory grows linearly with context; pages are quantizable) and GdnStateSlabAllocator for the 18 GDN layers (fixed-size state per sequence; checkpointable). Admission control must account for both.
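To make the split concrete, here is a minimal sketch of the two allocators, not the ADR-063 design itself. It assumes pages and slabs are identified by plain integer ids and that sequences are keyed by a `u64`; the method names (`grow`, `pages_of`, `acquire`, `release`) are hypothetical, and quantization and checkpointing are left out.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical id types; the real crate may use different ones.
type SeqId = u64;
type PageId = u32;
type SlabId = u32;

/// Fixed-size KV pages handed out from a free queue and tracked
/// per sequence in a block table (usage grows with context length).
struct PagedKvAllocator {
    free_pages: VecDeque<PageId>,
    block_tables: HashMap<SeqId, Vec<PageId>>,
}

impl PagedKvAllocator {
    fn new(num_pages: u32) -> Self {
        Self { free_pages: (0..num_pages).collect(), block_tables: HashMap::new() }
    }

    fn free_page_count(&self) -> usize {
        self.free_pages.len()
    }

    /// Append `n` pages to the sequence's block table.
    /// Allocates nothing and returns false if fewer than `n` pages are free.
    fn grow(&mut self, seq: SeqId, n: usize) -> bool {
        if self.free_pages.len() < n {
            return false;
        }
        let table = self.block_tables.entry(seq).or_default();
        table.extend(self.free_pages.drain(..n));
        true
    }

    fn pages_of(&self, seq: SeqId) -> &[PageId] {
        self.block_tables.get(&seq).map(Vec::as_slice).unwrap_or(&[])
    }

    /// Return all of a sequence's pages to the free queue.
    fn release(&mut self, seq: SeqId) {
        if let Some(pages) = self.block_tables.remove(&seq) {
            self.free_pages.extend(pages);
        }
    }
}

/// One fixed-size recurrent-state slab per sequence, independent of context length.
struct GdnStateSlabAllocator {
    free_slabs: Vec<SlabId>,
    owner: HashMap<SeqId, SlabId>,
}

impl GdnStateSlabAllocator {
    fn new(num_slabs: u32) -> Self {
        // Reversed so that `pop` hands out the lowest id first.
        Self { free_slabs: (0..num_slabs).rev().collect(), owner: HashMap::new() }
    }

    fn free_slab_count(&self) -> usize {
        self.free_slabs.len()
    }

    /// Return the sequence's slab, allocating one if it has none yet.
    fn acquire(&mut self, seq: SeqId) -> Option<SlabId> {
        if let Some(&slab) = self.owner.get(&seq) {
            return Some(slab);
        }
        let slab = self.free_slabs.pop()?;
        self.owner.insert(seq, slab);
        Some(slab)
    }

    fn release(&mut self, seq: SeqId) {
        if let Some(slab) = self.owner.remove(&seq) {
            self.free_slabs.push(slab);
        }
    }
}
```

The asymmetry is the point of the sketch: a sequence's KV footprint is a growing list of pages, while its GDN footprint is a single slab for its whole lifetime, so the scheduler has to check both pools before admitting work.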

Tasks

  • PagedKVAllocator (fixed pages, block tables, free queues; bytes/token = 6·2·512·dtype) + GdnStateSlabAllocator (1 slab/seq)
  • admission: free_kv_pages ≥ ceil((P+M)/T) AND free_gdn_slabs ≥ 1 AND scratch fits AND adapter-generation compatible; high-watermark (0.85) soft-reservation for interactive, hard for bench
  • eviction/preemption: LRU prefix pages → pause → PreemptedRecompute (free KV+GDN, retain CPU history); never evict active KV without recompute
  • decode batch mixes GQA (M=batch GEMV) with GDN (per-seq sequential recurrence)
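The admission rule and the bytes/token formula above can be sketched as follows. This is an illustration under several assumptions that the issue does not spell out: P is read as prompt tokens, M as the maximum number of new tokens, and T as tokens per page; the factor 2 is read as K plus V, and 512 is taken as given. "Adapter-generation compatible" is modelled as simple equality, and the 0.85 high watermark is interpreted as a limit on KV page occupancy after admission, advisory for interactive requests and enforced for bench runs. All type and function names are hypothetical.

```rust
/// KV bytes per token across the 6 GQA layers: 6 * 2 * 512 * dtype bytes.
fn kv_bytes_per_token(dtype_bytes: usize) -> usize {
    6 * 2 * 512 * dtype_bytes
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Interactive,
    Bench,
}

#[derive(Debug, PartialEq)]
enum Admission {
    Admit,
    /// Admitted, but occupancy exceeds the soft high watermark.
    AdmitOverWatermark,
    Reject(&'static str),
}

struct Request {
    prompt_tokens: usize,   // P (assumed meaning)
    max_new_tokens: usize,  // M (assumed meaning)
    scratch_bytes: usize,
    adapter_generation: u64,
}

struct PoolState {
    total_kv_pages: usize,
    free_kv_pages: usize,
    free_gdn_slabs: usize,
    free_scratch_bytes: usize,
    adapter_generation: u64,
    tokens_per_page: usize, // T (assumed meaning); must be non-zero
}

const HIGH_WATERMARK_PERCENT: usize = 85;

fn admit(req: &Request, pool: &PoolState, mode: Mode) -> Admission {
    // ceil((P + M) / T) in integer arithmetic.
    let tokens = req.prompt_tokens + req.max_new_tokens;
    let needed = (tokens + pool.tokens_per_page - 1) / pool.tokens_per_page;

    if pool.free_kv_pages < needed {
        return Admission::Reject("insufficient KV pages");
    }
    if pool.free_gdn_slabs < 1 {
        return Admission::Reject("no free GDN slab");
    }
    if pool.free_scratch_bytes < req.scratch_bytes {
        return Admission::Reject("scratch does not fit");
    }
    // Assumption: compatibility means the generations match exactly.
    if pool.adapter_generation != req.adapter_generation {
        return Admission::Reject("adapter generation mismatch");
    }

    // Occupancy if this request were admitted, compared against 85%.
    let used_after = pool.total_kv_pages - pool.free_kv_pages + needed;
    if used_after * 100 > pool.total_kv_pages * HIGH_WATERMARK_PERCENT {
        return match mode {
            Mode::Interactive => Admission::AdmitOverWatermark,
            Mode::Bench => Admission::Reject("KV high watermark exceeded"),
        };
    }
    Admission::Admit
}
```

For example, with a 2-byte dtype `kv_bytes_per_token(2)` is 12288 bytes. A request with P = 100 and M = 60 on 16-token pages needs 10 pages; against a pool of 100 pages with 20 free, that pushes occupancy to 90%, so under these assumptions an interactive request is admitted with a watermark warning and a bench request is rejected.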

Acceptance (ADR-064 gates)

Ref: d5§7, ADR-063. Study vLLM Hybrid KV Cache Manager + PagedAttention.

Labels

enhancement: New feature or request
lattice-inference: Affects the lattice-inference crate (transformer inference)