Continuous batching for the hybrid model needs two allocators (implements part of ADR-063 serving): PagedKVAllocator (6 GQA layers, context-linear, quantizable pages) + GdnStateSlabAllocator (18 layers, fixed-size per sequence, checkpointable). Admission control must account for both.
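To make the "context-linear" versus "fixed-size" distinction concrete, here is a rough sizing sketch for the KV side. It uses the bytes/token = 6·2·512·dtype figure from the task list. Reading the factor 2 as K and V, reading 512 as the per-layer KV width, and the helper names themselves are all assumptions for illustration, not part of ADR-063.

```python
import math

GQA_LAYERS = 6    # from the issue: 6 GQA layers hold paged KV
KV_TENSORS = 2    # assumed: one K and one V tensor per layer
KV_WIDTH = 512    # assumed: per-layer KV width (kv_heads * head_dim)


def kv_bytes_per_token(dtype_bytes: int) -> int:
    """bytes/token = 6 * 2 * 512 * dtype, as written in the task list."""
    return GQA_LAYERS * KV_TENSORS * KV_WIDTH * dtype_bytes


def kv_pages_needed(tokens: int, page_tokens: int) -> int:
    """Pages grow linearly with context; the last page may be partly empty."""
    return math.ceil(tokens / page_tokens)


def kv_bytes_for_sequence(tokens: int, page_tokens: int, dtype_bytes: int) -> int:
    """Bytes actually reserved, counting whole pages."""
    pages = kv_pages_needed(tokens, page_tokens)
    return pages * page_tokens * kv_bytes_per_token(dtype_bytes)
```

With a 2-byte dtype this gives 12288 bytes per token, and a 1-byte quantized page format would halve it. The GDN state has no equivalent of `kv_pages_needed`: each sequence takes exactly one slab regardless of context length.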
Tasks
- PagedKVAllocator (fixed pages, block tables, free queues; bytes/token = 6·2·512·dtype) + GdnStateSlabAllocator (1 slab/seq)
- Admission: free_kv_pages ≥ ceil((P+M)/T) AND free_gdn_slabs ≥ 1 AND scratch fits AND adapter-generation compatible; high-watermark (0.85) soft-reservation for interactive, hard for bench
Acceptance (ADR-064 gates)
Ref: d5§7, ADR-063. Study vLLM Hybrid KV Cache Manager + PagedAttention.
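The admission predicate from the task list could be sketched roughly as below. Several things here are assumptions rather than anything the issue states: P is read as prompt tokens, M as max new tokens, and T as tokens per page; "adapter-generation compatible" is modelled as simple equality; and the 0.85 high-watermark is read as a cap on KV page utilization that bench requests must respect (hard) while interactive requests may exceed it (soft). `PoolState`, `Request`, and `can_admit` are hypothetical names.

```python
import math
from dataclasses import dataclass

HIGH_WATERMARK = 0.85  # from the issue; its exact semantics are assumed below


@dataclass
class PoolState:
    free_kv_pages: int
    total_kv_pages: int
    free_gdn_slabs: int
    free_scratch_bytes: int
    adapter_generation: int


@dataclass
class Request:
    prompt_tokens: int       # P (assumed meaning)
    max_new_tokens: int      # M (assumed meaning)
    scratch_bytes: int
    adapter_generation: int
    interactive: bool        # False means a bench request


def can_admit(pool: PoolState, req: Request, page_tokens: int) -> bool:
    """Return True only if both allocators and the side conditions allow it."""
    # KV side: free_kv_pages >= ceil((P + M) / T)
    pages = math.ceil((req.prompt_tokens + req.max_new_tokens) / page_tokens)
    if pool.free_kv_pages < pages:
        return False
    # GDN side: exactly one fixed-size slab per sequence
    if pool.free_gdn_slabs < 1:
        return False
    if pool.free_scratch_bytes < req.scratch_bytes:
        return False
    # Assumed: "compatible" means the same adapter generation
    if pool.adapter_generation != req.adapter_generation:
        return False
    # Assumed watermark reading: hard cap for bench, advisory for interactive
    used_after = pool.total_kv_pages - pool.free_kv_pages + pages
    if not req.interactive and used_after > HIGH_WATERMARK * pool.total_kv_pages:
        return False
    return True
```

Checking both allocators in one predicate keeps a sequence from being admitted with KV pages but no GDN slab, or the reverse. A real scheduler would also need to reserve the resources atomically with the check.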