The KV cache is a very big variable-size allocation in LLM inference. Its size is unknown at startup and proportional to sequence length. Using standard array-like objects for storing the cache can waste large amounts of RAM or force expensive copies. In this article we will consider what uzu did originally, what compromise we had, and what decision we came to in the end.
During decoding each transformer layer maintains a cache of keys and values for every token processed so far. When the model processes token N, it needs keys and values for every previous token 0..N-1, so the cache has shape
[sequence_length, num_kv_heads, head_dim]
where num_kv_heads and head_dim are fixed constants, but sequence_length is not fixed. It represents all processed tokens at the moment.
The simplest approach is to allocate the full worst-case buffer, whose size is
max_sequence_length * num_kv_heads * head_dim * element_size
Since the worst case is covered, no reallocations or copies are needed. Additionally, we get a stable buffer pointer and straightforward logic.
But it is obvious that too much memory is being allocated. Let’s calculate how much memory is needed for Qwen3 4B at bf16 with context ~41k tokens.
36 layers * 40960 tokens * 8 heads * 128 dims * 2 buffers * 2 bytes = 5.625 GiB
2 buffers because one is for keys and one is for values
Even with small interactions with the model, which can fit into e.g. 100 tokens, ~5.6 GiB will be allocated instead of ~14 MiB for 100 tokens. And it’s only for KV cache, on top of the model weights and other allocations.
Another approach is to allocate smaller buffers and extend them on demand. This keeps memory proportional to the current sequence length, but brings latency to the decoding process.
Let’s consider the same Qwen3 4B model. If a grow step adds 4096 tokens, each layer’s KV cache grows by 16 MiB, so copying 36 layers means a 576 MiB GPU blit at a random moment during the inference.
There is a third, software-side approach known as PagedAttention, introduced by vLLM. It takes the idea of virtual memory and paging from operating systems.
KV cache splits into fixed-size blocks, each block stores keys and values for a fixed number of tokens (e.g. 16). These blocks do not need to be contiguous in physical GPU memory. Each sequence keeps a block table that maps its logical block indices to physical block addresses, exactly like an OS page table maps virtual pages to physical frames. The attention kernel is modified to read keys and values through this indirection, so it sees a contiguous logical view while the physical blocks are scattered across memory.
This approach avoids both kinds of memory waste:
- External fragmentation is eliminated. Every block is the same size, so freed blocks from finished sequences can always be reused by any other sequence — no contiguous-region gaps that nothing fits into.
- Internal fragmentation is bounded to at most one block per sequence. Only the last, partially-filled block wastes space, instead of the entire pre-allocated worst-case buffer.
The cost is that attention is no longer a read over a flat buffer — the kernel must be rewritten to walk the block table, and an allocator must manage the block pool. This is a software-level reimplementation of what virtual memory already does in hardware. That observation is what motivates the next approach: instead of emulating paging in the kernel, use the paging the GPU already provides.
Solutions 1 and 2 are simple, but both are compromises between speed and memory. Paged attention solves both, but at the price of a custom attention kernel and block allocator. We used the maximum pre-allocated buffer because of simplicity and speed is a crucial indicator for us.
Two terms up front. A sparse buffer is a buffer whose virtual address range exists but has no physical memory behind it until you explicitly back individual pages — exactly the page-table indirection from Solution 3, but provided by the GPU instead of emulated in the kernel. A placement heap is a pool of physical memory you allocate yourself and then map, page by page, into those sparse buffers. So the sparse buffer is the address space, the placement heap is the physical pages, and mapping connects the two.
Starting from Metal 4, devices that support placement sparse resources can use sparse buffers as growing arrays without copy operations. This is a hardware capability, not only an SDK version switch, so check it at runtime before using the API.
device.supportsPlacementSparseThe workflow contains 3 steps: create a sparse buffer, create a placement heap, map pages. Take a closer look at them.
Let’s consider default CPU-accessible buffer allocation
import Metal
let device = MTLCreateSystemDefaultDevice()!
let bufferSize = 128 * 1024 * 1024
let denseBuffer = device.makeBuffer(
length: bufferSize,
options: .storageModeShared
)!It’s possible to watch memory consumption using the vmmap utility
> vmmap -summary PID
VIRTUAL RESIDENT DIRTY
REGION TYPE SIZE SIZE SIZE
=========== ======= ======== =====
...
IOAccelerator (graphics) 128.4M 64K 64K
...Virtual size — the address space reserved by the process. It does not mean the process is actually using that much physical RAM. A large virtual region may be unmapped-on-demand, shared, compressed, swapped out, or never touched.
Resident size — the portion of that virtual memory that currently has physical memory backing it and is present in RAM. This is closer to “how much real memory is involved right now,” but it can include shared pages, so it is not always equal to private RAM cost.
Dirty size — resident memory that has been modified and cannot simply be discarded/reloaded from the original file. Dirty pages usually represent private heap/stack data, copy-on-write pages that were written to, or modified file-backed mappings. Dirty memory is often the most important number when estimating how much memory the process uniquely pressures the system with.
Now let’s consider how to create sparse buffer
let sparsePageSize = MTLSparsePageSize.size256
let sparseBuffer = device.makeBuffer(
length: bufferSize,
options: .storageModePrivate,
placementSparsePageSize: sparsePageSize
)!MTLSparsePageSize — size in kilobytes.
Flag MTLResourceOptions.storageModePrivate makes this buffer GPU-private, which is faster for compute but requires explicit synchronization for CPU access. For our KV cache case, CPU access doesn’t matter since reads and writes happen only in kernels. Note that after buffer creation there is no allocation of 128 MiB.
> vmmap -summary PID
VIRTUAL RESIDENT DIRTY
REGION TYPE SIZE SIZE SIZE
=========== ======= ======== =====
...
IOAccelerator (graphics) 384K 64K 64K
...Specifying placementSparsePageSize makes this buffer sparse. It’s not possible to create a sparse buffer with the MTLResourceOptions.storageModeShared option.
Just after buffer creation, buffer reads return zeros and writes are no-op.
Let’s create a heap of 512 MiB
let heapDesc = MTLHeapDescriptor()
heapDesc.maxCompatiblePlacementSparsePageSize = sparsePageSize
heapDesc.size = 512 * 1024 * 1024
heapDesc.sparsePageSize = sparsePageSize
heapDesc.storageMode = .private
heapDesc.type = .placement
let heap = device.makeHeap(descriptor: heapDesc)!Same as for the buffer, MTLStorageMode must be private.
As for MTLHeapType, it must be placement since sparse is used for textures.
Now vmmap shows the next results
> vmmap -summary PID
VIRTUAL RESIDENT DIRTY
REGION TYPE SIZE SIZE SIZE
=========== ======= ======== =====
...
IOAccelerator (graphics) 384K 64K 64K
...
owned unmapped (graphics) 512.0M 16K 16K
...owned unmapped (graphics) — graphics-related memory that is charged to your process, but is not currently mapped into your process’s virtual address space as a normal CPU-visible region.
The heap exists and is charged to the process as graphics memory, but it is not mapped into the CPU-visible address space and does not appear as resident memory in this measurement yet.
The point of the operation is that parts (pages) of the buffer are mapped to the heap. Reads and writes on mapped buffer pages allow operations to proceed as expected. To execute mapping or unmapping, it is necessary to create an MTL4CommandQueue object.
let sparsePageSizeBytes = 256 * 1024
let totalSparseBufferPages = sparseBuffer.length / sparsePageSizeBytes
let operation = MTL4UpdateSparseBufferMappingOperation(
mode: .map,
bufferRange: NSRange(location: 0, length: totalSparseBufferPages),
heapOffset: 0
)
let commandQueue4 = device.makeMTL4CommandQueue()!
commandQueue4.updateMappings(
buffer: sparseBuffer,
heap: heap,
operations: [operation]
)updateMappings takes an array of MTL4UpdateSparseBufferMappingOperation.
Very important to note that bufferRange and heapOffset are counted not in bytes, but in pages!
bufferRange specifies which pages of buffer will be mapped (or unmapped).
heapOffset specifies the heap offset in pages to which the buffer will be mapped (or unmapped).
mode could be only map and unmap.
The mapping operation is enqueued on the GPU timeline. It’s asynchronous with respect to CPU code, but Metal guarantees that the mapping completes before any subsequent command buffer on the same MTL4CommandQueue accesses those pages. If compute work is submitted through MTLCommandQueue, you must explicitly synchronize it with the MTL4CommandQueue mapping work before those command buffers access the mapped pages.
After mapping, you can use the sparse buffer as a default buffer.
Very important to note that one MTLBuffer can be mapped to one or multiple MTLHeap instances, and one MTLHeap can be mapped to one or multiple MTLBuffer instances, as shown in the picture.
There is also no restriction on which heaps are mapped to one buffer, so a new buffer such as buf 3 can be mapped across free pages in existing heaps to prevent fragmentation.
It allows us to create a pool of heaps, map buffers to them, unmap buffers from them, add new heaps if there are no free pages in the existing pool, and remove a heap from the pool if no pages are mapped to it. This approach takes advantage of both solutions described above: memory allocation on demand without copies.
Recently we started using sparse buffers for KV cache storage. In the beginning, for each transformer layer we allocate 2 sparse buffers (for keys and values) with size for maximal sequence length: max_sequence_length * num_kv_heads * head_dim * element_size.
As mentioned above, at this step no physical memory is allocated for those buffers. Then, before each forward pass, the necessary number of pages are mapped to heaps and new heaps are allocated if needed.
According to our benchmarks using sparse buffers brings no tangible impact on performance, but significantly reduces memory consumption.
The KV cache memory problem has two naive solutions — grow-and-copy (slow) or pre-allocate-max (wasteful). Paged attention fixes both by emulating virtual memory in software, but at the cost of a rewritten attention kernel and a block allocator. Metal's placement heap API provides the same paging as a hardware primitive: reserve a stable virtual address range up front, back it with physical pages only as the sequence extends, and return those pages when the session ends — with no kernel changes.
The result is an inference engine that can advertise a big context window without committing many GBs of RAM at session start, keeping short conversations cheap and long ones possible.

