PagedAttention: Virtual Memory for the KV Cache
LLM requests do not arrive with neat, equal-length memory needs. Their KV caches grow one token at a time, finish unpredictably, and often share prefixes. PagedAttention brought an operating-system answer to that allocator problem: fixed-size logical blocks mapped to physical cache blocks.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A traditional allocator is like reserving an entire hotel floor because a guest might stay for a month. Paging assigns rooms as nights are actually used. The guest sees one continuous stay; the hotel can place rooms wherever capacity exists.
What actually happens
PagedAttention separates a sequence’s logical KV address space from physical GPU blocks. A block table records the mapping. Attention kernels follow that table, so a sequence does not require one contiguous allocation sized for its maximum possible length.
Fixed-size blocks sharply reduce external fragmentation and make allocation incremental. The final block can still have internal slack. Copy-on-write or reference counting lets beams and requests share full prefix blocks until their token paths diverge.
The kernel and allocator must agree on block layout, token count, head geometry, and device placement. Paging is not simply a Python dictionary around a normal attention kernel; the lookup and gather behavior are part of the execution path.
A worked example
Consider three sequences that need 300, 700, and 1,100 tokens of cache, with 128-token blocks. They consume 3, 6, and 9 blocks: capacity for 384, 768, and 1,152 tokens. Waste is confined mostly to the three tail blocks instead of large per-request reservations.
The performance model
Paged allocation increases the number of sequences that fit in HBM, which allows larger effective batches and higher throughput. Its largest gain is often indirect: memory utilization unlocks scheduler concurrency. Block-table indirection and small-block metadata are the cost side.
Expert lens
Block size couples allocator efficiency to kernel efficiency. Prefix sharing also depends on block boundaries: only complete matching blocks are trivially shareable, though some engines support partial reuse. Benchmark fragmentation, lookup overhead, and reuse together.
Where it wins
- Mixed-length online traffic
- Beam search and shared prefixes
- High concurrency under tight HBM budgets
Where it disappoints
- Equating PagedAttention with FlashAttention
- Choosing tiny blocks without metadata analysis
- Ignoring reference counts during cancellation
- Measuring utilization without scheduler throughput
Production checklist
- Size blocks against prompt and decode distributions
- Test allocation under cancellation and preemption
- Validate copy-on-write for branching sequences
- Track physical and logical block ownership
- Stress block exhaustion and reclamation
What to measure
- Allocated blocks and free blocks
- Internal slack in tail blocks
- Allocation and eviction latency
- Shared-block reference counts
- Requests admitted per GPU
From one GPU to a production service
A prototype allocator only needs allocate and free. A production block manager needs reservation, ownership, reference counting, copy-on-write, eviction, preemption, swapping, recovery after worker failure, and observability that can explain why admission failed.
Multi-tenant fairness enters through memory. One long-context customer can occupy blocks long after shorter requests arrive. Per-class limits and age-aware admission prevent a large context from either monopolizing HBM or starving forever.
At cluster scale, block events can feed a router, but event delay matters. A router acting on stale ownership may send a request to a worker that already evicted the prefix. Score cache evidence with freshness and always combine it with live load.
Design-review questions
- Are blocks reserved before compute begins?
- What invariant prevents use-after-free across shared prefixes?
- How does preemption choose swap, recompute, or reject?
- What is the maximum stale interval for routing metadata?
- Can one tenant exhaust physical blocks despite request limits?
How it connects to the rest of the series
KV caching supplies the state being paged. Continuous batching consumes the extra concurrency. Prefix reuse shares blocks, while memory offloading adds slower tiers beneath the GPU pool.
From equation to implementation
The block table is effectively a page table for KV state. Logical block i of a sequence maps to a physical block identifier. The attention kernel translates logical positions while loading K and V, so physical blocks may be scattered without changing the model’s causal view.
Allocation must be synchronized with scheduling. Before an iteration, the scheduler determines how many new tokens may be produced and reserves the corresponding blocks. If reservation fails after compute starts, recovery is harder. Mature systems make block admission a precondition for execution.
Implementation sketch
needed = blocks_for(prompt_tokens + max_step_growth)
if free_blocks < needed:
preempt_or_queue(request)
logical = sequence.block_table
for token_block in prompt:
logical.append(pool.allocate())
on_branch:
child.share_prefix(logical, refcount=True)
on_write_shared_tail:
child.copy_on_write()
on_finish:
pool.release(logical)Capacity planning
Expected internal waste is bounded by one partially used tail block per live sequence, but branching, reserved growth, and preemption can add more. Simulate with actual sequence traces to estimate peak physical blocks; average utilization hides synchronized bursts.
Benchmarking without fooling yourself
- Replay churn with arrivals, completions, cancellation, and branching.
- Compare block sizes across throughput, slack, and lookup overhead.
- Force near-exhaustion to test preemption and admission behavior.
- Track useful cached tokens per allocated physical token slot.
A production failure to design for
A cancellation path decrements shared-prefix references twice. The allocator reuses a block while another beam still reads it, creating nondeterministic output corruption. Reference ownership needs invariant tests and generation-stamped block handles.
Primary references
The takeaway
PagedAttention made KV memory elastic enough for a real server. Its core achievement is not an attention approximation; it is disciplined address translation.
