4/20 - PagedAttention: Virtual Memory for the KV Cache
LLM requests do not arrive with neat, equal-length memory needs. Their KV caches grow one token at a time, finish unpredictably, and often share prefixes. PagedAttention brought an operating-system answer to that allocator problem: fixed-size logical blocks mapped to physical cache blocks.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A traditional allocator is like reserving an entire hotel floor because a guest might stay for a month. Paging assigns rooms as nights are actually used. The guest sees one continuous stay; the hotel can place rooms wherever capacity exists.
How to read this diagram: Start with Logical sequence, where token block table. The middle stage, Block manager, map logical blocks. The final stage, Physical KV pool, shows the observable result: noncontiguous blocks. The arrows describe dependency order, not necessarily separate services.
What actually happens
PagedAttention separates a sequence’s logical KV address space from physical GPU blocks. A block table records the mapping. Attention kernels follow that table, so a sequence does not require one contiguous allocation sized for its maximum possible length.
Fixed-size blocks sharply reduce external fragmentation and make allocation incremental. The final block can still have internal slack. Copy-on-write or reference counting lets beams and requests share full prefix blocks until their token paths diverge.
The kernel and allocator must agree on block layout, token count, head geometry, and device placement. Paging is not simply a Python dictionary around a normal attention kernel; the lookup and gather behavior are part of the execution path.
A worked example
Consider three sequences that need 300, 700, and 1,100 tokens of cache, with 128-token blocks. They consume 3, 6, and 9 blocks: capacity for 384, 768, and 1,152 tokens. Waste is confined mostly to the three tail blocks instead of large per-request reservations.
The performance model
Paged allocation increases the number of sequences that fit in HBM, which allows larger effective batches and higher throughput. Its largest gain is often indirect: memory utilization unlocks scheduler concurrency. Block-table indirection and small-block metadata are the cost side.
How to read this diagram: The left panel asks how PagedAttention changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
Expert lens
Block size couples allocator efficiency to kernel efficiency. Prefix sharing also depends on block boundaries: only complete matching blocks are trivially shareable, though some engines support partial reuse. Benchmark fragmentation, lookup overhead, and reuse together.
How to read this diagram: The left panel is the baseline, Contiguous reservation, characterized by reserve for maximum growth and external fragmentation. The right panel applies Paged allocation, changing the cost profile to allocate fixed blocks on demand and small tail fragmentation. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.
Where it wins
- Mixed-length online traffic
- Beam search and shared prefixes
- High concurrency under tight HBM budgets
Where it disappoints
- Equating PagedAttention with FlashAttention
- Choosing tiny blocks without metadata analysis
- Ignoring reference counts during cancellation
- Measuring utilization without scheduler throughput
Production checklist
- Size blocks against prompt and decode distributions
- Test allocation under cancellation and preemption
- Validate copy-on-write for branching sequences
- Track physical and logical block ownership
- Stress block exhaustion and reclamation
What to measure
- Allocated blocks and free blocks
- Internal slack in tail blocks
- Allocation and eviction latency
- Shared-block reference counts
- Requests admitted per GPU
From one GPU to a production service
A prototype allocator only needs allocate and free. A production block manager needs reservation, ownership, reference counting, copy-on-write, eviction, preemption, swapping, recovery after worker failure, and observability that can explain why admission failed.
Multi-tenant fairness enters through memory. One long-context customer can occupy blocks long after shorter requests arrive. Per-class limits and age-aware admission prevent a large context from either monopolizing HBM or starving forever.
At cluster scale, block events can feed a router, but event delay matters. A router acting on stale ownership may send a request to a worker that already evicted the prefix. Score cache evidence with freshness and always combine it with live load.
Design-review questions
- Are blocks reserved before compute begins?
- What invariant prevents use-after-free across shared prefixes?
- How does preemption choose swap, recompute, or reject?
- What is the maximum stale interval for routing metadata?
- Can one tenant exhaust physical blocks despite request limits?
How it connects to the rest of the series
KV caching supplies the state being paged. Continuous batching consumes the extra concurrency. Prefix reuse shares blocks, while memory offloading adds slower tiers beneath the GPU pool.
From equation to implementation
The block table is effectively a page table for KV state. Logical block i of a sequence maps to a physical block identifier. The attention kernel translates logical positions while loading K and V, so physical blocks may be scattered without changing the model’s causal view.
Allocation must be synchronized with scheduling. Before an iteration, the scheduler determines how many new tokens may be produced and reserves the corresponding blocks. If reservation fails after compute starts, recovery is harder. Mature systems make block admission a precondition for execution.
Implementation sketch
needed = blocks_for(prompt_tokens + max_step_growth)
if free_blocks < needed:
preempt_or_queue(request)
logical = sequence.block_table
for token_block in prompt:
logical.append(pool.allocate())
on_branch:
child.share_prefix(logical, refcount=True)
on_write_shared_tail:
child.copy_on_write()
on_finish:
pool.release(logical)Capacity planning
Expected internal waste is bounded by one partially used tail block per live sequence, but branching, reserved growth, and preemption can add more. Simulate with actual sequence traces to estimate peak physical blocks; average utilization hides synchronized bursts.
Benchmarking without fooling yourself
- Replay churn with arrivals, completions, cancellation, and branching.
- Compare block sizes across throughput, slack, and lookup overhead.
- Force near-exhaustion to test preemption and admission behavior.
- Track useful cached tokens per allocated physical token slot.
A production failure to design for
A cancellation path decrements shared-prefix references twice. The allocator reuses a block while another beam still reads it, creating nondeterministic output corruption. Reference ownership needs invariant tests and generation-stamped block handles.
How to read this diagram: The operating cycle moves from Model to Allocate, then Stress and Tune. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
Deeper engineering guide
PagedAttention separates a sequence’s logical token positions from physical KV storage. Fixed-size physical blocks are allocated on demand, and a block table maps logical block numbers to those locations. The attention kernel follows that table while reading K and V. Sequences can grow without reserving one contiguous maximum-context buffer.
How to read this diagram: Follow the state from Reserve through Map and Attend to Reclaim. Each box is an ownership or computation boundary. In particular, the block table is correctness-critical state on every decode iteration. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.
Block size mediates metadata overhead, fragmentation, kernel efficiency, and prefix granularity. Smaller blocks waste fewer tail tokens and improve partial-prefix reuse but create larger tables and more pointer chasing. Larger blocks make kernels and transfers efficient but increase internal fragmentation. Select size from the joint prompt, output, and cancellation distribution.
How to read this diagram: The bars compare Reserved maximum with Allocated blocks on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “Memory follows active tokens instead of max length.”, remains larger than the risk, “Block tables and gathers add metadata and kernel work.”, under production traffic.
Shared prefixes require immutable blocks or copy-on-write. Two sequences may reference the same full block until one diverges. A partially filled shared block cannot be appended independently without copying. Reference counts, cancellation, beam pruning, and retries must be idempotent or the pool slowly leaks capacity.
How to read this diagram: State advances from Free to Reserved, Active, and finally Reclaimed. The labels below each state identify what becomes true at that boundary. The governing invariant is: A block is reusable only after writes complete and every prior reference is gone. Retries and cancellation must preserve the same transition rules.
How to read this diagram: The four panels are independent review axes: Allocation, Sharing, Placement, and Evidence. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Free-block count alone cannot explain fragmentation or sharing efficiency.
How to read this diagram: This is a causal chain, not four unrelated symptoms. Cancel races triggers Pool shrinks, which creates Admission stalls. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
Primary references
The takeaway
PagedAttention made KV memory elastic enough for a real server. Its core achievement is not an attention approximation; it is disciplined address translation.
