PagedAttention: Virtual Memory for the KV Cache

#pagedattention #vllm #kv-cache #memory-management #llm-serving

LLM requests do not arrive with neat, equal-length memory needs. Their KV caches grow one token at a time, finish unpredictably, and often share prefixes. PagedAttention brought an operating-system answer to that allocator problem: fixed-size logical blocks mapped to physical cache blocks.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A traditional allocator is like reserving an entire hotel floor because a guest might stay for a month. Paging assigns rooms as nights are actually used. The guest sees one continuous stay; the hotel can place rooms wherever capacity exists.

Follow the state and work from left to right.

What actually happens

PagedAttention separates a sequence’s logical KV address space from physical GPU blocks. A block table records the mapping. Attention kernels follow that table, so a sequence does not require one contiguous allocation sized for its maximum possible length.

Fixed-size blocks sharply reduce external fragmentation and make allocation incremental. The final block can still have internal slack. Copy-on-write or reference counting lets beams and requests share full prefix blocks until their token paths diverge.

The kernel and allocator must agree on block layout, token count, head geometry, and device placement. Paging is not simply a Python dictionary around a normal attention kernel; the lookup and gather behavior are part of the execution path.

A worked example

Consider three sequences that need 300, 700, and 1,100 tokens of cache, with 128-token blocks. They consume 3, 6, and 9 blocks: capacity for 384, 768, and 1,152 tokens. Waste is confined mostly to the three tail blocks instead of large per-request reservations.

The performance model

Paged allocation increases the number of sequences that fit in HBM, which allows larger effective batches and higher throughput. Its largest gain is often indirect: memory utilization unlocks scheduler concurrency. Block-table indirection and small-block metadata are the cost side.

Expert lens

Block size couples allocator efficiency to kernel efficiency. Prefix sharing also depends on block boundaries: only complete matching blocks are trivially shareable, though some engines support partial reuse. Benchmark fragmentation, lookup overhead, and reuse together.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

Mixed-length online traffic
Beam search and shared prefixes
High concurrency under tight HBM budgets

Where it disappoints

Equating PagedAttention with FlashAttention
Choosing tiny blocks without metadata analysis
Ignoring reference counts during cancellation
Measuring utilization without scheduler throughput

Production checklist

Size blocks against prompt and decode distributions
Test allocation under cancellation and preemption
Validate copy-on-write for branching sequences
Track physical and logical block ownership
Stress block exhaustion and reclamation

What to measure

Allocated blocks and free blocks
Internal slack in tail blocks
Allocation and eviction latency
Shared-block reference counts
Requests admitted per GPU

From one GPU to a production service

A prototype allocator only needs allocate and free. A production block manager needs reservation, ownership, reference counting, copy-on-write, eviction, preemption, swapping, recovery after worker failure, and observability that can explain why admission failed.

Multi-tenant fairness enters through memory. One long-context customer can occupy blocks long after shorter requests arrive. Per-class limits and age-aware admission prevent a large context from either monopolizing HBM or starving forever.

At cluster scale, block events can feed a router, but event delay matters. A router acting on stale ownership may send a request to a worker that already evicted the prefix. Score cache evidence with freshness and always combine it with live load.

Design-review questions

Are blocks reserved before compute begins?
What invariant prevents use-after-free across shared prefixes?
How does preemption choose swap, recompute, or reject?
What is the maximum stale interval for routing metadata?
Can one tenant exhaust physical blocks despite request limits?

How it connects to the rest of the series

KV caching supplies the state being paged. Continuous batching consumes the extra concurrency. Prefix reuse shares blocks, while memory offloading adds slower tiers beneath the GPU pool.

From equation to implementation

The block table is effectively a page table for KV state. Logical block i of a sequence maps to a physical block identifier. The attention kernel translates logical positions while loading K and V, so physical blocks may be scattered without changing the model’s causal view.

Allocation must be synchronized with scheduling. Before an iteration, the scheduler determines how many new tokens may be produced and reserves the corresponding blocks. If reservation fails after compute starts, recovery is harder. Mature systems make block admission a precondition for execution.

Implementation sketch

needed = blocks_for(prompt_tokens + max_step_growth)
if free_blocks < needed:
    preempt_or_queue(request)
logical = sequence.block_table
for token_block in prompt:
    logical.append(pool.allocate())
on_branch:
    child.share_prefix(logical, refcount=True)
on_write_shared_tail:
    child.copy_on_write()
on_finish:
    pool.release(logical)

Capacity planning

Expected internal waste is bounded by one partially used tail block per live sequence, but branching, reserved growth, and preemption can add more. Simulate with actual sequence traces to estimate peak physical blocks; average utilization hides synchronized bursts.

Benchmarking without fooling yourself

Replay churn with arrivals, completions, cancellation, and branching.
Compare block sizes across throughput, slack, and lookup overhead.
Force near-exhaustion to test preemption and admission behavior.
Track useful cached tokens per allocated physical token slot.

A production failure to design for

A cancellation path decrements shared-prefix references twice. The allocator reuses a block while another beam still reads it, creating nondeterministic output corruption. Reference ownership needs invariant tests and generation-stamped block handles.

Treat optimization as a measured loop, not a one-time flag.

Primary references

The takeaway

PagedAttention made KV memory elastic enough for a real server. Its core achievement is not an attention approximation; it is disciplined address translation.

FlashAttention: Why Moving Fewer Bytes Beats Doing Fewer FLOPs Batch Inference: When Throughput Matters More Than Immediacy