1/20 - KV Caching: The Memory That Makes Token Generation Possible
Autoregressive generation has an expensive habit: every new token needs attention over everything that came before it. Recomputing the same keys and values on every step would turn a useful language model into a very patient space heater. KV caching is the memory trick that avoids that repetition.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
Imagine reading a long contract and writing notes for every paragraph. When asked a follow-up question, you consult the notes instead of rereading the contract. The keys and values are those notes. They are not generated text, and they are not a response cache; they are intermediate attention state.
How to read this diagram: Start with Prompt tokens, where project k and v. The middle stage, Decode one token, read old k and v. The final stage, Next token, shows the observable result: reuse prior state. The arrows describe dependency order, not necessarily separate services.
First, separate three kinds of cache
Teams often say “the cache” while referring to three unrelated mechanisms. A response cache stores the final answer for an exact request and can avoid model execution entirely. A prefix KV cache stores attention state for token blocks shared by later prompts and avoids only the matching part of prefill. A sequence KV cache is the live state of one active generation; without it, every decode step would have to reconstruct the whole history.
This distinction matters during incidents. A Redis response-cache hit can return a byte-for-byte stored answer in milliseconds. A prefix-cache hit still runs the model and can still produce a different sampled answer. A live sequence-cache lookup is not a cross-request hit at all; it is ordinary decoder operation. Dashboards should name these independently instead of reporting one ambiguous “cache hit rate.”
How to read this diagram: Move left to right from the broadest reuse to the most request-local state. A response cache bypasses inference, a prefix KV cache skips matched prompt work, and a sequence KV cache accelerates the active decode loop. Their hit rates, lifetimes, security boundaries, and failure modes must be reported separately.
What actually happens
During prefill, each transformer layer projects every prompt token into query, key, and value vectors. The keys and values are retained. During decode, the new token produces one new query, key, and value; attention reads the cached keys and values for earlier positions and appends the new pair.
For a simplified multi-head attention model, KV memory grows roughly with 2 x layers x cached tokens x KV heads x head dimension x bytes per element. Grouped-query and multi-query attention reduce KV heads, which is why two models with similar parameter counts can have very different concurrency limits.
Cross-request prefix reuse adds another layer. A serving engine can hash full cache blocks, index them by token prefix, and let a later request reuse matching blocks. Correctness requires the same model revision, tokenizer, adapters, positional settings, multimodal inputs, and tenant security scope.
How to read this diagram: The upper row shows prefill producing K and V for many prompt positions in parallel. The lower row shows decode reading those retained positions and appending exactly one new pair. This asymmetry explains both the first-token pause and the memory-bandwidth pressure of long generation.
Why cache keys and values, but not queries?
For one attention head, the current query is compared with keys from every visible position; the resulting scores weight the corresponding values. At decode step t, the model needs one new query for position t, but it needs keys and values from positions 1..t. Earlier queries are no longer needed because their outputs were already used to produce earlier hidden states. Retaining K and V therefore removes repeated projection work while preserving exactly the state required by the next token.
The cache does not store “what the model knows.” It stores layer-specific tensors derived from a particular token sequence under particular weights. You cannot move those tensors to a model with a different revision, tokenizer, rotary-position configuration, or adapter and expect correctness. Even when tensor shapes match, the numerical meaning may not.
A worked example
Suppose a 32-layer model uses 8 KV heads, head dimension 128, and BF16 values. One cached token consumes about 2 x 32 x 8 x 128 x 2 bytes, or 128 KiB. A 4,000-token conversation therefore needs roughly 500 MiB of KV state before allocator metadata and fragmentation. Ten such conversations can consume about 5 GiB even though the model weights never changed.
How to read this diagram: Every factor in the formula multiplies cache size. The three lower panels hold query-head count constant while reducing KV heads from MHA to GQA and MQA. That reduction changes cache capacity directly, so parameter count alone cannot predict concurrent context capacity.
Now add the parts omitted by the clean equation. A block allocator rounds each sequence to block boundaries, so the last block is partly empty. Beam search, parallel samples, or speculative branches may temporarily duplicate state. CUDA graphs and attention kernels need workspace. Tensor parallelism can shard KV heads, while pipeline parallelism places different layers on different devices. The result is that “bytes per token” is an architecture property, but safe concurrency is a deployment property.
For capacity planning, calculate at least three numbers:
- Theoretical KV bytes per token, from layers, KV heads, head width, and dtype.
- Observed allocated bytes per token, including block rounding and runtime metadata.
- Admissible bytes per token, after reserving HBM for weights, graphs, workspaces, communication buffers, and an emergency margin.
Only the third number belongs in an admission controller. Using free HBM from an idle process as the cache budget creates an attractive benchmark and a fragile service.
The performance model
KV caching trades compute for memory. Decode becomes much cheaper computationally, but available HBM now limits how many sequences can remain active. Prefix reuse can reduce time to first token because matched prompt blocks skip prefill work; it does not make output-token decode free.
Latency should be split into queue time, prefix lookup, uncached prefill, and decode. Prefix reuse primarily reduces uncached prefill, so its most visible benefit is lower time to first token (TTFT). Time per output token (TPOT) usually changes little unless reuse also changes placement, batching, or memory pressure. A dashboard claiming that prefix caching accelerated decode is probably combining phases incorrectly.
How to read this diagram: Compare the red uncached-prefill span in the upper timeline with the small amber miss span below. Queue and first-decode work remain. The difference is TTFT saved by matched prompt tokens; it should not be reported as faster output-token decode.
A token-weighted hit rate is more useful than a request hit rate. If nine short prompts each reuse 16 tokens while one long prompt misses 8,000 tokens, the request hit rate is 90 percent but almost all prefill work remains. Track reused_prefix_tokens / eligible_prefix_tokens, then correlate it with prefill milliseconds avoided. Keep exact hits, partial hits, host-tier reloads, and misses separate because they have different costs.
How to read this diagram: The left panel asks how KV caching changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
Expert lens
Block size is a real tuning knob. Large blocks reduce allocator and kernel overhead but waste more tail space and make partial prefix matches less likely. Small blocks improve reuse granularity but expand metadata, lookup work, and transfer operations. Measure with your prompt-length distribution rather than selecting a block size by folklore.
How to read this diagram: The logical table at the top gives a sequence a contiguous view, while the physical pool below can be scattered. Green blocks contain useful state, gray blocks are reusable, and the amber tail is allocated but partially empty. Block size trades tail waste against table and lookup overhead.
Paged allocation solves a problem similar to virtual memory. A sequence owns a logical list of cache blocks, while those blocks may live anywhere in the physical pool. Growth needs one free block rather than one larger contiguous allocation. Cancellation can return blocks immediately. Shared prefixes can point multiple logical sequences at immutable physical blocks with reference counts.
The last partially filled block is where policy becomes subtle. Reusing a partial block may require copy-on-write semantics because the next request can diverge at the following token. Some engines therefore share only full blocks. That simplifies ownership and hashing but makes block size directly affect reuse granularity. A 64-token block can waste up to 63 matching tokens at a divergence boundary; a 16-token block makes more matches available but quadruples block-table pressure.
How to read this diagram: The left panel is the baseline, No KV cache, characterized by repeat all prior projections and compute grows every step. The right panel applies With KV cache, changing the cost profile to compute only the new token and memory grows with sequence. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.
Where it wins
- Interactive chat with long histories
- Repeated system prompts and shared document prefixes
- High output-token generation where recomputation dominates
Where it disappoints
- Confusing KV cache with a Redis response cache
- Reusing blocks across tenants without a cache salt or namespace
- Ignoring adapter, tokenizer, or model-version identity
- Reporting hit rate without tokens saved or TTFT impact
How cross-request prefix matching works
Prefix caching is exact at the token level. The engine tokenizes the incoming request, groups token IDs into blocks, combines each block with the hash of its parent, and looks up the resulting chain under a cache scope. Reuse continues only while every block matches. Similar meaning, equivalent whitespace, or visually identical Unicode does not help when token IDs differ.
This explains why stable prompt construction matters. Reordering tools, injecting a changing timestamp near the beginning, or serializing JSON fields nondeterministically can invalidate thousands of otherwise reusable tokens. Put stable instructions first and volatile user data later. Canonicalize tool schemas. Do not normalize prompts inside the cache layer unless that same normalization is part of the model-visible request contract.
How to read this diagram: Requests A and B share two complete system-prompt blocks, so B can reuse them. Their third blocks differ, and matching stops at that exact token boundary. Semantic similarity is irrelevant: prefix reuse follows token IDs, parent hashes, block alignment, and cache scope.
Hashing is an index, not the security boundary. A collision-resistant digest makes lookup practical, but the preimage must include immutable model identity and an authorization scope. When two tenants are explicitly permitted to share a public system prefix, assign that prefix a deliberate shared namespace. Never infer sharing from equal text after the request has entered a private tenant scope.
Cache identity and tenant isolation
Every reusable block should inherit a scope digest built from at least the model revision, tokenizer and chat-template revision, adapter identity, positional configuration, relevant multimodal inputs, and tenant security salt. The block hash then extends that scope with parent hash and token IDs. Changing any correctness field creates a miss by construction.
How to read this diagram: The four identity inputs are hashed before token-block lookup. Only requests in the same resulting scope may attempt prefix reuse. A different model, tokenizer, adapter, modality, or tenant salt must force a miss, even when the visible prompt text is identical.
Isolation must happen before lookup. Fetching a block globally and checking ownership afterward can leak the existence, timing, or contents of another tenant’s prompt. It also invites reference-count bugs in which a private block remains reachable after its owner is deleted. Namespace indexes, metrics, eviction accounting, and administrative inspection consistently.
Encryption at rest does not solve cross-tenant reuse. GPU HBM is normally plaintext while kernels operate, and a logically authorized pointer is enough to expose state. The strongest control is not to create the pointer across isolation domains. Scrub or overwrite memory when required by the threat model, but do not use scrubbing as a substitute for correct ownership metadata.
Production checklist
- Estimate bytes per cached token for the exact architecture
- Set explicit HBM, host-cache, and eviction budgets
- Namespace reusable blocks by security and model identity
- Test cancellation and block reclamation under churn
- Export hit tokens, miss tokens, evictions, and onboarding latency
What to measure
- KV occupancy by GPU and model
- Reusable-prefix tokens divided by eligible tokens
- Time to first token split into cached and uncached requests
- Evictions, offloads, reloads, and failed allocations
- Active sequences and bytes per sequence
From one GPU to a production service
A single-user demo can let one process own an unbounded dynamic cache. A multi-tenant service cannot. It needs admission based on blocks, per-tenant ceilings, secure prefix namespaces, priority-aware eviction, and a response when a requested context cannot fit. The gateway should pass model and tenant identity explicitly; the engine should never infer security scope from prompt text.
At cluster scale, cache locality becomes a routing signal. A request with a 6,000-token prefix on worker A may be cheaper there even when worker B has a shorter queue. The router must compare saved prefill work with queue and transfer cost. A hit counter alone cannot make that decision.
How to read this diagram: Worker A owns most of the prefix but has a long queue; worker B has no hit but can finish sooner. The route score combines queue delay, missing prefill, transfer, and decode estimates. Cache affinity is one cost term, not an unconditional routing rule.
A practical routing score estimates completion time rather than maximizing matched tokens. One simple form is queue_delay + miss_prefill_time + transfer_time + expected_decode_time. Worker A may own a large prefix yet still lose when its queue is long. Worker B may be uncached but idle. The estimates need confidence bounds because stale cache advertisements can otherwise herd traffic toward a worker that has already evicted the blocks.
Cache metadata also has a cardinality cost. Broadcasting every block hash to every router is rarely sustainable. Production systems use summaries, hierarchical directories, ownership events, or a two-stage route in which the gateway selects a pool and the pool-local scheduler chooses a worker. Expire advertisements earlier than the underlying blocks or version them so a stale positive lookup degrades to a miss rather than a failed request.
A practical rollout begins with per-instance reuse, then adds routing based on observed block events, and only then considers host or distributed cache tiers. Each step adds failure modes. Preserve an uncached path so cache infrastructure remains an optimization rather than an availability dependency.
Cache tiers and offload economics
HBM is the fastest and most expensive tier. Host DRAM can retain more prefixes but introduces PCIe or NVLink transfer latency. Local NVMe and remote object storage increase capacity further while usually serving checkpoint-like or long-lived state rather than latency-critical decode. A tier is valuable only when reload is cheaper than recomputation and likely to finish before the request deadline.
Use a cost model per block: estimated prefill time saved minus lookup and transfer time, adjusted by reuse probability and eviction pressure. Blind least-recently-used eviction treats a 16-token block and a 4,096-token shared prefix as equally valuable. Size-aware and recompute-cost-aware policies are more aligned with inference goodput, although they need safeguards so one huge prefix does not monopolize the cache.
Cancellation, retries, and reference counts
Cancellation is a first-class allocator path. A disconnected client may leave prefill kernels in flight, shared block references, speculative branches, and reserved output capacity. Cleanup must be idempotent because timeouts and transport errors can race. Decrement each owned reference once, publish reusable blocks only after their contents are complete, and keep partially written blocks unreachable.
Retries should carry a stable request identity but must not assume the original worker still owns the cache. A retry can probe locality, then fall back to uncached execution. Availability is better when cache loss increases cost instead of changing correctness or causing a hard failure.
Design-review questions
- What exact identity fields make two cache blocks reusable?
- How many uncached tokens can each worker admit at peak?
- When does routing for locality lose to routing for load?
- Which tenants may share a prefix namespace?
- How quickly are blocks reclaimed after cancellation or model retirement?
How it connects to the rest of the series
PagedAttention manages these blocks efficiently; memory offloading extends their lifetime; chunked prefill controls how new blocks are produced; and prefill-decode disaggregation determines where the blocks must travel.
From equation to implementation
The cache exists per layer because each layer has its own projected keys and values. For batch size B, cached length S, KV-head count Hkv, head width D, layer count L, and element width E bytes, a useful first estimate is B x S x L x 2 x Hkv x D x E. Add block-table metadata, allocator slack, temporary workspace, and replicas before treating that number as a deployment limit.
Prefill writes many cache positions in parallel. Decode reads the entire relevant history but appends only one position per active sequence. That shift explains why prefill often looks compute-bound while decode becomes memory-bandwidth-bound. MQA and GQA reduce Hkv and therefore KV bytes without reducing query-head count.
Implementation sketch
tokens = tokenize(request)
scope = hash(model_revision, tokenizer, adapter, tenant_salt)
matched_blocks = prefix_index.lookup(scope, tokens)
allocate_blocks(tokens - matched_blocks)
run_prefill_for_misses()
for each generated_token:
attend_over_cached_kv()
append_one_kv_position()
release_or_mark_reusable_on_finish()Capacity planning
Do not reserve every free byte for KV. The engine still needs weights, CUDA graphs, activation workspace, communication buffers, and fragmentation headroom. A conservative admission controller computes required new blocks before accepting the request and rejects or queues it before the allocator reaches an unrecoverable state.
Admission should account for both prompt and output reservations. If a request arrives with 12,000 prompt tokens and max_tokens=4,000, accepting only the prompt footprint can deadlock decode after useful work has already been performed. Reserve a bounded output allowance, reclaim it as the request finishes, and use separate policies for hard maximums and statistically expected output lengths.
For a GPU with M_total bytes, define explicit budgets for weights, runtime workspace, safety headroom, and KV. Divide the KV budget by observed allocated bytes per token to obtain the token-block envelope, then test that envelope under fragmentation and cancellation churn. Convert it into user-facing limits through the real prompt/output distribution rather than advertising one optimistic context-length multiplied by an arbitrary concurrency number.
Benchmarking without fooling yourself
- Replay the real joint distribution of prompt length, output length, and concurrency.
- Separate exact prefix hits, partial block hits, misses, and host-cache reloads.
- Compare cached and uncached TTFT at the same queue depth.
- Run tenant-isolation and model-rollout tests that intentionally create similar prefixes.
A production failure to design for
A common incident begins after a model rollout: the cache key omits model revision, so old blocks appear reusable. The numerical shapes match, but the values belong to different weights. The response is corrupted without a clean crash. Cache identity must be correctness data, not a best-effort optimization.
Another failure is an eviction storm. A traffic burst fills HBM, evicts hot prefixes to host memory, and forces the next requests to reload them. Transfers consume bandwidth needed by active decode, latency rises, retries add more traffic, and the cache repeatedly evicts the same blocks. Protect the service with admission limits, hysteresis between high and low watermarks, bounded offload concurrency, and an uncached fallback that does not depend on the overloaded cache tier.
How to read this diagram: The operating cycle moves from Size to Protect, then Observe and Tune. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
Further visual reading
Primary references
The takeaway
KV caching is the foundation beneath almost every fast decoder. Treat it as a capacity-managed, security-scoped memory system, not a boolean optimization.
