KV Caching: The Memory That Makes Token Generation Possible

#kv-cache #llm-inference #memory #prefix-caching #gpu

Autoregressive generation has an expensive habit: every new token needs attention over everything that came before it. Recomputing the same keys and values on every step would turn a useful language model into a very patient space heater. KV caching is the memory trick that avoids that repetition.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Imagine reading a long contract and writing notes for every paragraph. When asked a follow-up question, you consult the notes instead of rereading the contract. The keys and values are those notes. They are not generated text, and they are not a response cache; they are intermediate attention state.

Follow the state and work from left to right.

What actually happens

During prefill, each transformer layer projects every prompt token into query, key, and value vectors. The keys and values are retained. During decode, the new token produces one new query, key, and value; attention reads the cached keys and values for earlier positions and appends the new pair.

For a simplified multi-head attention model, KV memory grows roughly with 2 x layers x cached tokens x KV heads x head dimension x bytes per element. Grouped-query and multi-query attention reduce KV heads, which is why two models with similar parameter counts can have very different concurrency limits.

Cross-request prefix reuse adds another layer. A serving engine can hash full cache blocks, index them by token prefix, and let a later request reuse matching blocks. Correctness requires the same model revision, tokenizer, adapters, positional settings, multimodal inputs, and tenant security scope.

A worked example

Suppose a 32-layer model uses 8 KV heads, head dimension 128, and BF16 values. One cached token consumes about 2 x 32 x 8 x 128 x 2 bytes, or 128 KiB. A 4,000-token conversation therefore needs roughly 500 MiB of KV state before allocator metadata and fragmentation. Ten such conversations can consume about 5 GiB even though the model weights never changed.

The performance model

KV caching trades compute for memory. Decode becomes much cheaper computationally, but available HBM now limits how many sequences can remain active. Prefix reuse can reduce time to first token because matched prompt blocks skip prefill work; it does not make output-token decode free.

Expert lens

Block size is a real tuning knob. Large blocks reduce allocator and kernel overhead but waste more tail space and make partial prefix matches less likely. Small blocks improve reuse granularity but expand metadata, lookup work, and transfer operations. Measure with your prompt-length distribution rather than selecting a block size by folklore.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

Interactive chat with long histories
Repeated system prompts and shared document prefixes
High output-token generation where recomputation dominates

Where it disappoints

Confusing KV cache with a Redis response cache
Reusing blocks across tenants without a cache salt or namespace
Ignoring adapter, tokenizer, or model-version identity
Reporting hit rate without tokens saved or TTFT impact

Production checklist

Estimate bytes per cached token for the exact architecture
Set explicit HBM, host-cache, and eviction budgets
Namespace reusable blocks by security and model identity
Test cancellation and block reclamation under churn
Export hit tokens, miss tokens, evictions, and onboarding latency

What to measure

KV occupancy by GPU and model
Reusable-prefix tokens divided by eligible tokens
Time to first token split into cached and uncached requests
Evictions, offloads, reloads, and failed allocations
Active sequences and bytes per sequence

From one GPU to a production service

A single-user demo can let one process own an unbounded dynamic cache. A multi-tenant service cannot. It needs admission based on blocks, per-tenant ceilings, secure prefix namespaces, priority-aware eviction, and a response when a requested context cannot fit. The gateway should pass model and tenant identity explicitly; the engine should never infer security scope from prompt text.

At cluster scale, cache locality becomes a routing signal. A request with a 6,000-token prefix on worker A may be cheaper there even when worker B has a shorter queue. The router must compare saved prefill work with queue and transfer cost. A hit counter alone cannot make that decision.

A practical rollout begins with per-instance reuse, then adds routing based on observed block events, and only then considers host or distributed cache tiers. Each step adds failure modes. Preserve an uncached path so cache infrastructure remains an optimization rather than an availability dependency.

Design-review questions

What exact identity fields make two cache blocks reusable?
How many uncached tokens can each worker admit at peak?
When does routing for locality lose to routing for load?
Which tenants may share a prefix namespace?
How quickly are blocks reclaimed after cancellation or model retirement?

How it connects to the rest of the series

PagedAttention manages these blocks efficiently; memory offloading extends their lifetime; chunked prefill controls how new blocks are produced; and prefill-decode disaggregation determines where the blocks must travel.

From equation to implementation

The cache exists per layer because each layer has its own projected keys and values. For batch size B, cached length S, KV-head count Hkv, head width D, layer count L, and element width E bytes, a useful first estimate is B x S x L x 2 x Hkv x D x E. Add block-table metadata, allocator slack, temporary workspace, and replicas before treating that number as a deployment limit.

Prefill writes many cache positions in parallel. Decode reads the entire relevant history but appends only one position per active sequence. That shift explains why prefill often looks compute-bound while decode becomes memory-bandwidth-bound. MQA and GQA reduce Hkv and therefore KV bytes without reducing query-head count.

Implementation sketch

tokens = tokenize(request)
scope = hash(model_revision, tokenizer, adapter, tenant_salt)
matched_blocks = prefix_index.lookup(scope, tokens)
allocate_blocks(tokens - matched_blocks)
run_prefill_for_misses()
for each generated_token:
    attend_over_cached_kv()
    append_one_kv_position()
release_or_mark_reusable_on_finish()

Capacity planning

Do not reserve every free byte for KV. The engine still needs weights, CUDA graphs, activation workspace, communication buffers, and fragmentation headroom. A conservative admission controller computes required new blocks before accepting the request and rejects or queues it before the allocator reaches an unrecoverable state.

Benchmarking without fooling yourself

Replay the real joint distribution of prompt length, output length, and concurrency.
Separate exact prefix hits, partial block hits, misses, and host-cache reloads.
Compare cached and uncached TTFT at the same queue depth.
Run tenant-isolation and model-rollout tests that intentionally create similar prefixes.

A production failure to design for

A common incident begins after a model rollout: the cache key omits model revision, so old blocks appear reusable. The numerical shapes match, but the values belong to different weights. The response is corrupted without a clean crash. Cache identity must be correctness data, not a best-effort optimization.

Treat optimization as a measured loop, not a one-time flag.

Primary references

The takeaway

KV caching is the foundation beneath almost every fast decoder. Treat it as a capacity-managed, security-scoped memory system, not a boolean optimization.

Production LLM Systems Tutorial 1: End-to-End Application Design Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge