KV Cache at Fleet Scale: The Memory System Hiding Inside Every LLM Platform
The KV cache is the memory system inside every serious LLM serving platform.
Model weights get the headlines. GPU FLOPS get the charts. But once you serve real traffic with long prompts, multi-turn sessions, tools, retrieval, and agents, the KV cache becomes the thing you operate every minute.
It is not just “GPU memory.” It is reusable computation state. It decides how many conversations fit, which requests route where, how long prefill takes, whether long-context workloads fall over, and why a node with “free compute” can still reject traffic because memory is gone.
Why the KV cache exists
During autoregressive generation, each new token attends to previous tokens. Recomputing all previous keys and values on every step would be wasteful, so inference engines store key/value tensors from previous tokens.
The cache grows with:
- batch size
- sequence length
- number of layers
- number of KV heads
- head dimension
- precision used for KV tensors
- active concurrent requests
A simplified formula:
KV bytes per token =
2 # key + value
x layers
x kv_heads
x head_dim
x bytes_per_element

Multiply that by active tokens across the fleet. Now add fragmentation, temporary buffers, prefix sharing, evictions, and offload. Congratulations: you are running a distributed memory manager with a language model attached.
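The formula above fits in a few lines of Python. The model shape below is a minimal sketch with illustrative numbers (a Llama-3-8B-like configuration in fp16), not figures from any specific deployment:

```python
# Back-of-envelope KV cache sizing. The model shape numbers are
# illustrative (Llama-3-8B-like, fp16), not from a real deployment.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    # The leading 2 accounts for keys and values.
    return 2 * layers * kv_heads * head_dim * bytes_per_element

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128,
                               bytes_per_element=2)  # fp16
print(per_token)  # 131072 bytes = 128 KiB per token

# 64 concurrent requests averaging 8k active tokens each:
active_tokens = 64 * 8192
print(per_token * active_tokens / 2**30)  # 64.0 GiB of KV cache
```

Even at these modest shapes, a single node's worth of concurrency consumes tens of gigabytes before fragmentation and buffers are counted.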
PagedAttention changed the mental model
The vLLM PagedAttention paper borrowed an operating-systems idea: manage KV cache in blocks instead of reserving large contiguous regions per sequence. This reduces fragmentation and makes sharing easier. The authors reported large throughput improvements over earlier serving systems on the workloads they evaluated, largely because far less memory is wasted.
The idea is now table stakes. The system does not need to allocate one giant slab per request. It allocates blocks, maps them to sequences, and can share blocks for common prefixes.
That gives us a better vocabulary:
- block allocation
- block table
- prefix sharing
- copy-on-write
- eviction
- offload
- cache hit confidence
- cache-aware routing
Once the cache is block-managed, the routing layer can ask a smarter question: “Which worker already has the most useful prefix blocks, and is it healthy enough to serve this request?”
The fleet-level problem
Inside one worker, KV cache management is a memory allocator problem. Across a fleet, it becomes a distributed systems problem.
The platform needs to know:
- which model and tokenizer produced the cached blocks
- which worker owns the blocks
- whether blocks are still present
- whether tenant boundaries allow reuse
- whether moving blocks costs more than recomputing
- which blocks deserve priority
- when stale cache state should be ignored
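That checklist maps naturally onto a per-prefix metadata record in the control plane. The shape below is a hypothetical sketch; the field names and staleness threshold are assumptions, not any project's schema:

```python
# Hypothetical control-plane record for a cached prefix. Field names
# and the staleness window are illustrative assumptions.

from dataclasses import dataclass

@dataclass(frozen=True)
class PrefixCacheEntry:
    prefix_hash: str    # hash covering model + tokenizer + tenant + tokens
    worker_id: str      # which worker owns the blocks
    num_tokens: int     # how much prefill a hit would skip
    tenant: str         # reuse is only legal inside tenant boundaries
    last_seen_s: float  # staleness signal

def usable(entry: PrefixCacheEntry, tenant: str, now_s: float,
           max_staleness_s: float = 30.0) -> bool:
    """Conservative reuse check: wrong tenant or stale metadata means no."""
    return (entry.tenant == tenant
            and (now_s - entry.last_seen_s) <= max_staleness_s)

e = PrefixCacheEntry("abc123", "worker-7", 4096, "tenant-a", last_seen_s=100.0)
print(usable(e, "tenant-a", now_s=110.0))  # True
print(usable(e, "tenant-b", now_s=110.0))  # False: tenant boundary
print(usable(e, "tenant-a", now_s=200.0))  # False: metadata too stale
```

The conservative default matters: a false "the blocks are there" is worse than a recompute, so stale entries should lose, not win.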
NVIDIA Dynamo is interesting here because it treats KV cache as control-plane state. Its docs describe KV-aware routing and KVBM-style block management across serving workers. TensorRT-LLM exposes KV cache features such as paged KV cache, reuse, offload, and priority-based retention. LMCache focuses directly on KV cache reuse, storage, and transfer for LLM serving. SGLang’s RadixAttention organizes reusable prefixes so structured generation programs can share state efficiently.
Different projects use different names, but the architectural direction is aligned: KV cache is not a hidden engine detail anymore.
Eviction is product policy
LRU is fine until the product knows better than the cache.
For LLM systems, not all cached tokens are equally valuable:
- system prompts are usually high value
- tenant policy packs are high value
- stable tool schemas are high value
- one-off pasted logs are low value
- failed tool branches are low value
- private cross-tenant state is not shareable
TensorRT-LLM’s priority-based eviction is useful because applications can tell the runtime which token ranges matter more. That is the right idea. The agent or gateway often knows more than the memory allocator.
Practical eviction signals:
- prefix reuse count
- prefix length
- tenant scope
- freshness class
- expected next-turn probability
- request priority
- memory pressure
- recomputation cost
- transfer cost
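Those signals can be folded into an eviction score. The weighting below is a sketch under made-up assumptions; real systems tune these terms against measured recompute and transfer costs:

```python
# Sketch of a priority-aware eviction score combining the signals above.
# The formula and weights are illustrative, not from any real system.

def eviction_score(reuse_count: int, prefix_tokens: int, priority: int,
                   seconds_since_use: float) -> float:
    """Lower score = evict first. Hot, long, high-priority prefixes survive."""
    recompute_cost = prefix_tokens  # proxy: tokens of prefill to redo
    heat = reuse_count / (1.0 + seconds_since_use)
    return heat * recompute_cost * (1 + priority)

candidates = {
    "system_prompt": eviction_score(reuse_count=500, prefix_tokens=800,
                                    priority=2, seconds_since_use=5),
    "pasted_log":    eviction_score(reuse_count=1, prefix_tokens=6000,
                                    priority=0, seconds_since_use=120),
    "tool_schema":   eviction_score(reuse_count=50, prefix_tokens=300,
                                    priority=1, seconds_since_use=30),
}
victim = min(candidates, key=candidates.get)
print(victim)  # pasted_log: evicted first despite being the largest entry
```

Note that pure LRU would have kept the giant pasted log and evicted the tool schema; size and recency alone pick the wrong victim.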
Offload and transfer are not free wins
Offloading KV cache to CPU memory, disk, or a remote cache can increase effective capacity. It also adds latency and bandwidth pressure. Moving a huge prefix may be cheaper than recomputing it, but not always.
Ask:
- How many bytes move?
- How fast is the interconnect?
- Is the request latency-sensitive?
- Is the prefix likely to be reused again?
- Will offload traffic interfere with model serving?
- Does tenant isolation allow this movement?
The right system measures transfer cost and recompute cost. If moving cache is slower than recomputing, do not move it just because the diagram looked elegant.
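The transfer-versus-recompute comparison is simple arithmetic once both sides are measured. All numbers below are illustrative assumptions (128 KiB of KV per token as in the earlier sizing example, a hypothetical prefill rate), not measurements:

```python
# Should we move cached KV blocks or just recompute the prefill?
# All rates and sizes here are illustrative assumptions.

def transfer_time_s(bytes_to_move: int, link_gbps: float) -> float:
    return bytes_to_move * 8 / (link_gbps * 1e9)

def recompute_time_s(prefix_tokens: int, prefill_tokens_per_s: float) -> float:
    return prefix_tokens / prefill_tokens_per_s

# A 32k-token prefix at 128 KiB/token is 4 GiB of KV state.
move = transfer_time_s(32_768 * 131_072, link_gbps=100.0)     # ~0.34 s at 100 Gbps
redo = recompute_time_s(32_768, prefill_tokens_per_s=10_000)  # ~3.3 s of prefill
print("transfer" if move < redo else "recompute")  # transfer wins here

# The same prefix over a congested 10 Gbps link flips the answer:
move_slow = transfer_time_s(32_768 * 131_072, link_gbps=10.0)  # ~3.4 s
print("transfer" if move_slow < redo else "recompute")  # recompute
```

The point is that neither answer is universal: the same prefix is worth moving on one interconnect and worth recomputing on another.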
Metrics that matter
At fleet scale, monitor:
- KV cache capacity by worker
- active tokens by worker
- block allocation failures
- fragmentation / wasted blocks
- prefix cache hit rate
- hit value in tokens avoided
- evictions by reason
- offload bytes and latency
- cache transfer success/failure
- stale cache metadata rate
- routing decisions by cache score
- requests denied by memory pressure
- TTFT with and without cache hit
Do not stop at “GPU utilization.” A GPU can be underutilized because it is memory bound, queue constrained, or holding too much active KV state.
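The difference between hit count and hit value is easy to demonstrate. The numbers below are made up for illustration:

```python
# Hit count vs hit value: two workers with identical hit counts can save
# very different amounts of prefill work. Numbers are illustrative.

hits_a = [4096, 4096, 4096, 4096]  # worker A: long shared system prompts
hits_b = [64, 64, 64, 64]          # worker B: tiny shared prefixes

def tokens_avoided(reused_prefix_lengths: list) -> int:
    """Sum of prefill tokens skipped across cache hits."""
    return sum(reused_prefix_lengths)

print(tokens_avoided(hits_a), tokens_avoided(hits_b))  # 16384 256
# Same "4 hits" on a dashboard; a 64x difference in work avoided.
```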
The architecture I trust
A mature LLM serving platform should treat KV cache as a first-class resource:
- Block-managed cache inside workers.
- Prefix hashing that includes model, tokenizer, tenant, and prompt-template version.
- Router visibility into cache overlap and queue state.
- Conservative fallbacks when cache metadata is stale.
- Priority-aware eviction for durable prefixes.
- Offload only when transfer cost beats recompute cost.
- Metrics that report tokens of work avoided, not just hit count.
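The prefix-hashing point deserves a concrete sketch, because it is where silent correctness bugs hide. The key must change when anything that invalidates the KV state changes. The hashing scheme below is an assumption for illustration, not a specific project's format:

```python
# Sketch of a prefix cache key covering everything that changes KV
# validity. The exact scheme is an assumption, not any project's format.

import hashlib

def prefix_key(model: str, tokenizer_rev: str, tenant: str,
               template_version: str, token_ids: list) -> str:
    h = hashlib.sha256()
    for part in (model, tokenizer_rev, tenant, template_version):
        h.update(part.encode())
        h.update(b"\x00")  # delimit fields so "ab"+"c" != "a"+"bc"
    h.update(repr(token_ids).encode())
    return h.hexdigest()[:16]

k1 = prefix_key("llama-3-8b", "tok-v2", "tenant-a", "tmpl-7", [1, 2, 3])
k2 = prefix_key("llama-3-8b", "tok-v3", "tenant-a", "tmpl-7", [1, 2, 3])
print(k1 != k2)  # True: a tokenizer upgrade invalidates every cached prefix
```

Hashing only the token ids is the classic mistake: a tokenizer or prompt-template rollout then serves stale KV state that looks like a cache hit.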
The future of inference routing is not round-robin. It is cost-aware placement over memory state.
Sources worth reading
- vLLM PagedAttention paper and Automatic Prefix Caching docs.
- TensorRT-LLM KV Cache System and NVIDIA KV reuse optimization notes.
- NVIDIA Dynamo KV-aware routing and KVBM-related docs.
- SGLang RadixAttention for prefix-tree reuse.
- LMCache documentation for KV cache storage, reuse, and transfer patterns.
- Hugging Face KV cache strategies for cache implementations and tradeoffs.
