KV Cache at Fleet Scale: The Memory System Hiding Inside Every LLM Platform

The KV cache is the memory system inside every serious LLM serving platform.

Model weights get the headlines. GPU FLOPS get the charts. But once you serve real traffic with long prompts, multi-turn sessions, tools, retrieval, and agents, the KV cache becomes the thing you operate every minute.

It is not just “GPU memory.” It is reusable computation state. It decides how many conversations fit, which requests route where, how long prefill takes, whether long-context workloads fall over, and why a node with “free compute” can still reject traffic because memory is gone.

Why the KV cache exists

During autoregressive generation, each new token attends to previous tokens. Recomputing all previous keys and values on every step would be wasteful, so inference engines store key/value tensors from previous tokens.

The cache grows with:

  • batch size
  • sequence length
  • number of layers
  • number of KV heads
  • head dimension
  • precision used for KV tensors
  • active concurrent requests

A simplified formula:

KV bytes per token =
  2                         # key + value
  x layers
  x kv_heads
  x head_dim
  x bytes_per_element

Multiply that by active tokens across the fleet. Now add fragmentation, temporary buffers, prefix sharing, evictions, and offload. Congratulations: you are running a distributed memory manager with a language model attached.
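A quick back-of-the-envelope in Python makes the scale concrete. The model shape below is illustrative (roughly an 8B GQA model with an FP16 KV cache); plug in your own architecture's numbers.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    # key + value tensors stored per layer for every token held in context
    return 2 * layers * kv_heads * head_dim * bytes_per_element

# Illustrative shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 KV cache
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)                          # 131072 bytes = 128 KiB per token

# Fleet view: what matters is active tokens held in memory, not request count
active_tokens = 200 * 8_000               # 200 concurrent sequences x 8k context
print(active_tokens * per_token / 2**30)  # ~195 GiB of KV cache

With those illustrative numbers, two hundred medium-length conversations hold far more KV state than an 8B model's FP16 weights, which is why capacity planning has to budget active KV cache and safety headroom explicitly, not just weights.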

[Diagram: KV cache growth under concurrency. Model weights and workspace are a fixed slice of GPU HBM; the KV cache grows with active sequences and context length (a short chat holds a small active context, long RAG holds many input tokens, an agent session holds sticky state across repeated turns). Weights are fixed; KV cache grows with traffic, so capacity planning is less about one model and more about active tokens held in memory: capacity = weights + workspace + active KV cache + safety headroom.]
KV cache pressure can make a GPU look underutilized while still being unable to admit more work.

PagedAttention changed the mental model

The vLLM PagedAttention paper borrowed an operating-systems idea: manage the KV cache in fixed-size blocks instead of reserving a large contiguous region per sequence. This reduces fragmentation and makes sharing easier. The authors reported large throughput improvements over earlier serving systems on their evaluated workloads, largely because far less memory is wasted.

The idea is now table stakes. The system does not need to allocate one giant slab per request. It allocates blocks, maps them to sequences, and can share blocks for common prefixes.
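A toy allocator shows the shape of the idea. This is a sketch, not vLLM's implementation: the block size, content-hash sharing, and reference counting are simplified, and real engines also handle copy-on-write when a shared block diverges.

from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class ToyBlockManager:
    """Block-managed KV cache in miniature: each sequence holds a list of
    block ids, and blocks with identical token content are shared via refcounts."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = defaultdict(int)
        self.by_content = {}                   # token tuple -> block id
        self.content_of = {}                   # block id -> token tuple
        self.block_table = defaultdict(list)   # sequence id -> [block ids]

    def append_block(self, seq_id: str, tokens: tuple) -> int:
        block = self.by_content.get(tokens)
        if block is None:                      # cold prefix: claim a free block
            if not self.free:
                raise MemoryError("no free KV blocks: evict, offload, or queue")
            block = self.free.pop()
            self.by_content[tokens] = block
            self.content_of[block] = tokens
        self.refcount[block] += 1              # warm prefix: share the block
        self.block_table[seq_id].append(block)
        return block

    def release(self, seq_id: str) -> None:
        for block in self.block_table.pop(seq_id, []):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:      # last reader gone: block is free again
                del self.by_content[self.content_of.pop(block)]
                self.free.append(block)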

That gives us a better vocabulary:

  • block allocation
  • block table
  • prefix sharing
  • copy-on-write
  • eviction
  • offload
  • cache hit confidence
  • cache-aware routing

Once the cache is block-managed, the routing layer can ask a smarter question: “Which worker already has the most useful prefix blocks, and is it healthy enough to serve this request?”

The fleet-level problem

Inside one worker, KV cache management is a memory allocator problem. Across a fleet, it becomes a distributed systems problem.

The platform needs to know (a minimal metadata sketch follows this list):

  • which model and tokenizer produced the cached blocks
  • which worker owns the blocks
  • whether blocks are still present
  • whether tenant boundaries allow reuse
  • whether moving blocks costs more than recomputing
  • which blocks deserve priority
  • when stale cache state should be ignored
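Concretely, that knowledge has to live somewhere as metadata the router can query. A minimal record might look like the sketch below; the field names are illustrative and do not match any particular project's schema.

from dataclasses import dataclass
import time

@dataclass
class PrefixCacheRecord:
    """What the control plane must check before trusting a remote cache hit."""
    prefix_hash: str      # hash over token ids + model + tokenizer + template version
    model_id: str
    tokenizer_id: str
    tenant_id: str        # reuse is only legal inside the tenant boundary
    worker_id: str        # which worker currently owns the blocks
    num_tokens: int       # prefill work a hit would avoid
    last_seen_s: float    # for staleness checks

    def usable_for(self, tenant_id: str, model_id: str,
                   tokenizer_id: str, max_age_s: float = 30.0) -> bool:
        fresh = (time.time() - self.last_seen_s) <= max_age_s
        return (fresh
                and self.tenant_id == tenant_id
                and self.model_id == model_id
                and self.tokenizer_id == tokenizer_id)

When a record fails these checks, the safe answer is to treat the request as a cache miss and recompute rather than gamble on stale state.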

NVIDIA Dynamo is interesting here because it treats KV cache as control-plane state. Its docs describe KV-aware routing and KVBM-style block management across serving workers. TensorRT-LLM exposes KV cache features such as paged KV cache, reuse, offload, and priority-based retention. LMCache focuses directly on KV cache reuse, storage, and transfer for LLM serving. SGLang’s RadixAttention organizes reusable prefixes so structured generation programs can share state efficiently.

Different projects use different names, but the architectural direction is aligned: KV cache is not a hidden engine detail anymore.

[Diagram: Fleet-level KV cache management. A router scores workers using cache metadata, queue depth, memory pressure, and tenant scope: Worker A has a warm prefix and a low queue, Worker B a warm prefix but a high queue, Worker C a cold prefix but free memory. routing score = cache value - queue penalty - memory pressure - tenant risk. The best worker is not always the emptiest; it is the one with enough cache value and enough SLO headroom.]
Cache locality is useful only if the worker can still meet the request's latency target.
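The scoring in the diagram above can be made concrete. Here is a hedged sketch of a cache-aware routing score; the penalty weights are invented for illustration and would need calibration against measured prefill cost and SLO headroom on a real fleet.

def routing_score(cached_prefix_tokens: int,
                  queue_depth: int,
                  kv_memory_used_frac: float,
                  tenant_mismatch: bool,
                  queue_penalty_per_req: float = 500.0,
                  memory_penalty_tokens: float = 20_000.0) -> float:
    # score = cache value - queue penalty - memory pressure - tenant risk
    if tenant_mismatch:
        return float("-inf")                          # never cross a tenant boundary
    score = float(cached_prefix_tokens)               # prefill tokens a hit would avoid
    score -= queue_penalty_per_req * queue_depth      # queued work burns SLO headroom
    score -= memory_penalty_tokens * max(0.0, kv_memory_used_frac - 0.8)
    return score

# Worker B: warm 6k-token prefix but 14 requests queued. Worker C: cold but idle.
print(routing_score(6_000, queue_depth=14, kv_memory_used_frac=0.7, tenant_mismatch=False))  # -1000.0
print(routing_score(0, queue_depth=0, kv_memory_used_frac=0.3, tenant_mismatch=False))       # 0.0

With these illustrative weights the cold but idle worker wins, which is exactly the point above: locality only helps if the worker can still meet the latency target.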

Eviction is product policy

LRU is fine until the product knows better than the cache.

For LLM systems, not all cached tokens are equally valuable:

  • system prompts are usually high value
  • tenant policy packs are high value
  • stable tool schemas are high value
  • one-off pasted logs are low value
  • failed tool branches are low value
  • private cross-tenant state is not shareable

TensorRT-LLM’s priority-based retention is a useful direction because applications can tell the runtime which token ranges matter more. That is the right idea: the agent or gateway often knows more than the memory allocator.

Practical eviction signals (combined in the scoring sketch after this list):

  • prefix reuse count
  • prefix length
  • tenant scope
  • freshness class
  • expected next-turn probability
  • request priority
  • memory pressure
  • recomputation cost
  • transfer cost
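One way to combine those signals is a single retention score per cached prefix, with eviction taking the lowest scores first. This is a sketch with made-up weights, not TensorRT-LLM's policy; the point is that product-level signals flow into the number.

def retention_score(reuse_count: int,
                    prefix_tokens: int,
                    is_durable: bool,            # system prompt, policy pack, tool schema
                    recompute_cost_ms: float,
                    last_hit_age_s: float) -> float:
    """Lower score = evict first. Weights are illustrative, not a tuned policy."""
    score = recompute_cost_ms                    # what a future miss would cost
    score += 50.0 * reuse_count                  # demonstrated reuse
    score += 0.01 * prefix_tokens                # long prefixes are expensive to redo
    if is_durable:
        score *= 4.0                             # the product says this prefix matters
    return score / (1.0 + last_hit_age_s / 300.0)  # value decays as the prefix goes cold

# Under memory pressure, evict the lowest-scoring candidates first.
candidates = {
    "pasted_log_dump": retention_score(1, 3_000, False, 400.0, 900.0),     # ~120
    "tenant_system_prompt": retention_score(40, 800, True, 120.0, 10.0),   # ~8200
}
print(sorted(candidates, key=candidates.get))    # evicts the pasted log first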

Offload and transfer are not free wins

Offloading KV cache to CPU memory, disk, or a remote cache can increase effective capacity. It also adds latency and bandwidth pressure. Moving a huge prefix may be cheaper than recomputing it, but not always.

Ask:

  • How many bytes move?
  • How fast is the interconnect?
  • Is the request latency-sensitive?
  • Is the prefix likely to be reused again?
  • Will offload traffic interfere with model serving?
  • Does tenant isolation allow this movement?

The right system measures transfer cost and recompute cost. If moving cache is slower than recomputing, do not move it just because the diagram looked elegant.
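The check is cheap to make explicit. The sketch below compares the two costs with illustrative rates; the only honest inputs are ones you measure on your own hardware and interconnect.

def should_transfer(prefix_tokens: int,
                    kv_bytes_per_token: int,
                    link_bytes_per_s: float,
                    prefill_tokens_per_s: float,
                    reuse_probability: float = 1.0) -> bool:
    """Move cached KV only if the expected transfer time beats recomputation.
    All rates are illustrative placeholders: measure, do not assume."""
    transfer_s = (prefix_tokens * kv_bytes_per_token) / link_bytes_per_s
    recompute_s = prefix_tokens / prefill_tokens_per_s
    # Low reuse probability shrinks the expected payoff of moving the blocks.
    return (transfer_s / max(reuse_probability, 1e-6)) < recompute_s

# 32k-token prefix at 128 KiB/token over a 25 GB/s effective link,
# versus a 10k tokens/s prefill rate: ~0.17 s to move vs ~3.2 s to recompute.
print(should_transfer(32_000, 131_072, 25e9, 10_000))   # True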

Metrics that matter

At fleet scale, monitor:

  • KV cache capacity by worker
  • active tokens by worker
  • block allocation failures
  • fragmentation / wasted blocks
  • prefix cache hit rate
  • hit value in tokens avoided
  • evictions by reason
  • offload bytes and latency
  • cache transfer success/failure
  • stale cache metadata rate
  • routing decisions by cache score
  • requests denied by memory pressure
  • TTFT with and without cache hit

Do not stop at “GPU utilization.” A GPU can be underutilized because it is memory bound, queue constrained, or holding too much active KV state.
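For the hit-value metric in particular, instrument tokens avoided, not just hits. A small sketch using prometheus_client (an assumption about the monitoring stack; the metric names are invented):

from prometheus_client import Counter

# A 16-token hit and a 16k-token hit are not the same win: record both.
PREFIX_HITS = Counter("kv_prefix_hits_total",
                      "Prefix cache hits", ["worker"])
TOKENS_AVOIDED = Counter("kv_prefill_tokens_avoided_total",
                         "Prefill tokens skipped thanks to KV cache hits", ["worker"])

def record_hit(worker: str, cached_prefix_tokens: int) -> None:
    PREFIX_HITS.labels(worker=worker).inc()
    TOKENS_AVOIDED.labels(worker=worker).inc(cached_prefix_tokens)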

The architecture I trust

A mature LLM serving platform should treat KV cache as a first-class resource:

  1. Block-managed cache inside workers.
  2. Prefix hashing that includes model, tokenizer, tenant, and prompt-template version (sketched after this list).
  3. Router visibility into cache overlap and queue state.
  4. Conservative fallbacks when cache metadata is stale.
  5. Priority-aware eviction for durable prefixes.
  6. Offload only when transfer cost beats recompute cost.
  7. Metrics that report tokens of work avoided, not just hit count.
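For item 2, the cache key has to change whenever anything that would change the computed KV tensors changes. A minimal sketch, with an illustrative encoding:

import hashlib

def prefix_cache_key(token_ids: list[int],
                     model_id: str,
                     tokenizer_hash: str,
                     tenant_id: str,
                     template_version: str) -> str:
    """A key that cannot collide across models, tokenizers, tenants, or
    prompt-template revisions. The exact encoding here is illustrative."""
    h = hashlib.sha256()
    for part in (model_id, tokenizer_hash, tenant_id, template_version):
        h.update(part.encode())
        h.update(b"\x00")                        # unambiguous field separator
    h.update(b" ".join(str(t).encode() for t in token_ids))
    return h.hexdigest()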

The future of inference routing is not round-robin. It is cost-aware placement over memory state.

Sources worth reading