KV Cache at Fleet Scale: The Memory System Hiding Inside Every LLM Platform

The KV cache is the memory system inside every serious LLM serving platform.

Model weights get the headlines. GPU FLOPS get the charts. But once you serve real traffic with long prompts, multi-turn sessions, tools, retrieval, and agents, the KV cache becomes the thing you operate every minute.

It is not just “GPU memory.” It is reusable computation state. It decides how many conversations fit, which requests route where, how long prefill takes, whether long-context workloads fall over, and why a node with “free compute” can still reject traffic because memory is gone.

Why the KV cache exists

During autoregressive generation, each new token attends to previous tokens. Recomputing all previous keys and values on every step would be wasteful, so inference engines store key/value tensors from previous tokens.

The cache grows with:

  • batch size
  • sequence length
  • number of layers
  • number of KV heads
  • head dimension
  • precision used for KV tensors
  • active concurrent requests

A simplified formula:

KV bytes per token =
  2                         # key + value
  x layers
  x kv_heads
  x head_dim
  x bytes_per_element

Multiply that by active tokens across the fleet. Now add fragmentation, temporary buffers, prefix sharing, evictions, and offload. Congratulations: you are running a distributed memory manager with a language model attached.
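A quick back-of-the-envelope in Python makes the scale concrete. The model shape below is illustrative (roughly an 8B GQA model with an FP16 KV cache); plug in your own architecture's numbers.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    # key + value tensors stored per layer for every token held in context
    return 2 * layers * kv_heads * head_dim * bytes_per_element

# Illustrative shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 KV cache
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)                          # 131072 bytes = 128 KiB per token

# Fleet view: what matters is active tokens held in memory, not request count
active_tokens = 200 * 8_000               # 200 concurrent sequences x 8k context
print(active_tokens * per_token / 2**30)  # ~195 GiB of KV cache

With those illustrative numbers, two hundred medium-length conversations hold far more KV state than an 8B model's FP16 weights, which is why capacity planning has to budget active KV cache and safety headroom explicitly, not just weights.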

[Diagram: KV cache growth under concurrency. Model weights and workspace are a fixed slice of GPU HBM; the KV cache grows with active sequences and context length (a short chat holds a small active context, long RAG holds many input tokens, an agent session holds sticky state across repeated turns). Weights are fixed; KV cache grows with traffic, so capacity planning is less about one model and more about active tokens held in memory: capacity = weights + workspace + active KV cache + safety headroom.]
KV cache pressure can make a GPU look underutilized while still being unable to admit more work.

PagedAttention changed the mental model

The vLLM PagedAttention paper borrowed an operating-systems idea: manage the KV cache in fixed-size blocks instead of reserving a large contiguous region per sequence. This reduces fragmentation and makes sharing easier. The authors reported large throughput improvements over earlier serving systems on their evaluated workloads, largely because far less memory is wasted.

The idea is now table stakes. The system does not need to allocate one giant slab per request. It allocates blocks, maps them to sequences, and can share blocks for common prefixes.
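A toy allocator shows the shape of the idea. This is a sketch, not vLLM's implementation: the block size, content-hash sharing, and reference counting are simplified, and real engines also handle copy-on-write when a shared block diverges.

from collections import defaultdict

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class ToyBlockManager:
    """Block-managed KV cache in miniature: each sequence holds a list of
    block ids, and blocks with identical token content are shared via refcounts."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = defaultdict(int)
        self.by_content = {}                   # token tuple -> block id
        self.content_of = {}                   # block id -> token tuple
        self.block_table = defaultdict(list)   # sequence id -> [block ids]

    def append_block(self, seq_id: str, tokens: tuple) -> int:
        block = self.by_content.get(tokens)
        if block is None:                      # cold prefix: claim a free block
            if not self.free:
                raise MemoryError("no free KV blocks: evict, offload, or queue")
            block = self.free.pop()
            self.by_content[tokens] = block
            self.content_of[block] = tokens
        self.refcount[block] += 1              # warm prefix: share the block
        self.block_table[seq_id].append(block)
        return block

    def release(self, seq_id: str) -> None:
        for block in self.block_table.pop(seq_id, []):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:      # last reader gone: block is free again
                del self.by_content[self.content_of.pop(block)]
                self.free.append(block)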

That gives us a better vocabulary:

  • block allocation
  • block table
  • prefix sharing
  • copy-on-write
  • eviction
  • offload
  • cache hit confidence
  • cache-aware routing

Once the cache is block-managed, the routing layer can ask a smarter question: “Which worker already has the most useful prefix blocks, and is it healthy enough to serve this request?”

The fleet-level problem

Inside one worker, KV cache management is a memory allocator problem. Across a fleet, it becomes a distributed systems problem.

The platform needs to know (a minimal metadata sketch follows this list):

  • which model and tokenizer produced the cached blocks
  • which worker owns the blocks
  • whether blocks are still present
  • whether tenant boundaries allow reuse
  • whether moving blocks costs more than recomputing
  • which blocks deserve priority
  • when stale cache state should be ignored
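Concretely, that knowledge has to live somewhere as metadata the router can query. A minimal record might look like the sketch below; the field names are illustrative and do not match any particular project's schema.

from dataclasses import dataclass
import time

@dataclass
class PrefixCacheRecord:
    """What the control plane must check before trusting a remote cache hit."""
    prefix_hash: str      # hash over token ids + model + tokenizer + template version
    model_id: str
    tokenizer_id: str
    tenant_id: str        # reuse is only legal inside the tenant boundary
    worker_id: str        # which worker currently owns the blocks
    num_tokens: int       # prefill work a hit would avoid
    last_seen_s: float    # for staleness checks

    def usable_for(self, tenant_id: str, model_id: str,
                   tokenizer_id: str, max_age_s: float = 30.0) -> bool:
        fresh = (time.time() - self.last_seen_s) <= max_age_s
        return (fresh
                and self.tenant_id == tenant_id
                and self.model_id == model_id
                and self.tokenizer_id == tokenizer_id)

When a record fails these checks, the safe answer is to treat the request as a cache miss and recompute rather than gamble on stale state.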

NVIDIA Dynamo is interesting here because it treats KV cache as control-plane state. Its docs describe KV-aware routing and KVBM-style block management across serving workers. TensorRT-LLM exposes KV cache features such as paged KV cache, reuse, offload, and priority-based retention. LMCache focuses directly on KV cache reuse, storage, and transfer for LLM serving. SGLang’s RadixAttention organizes reusable prefixes so structured generation programs can share state efficiently.

Different projects use different names, but the architectural direction is aligned: KV cache is not a hidden engine detail anymore.

[Diagram: Fleet-level KV cache management. A router scores workers using cache metadata, queue depth, memory pressure, and tenant scope: Worker A has a warm prefix and a low queue, Worker B a warm prefix but a high queue, Worker C a cold prefix but free memory. routing score = cache value - queue penalty - memory pressure - tenant risk. The best worker is not always the emptiest; it is the one with enough cache value and enough SLO headroom.]
Cache locality is useful only if the worker can still meet the request's latency target.
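The scoring in the diagram above can be made concrete. Here is a hedged sketch of a cache-aware routing score; the penalty weights are invented for illustration and would need calibration against measured prefill cost and SLO headroom on a real fleet.

def routing_score(cached_prefix_tokens: int,
                  queue_depth: int,
                  kv_memory_used_frac: float,
                  tenant_mismatch: bool,
                  queue_penalty_per_req: float = 500.0,
                  memory_penalty_tokens: float = 20_000.0) -> float:
    # score = cache value - queue penalty - memory pressure - tenant risk
    if tenant_mismatch:
        return float("-inf")                          # never cross a tenant boundary
    score = float(cached_prefix_tokens)               # prefill tokens a hit would avoid
    score -= queue_penalty_per_req * queue_depth      # queued work burns SLO headroom
    score -= memory_penalty_tokens * max(0.0, kv_memory_used_frac - 0.8)
    return score

# Worker B: warm 6k-token prefix but 14 requests queued. Worker C: cold but idle.
print(routing_score(6_000, queue_depth=14, kv_memory_used_frac=0.7, tenant_mismatch=False))  # -1000.0
print(routing_score(0, queue_depth=0, kv_memory_used_frac=0.3, tenant_mismatch=False))       # 0.0

With these illustrative weights the cold but idle worker wins, which is exactly the point above: locality only helps if the worker can still meet the latency target.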

Eviction is product policy

LRU is fine until the product knows better than the cache.

For LLM systems, not all cached tokens are equally valuable:

  • system prompts are usually high value
  • tenant policy packs are high value
  • stable tool schemas are high value
  • one-off pasted logs are low value
  • failed tool branches are low value
  • private cross-tenant state is not shareable

TensorRT-LLM’s priority-based retention is a useful direction because applications can tell the runtime which token ranges matter more. That is the right idea: the agent or gateway often knows more than the memory allocator.

Practical eviction signals (combined in the scoring sketch after this list):

  • prefix reuse count
  • prefix length
  • tenant scope
  • freshness class
  • expected next-turn probability
  • request priority
  • memory pressure
  • recomputation cost
  • transfer cost
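One way to combine those signals is a single retention score per cached prefix, with eviction taking the lowest scores first. This is a sketch with made-up weights, not TensorRT-LLM's policy; the point is that product-level signals flow into the number.

def retention_score(reuse_count: int,
                    prefix_tokens: int,
                    is_durable: bool,            # system prompt, policy pack, tool schema
                    recompute_cost_ms: float,
                    last_hit_age_s: float) -> float:
    """Lower score = evict first. Weights are illustrative, not a tuned policy."""
    score = recompute_cost_ms                    # what a future miss would cost
    score += 50.0 * reuse_count                  # demonstrated reuse
    score += 0.01 * prefix_tokens                # long prefixes are expensive to redo
    if is_durable:
        score *= 4.0                             # the product says this prefix matters
    return score / (1.0 + last_hit_age_s / 300.0)  # value decays as the prefix goes cold

# Under memory pressure, evict the lowest-scoring candidates first.
candidates = {
    "pasted_log_dump": retention_score(1, 3_000, False, 400.0, 900.0),     # ~120
    "tenant_system_prompt": retention_score(40, 800, True, 120.0, 10.0),   # ~8200
}
print(sorted(candidates, key=candidates.get))    # evicts the pasted log first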

Offload and transfer are not free wins

Offloading KV cache to CPU memory, disk, or a remote cache can increase effective capacity. It also adds latency and bandwidth pressure. Moving a huge prefix may be cheaper than recomputing it, but not always.

Ask:

  • How many bytes move?
  • How fast is the interconnect?
  • Is the request latency-sensitive?
  • Is the prefix likely to be reused again?
  • Will offload traffic interfere with model serving?
  • Does tenant isolation allow this movement?

The right system measures transfer cost and recompute cost. If moving cache is slower than recomputing, do not move it just because the diagram looked elegant.
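The check is cheap to make explicit. The sketch below compares the two costs with illustrative rates; the only honest inputs are ones you measure on your own hardware and interconnect.

def should_transfer(prefix_tokens: int,
                    kv_bytes_per_token: int,
                    link_bytes_per_s: float,
                    prefill_tokens_per_s: float,
                    reuse_probability: float = 1.0) -> bool:
    """Move cached KV only if the expected transfer time beats recomputation.
    All rates are illustrative placeholders: measure, do not assume."""
    transfer_s = (prefix_tokens * kv_bytes_per_token) / link_bytes_per_s
    recompute_s = prefix_tokens / prefill_tokens_per_s
    # Low reuse probability shrinks the expected payoff of moving the blocks.
    return (transfer_s / max(reuse_probability, 1e-6)) < recompute_s

# 32k-token prefix at 128 KiB/token over a 25 GB/s effective link,
# versus a 10k tokens/s prefill rate: ~0.17 s to move vs ~3.2 s to recompute.
print(should_transfer(32_000, 131_072, 25e9, 10_000))   # True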

Metrics that matter

At fleet scale, monitor:

  • KV cache capacity by worker
  • active tokens by worker
  • block allocation failures
  • fragmentation / wasted blocks
  • prefix cache hit rate
  • hit value in tokens avoided
  • evictions by reason
  • offload bytes and latency
  • cache transfer success/failure
  • stale cache metadata rate
  • routing decisions by cache score
  • requests denied by memory pressure
  • TTFT with and without cache hit

Do not stop at “GPU utilization.” A GPU can be underutilized because it is memory bound, queue constrained, or holding too much active KV state.
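For the hit-value metric in particular, instrument tokens avoided, not just hits. A small sketch using prometheus_client (an assumption about the monitoring stack; the metric names are invented):

from prometheus_client import Counter

# A 16-token hit and a 16k-token hit are not the same win: record both.
PREFIX_HITS = Counter("kv_prefix_hits_total",
                      "Prefix cache hits", ["worker"])
TOKENS_AVOIDED = Counter("kv_prefill_tokens_avoided_total",
                         "Prefill tokens skipped thanks to KV cache hits", ["worker"])

def record_hit(worker: str, cached_prefix_tokens: int) -> None:
    PREFIX_HITS.labels(worker=worker).inc()
    TOKENS_AVOIDED.labels(worker=worker).inc(cached_prefix_tokens)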

The architecture I trust

A mature LLM serving platform should treat KV cache as a first-class resource:

  1. Block-managed cache inside workers.
  2. Prefix hashing that includes model, tokenizer, tenant, and prompt-template version (sketched after this list).
  3. Router visibility into cache overlap and queue state.
  4. Conservative fallbacks when cache metadata is stale.
  5. Priority-aware eviction for durable prefixes.
  6. Offload only when transfer cost beats recompute cost.
  7. Metrics that report tokens of work avoided, not just hit count.
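For item 2, the cache key has to change whenever anything that would change the computed KV tensors changes. A minimal sketch, with an illustrative encoding:

import hashlib

def prefix_cache_key(token_ids: list[int],
                     model_id: str,
                     tokenizer_hash: str,
                     tenant_id: str,
                     template_version: str) -> str:
    """A key that cannot collide across models, tokenizers, tenants, or
    prompt-template revisions. The exact encoding here is illustrative."""
    h = hashlib.sha256()
    for part in (model_id, tokenizer_hash, tenant_id, template_version):
        h.update(part.encode())
        h.update(b"\x00")                        # unambiguous field separator
    h.update(b" ".join(str(t).encode() for t in token_ids))
    return h.hexdigest()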

The future of inference routing is not round-robin. It is cost-aware placement over memory state.

Sources worth reading