KV-Aware Routing: How Cache Locality Changes Load Balancing for LLMs

Round-robin is a beautiful algorithm. Simple, predictable, easy to debug. It is also the wrong default for serious LLM inference.

The reason is cache locality.

In normal HTTP, two identical requests can usually go to different backends and nobody cries. In LLM serving, two related requests may share a giant prefix: system prompt, tool schema, conversation history, retrieved documents, repository context. If the second request lands on a backend with warm KV cache, it can skip expensive recomputation. If it lands somewhere else, congratulations, you bought the same prefix twice.

That is not load balancing. That is load laundering.

What KV-aware routing means

KV-aware routing sends a request to a backend based partly on whether that backend already has useful KV cache for the prompt prefix.

The router considers:

  • model identity
  • tokenizer compatibility
  • prompt prefix hash
  • tenant or session affinity
  • backend queue depth
  • GPU memory pressure
  • cache occupancy
  • expected input/output tokens
  • SLO class
[Figure: KV-aware routing. The router weighs prefix and load: Backend A holds a cold prefix, Backend B a warm one whose KV blocks can be reused, saving work. Round-robin asks: whose turn? KV-aware asks: who already paid the prefill bill? The best backend is not always the emptiest.]
Cache locality turns routing from queue balancing into cost avoidance.
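
One way to picture those inputs is as a single record per routing decision. A minimal sketch in Python; the field names are illustrative, not any particular router's schema:

from dataclasses import dataclass

@dataclass
class RouteSignals:
    # Identity: prefix hashes are only comparable within the same
    # model + tokenizer pair.
    model_id: str
    tokenizer_id: str
    prefix_hash: str            # hash of the tokenized prompt prefix
    tenant_id: str              # affinity and isolation boundary
    # Per-backend state the router samples.
    queue_depth: int
    gpu_mem_free_frac: float    # 0.0 = full, 1.0 = empty
    cached_prefix_tokens: int   # longest cached prefix match, in tokens
    # Request shape and priority.
    expected_input_tokens: int
    expected_output_tokens: int
    slo_class: str              # e.g. "interactive" vs "batch"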

The key is that cache hit value must be balanced against load. A backend with a perfect cache hit but a terrible queue may still be the wrong choice. A backend with no cache but lots of headroom may win for a tiny prompt. Routing becomes a scoring problem.

Why prefixes matter so much

Many modern applications repeat large prefixes:

  • customer support system prompts
  • tool schemas
  • agent instructions
  • coding repository context
  • RAG documents
  • legal or financial policy packs
  • multi-turn conversation history

The repeated prefix can be thousands of tokens. Without cache reuse, every turn pays prefill again. With reuse, the next request only processes the new suffix. This is why prefix caching in vLLM, RadixAttention in SGLang, and KV reuse controls in TensorRT-LLM are not small runtime details. They are economics features.
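
The arithmetic is worth making explicit. A toy calculation, with numbers invented purely for illustration:

# One turn in a long-running chat; all numbers are made up.
shared_prefix_tokens = 8_000   # system prompt + tools + history
new_suffix_tokens = 200        # the user's new message

cold_prefill = shared_prefix_tokens + new_suffix_tokens  # cache miss: 8,200
warm_prefill = new_suffix_tokens                         # cache hit: 200

print(f"prefill work saved: {1 - warm_prefill / cold_prefill:.1%}")
# -> prefill work saved: 97.6%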

SGLang’s RadixAttention is especially intuitive: it organizes reusable prefixes so complex multi-call programs can share KV cache across branches. TensorRT-LLM’s KV cache event API gives upstream systems visibility into cache state. Dynamo’s KV-aware router and KV block manager move the idea from one engine into a distributed serving stack.

This is the slightly boring, very important future: the gateway and runtime need to share cache state.

What a router should score

A practical scoring function can start simple:

score =
  cache_hit_value
  - queue_penalty
  - memory_pressure_penalty
  - topology_penalty
  - SLO_risk

Cache hit value is not binary. A 500-token shared prefix and a 20,000-token shared prefix are not equal. A cache hit on a system prompt might be useful across thousands of users. A cache hit on a one-off pasted log file might be worthless ten seconds later.
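
Making that concrete, here is a sketch of a scorer where cache value scales with prefix length and expected reuse. The weights, the reuse estimate, and the 10% free-memory threshold are placeholder assumptions, not tuned values:

from collections import namedtuple

Backend = namedtuple("Backend",
                     "name cached_prefix_tokens queue_depth gpu_mem_free_frac")

def cache_hit_value(cached_prefix_tokens, expected_reuse):
    # Not binary: value grows with the prefill the hit avoids (prefix
    # length) and with how likely the prefix is to be requested again.
    return cached_prefix_tokens * expected_reuse

def route_score(b, expected_reuse, w_queue=500.0, w_mem=20_000.0):
    # Hypothetical weights, denominated in "tokens of prefill avoided".
    score = cache_hit_value(b.cached_prefix_tokens, expected_reuse)
    score -= w_queue * b.queue_depth                       # queue penalty
    score -= w_mem * max(0.0, 0.10 - b.gpu_mem_free_frac)  # memory pressure
    return score

backends = [
    Backend("warm", cached_prefix_tokens=20_000, queue_depth=40, gpu_mem_free_frac=0.05),
    Backend("cold", cached_prefix_tokens=0,      queue_depth=2,  gpu_mem_free_frac=0.40),
]
print(max(backends, key=lambda b: route_score(b, expected_reuse=0.8)).name)
# -> "cold": a perfect cache hit loses to a long queue and full memory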

[Figure: KV-aware routing score. Cache hit value (prefix size × reuse) is weighed against queue, memory, topology, and SLO penalties; the router picks the highest safe score. Cache is useful, but not absolute.]
Cache affinity is a signal. It should not become a religion.

Session affinity is the blunt instrument

Sticky sessions help because multi-turn chat often reuses recent context. They are also blunt. If a backend becomes overloaded, sticky routing can make one GPU sad while its neighbors enjoy a spa day.

Better options:

  • consistent hashing on a stable prefix (sketched after this list)
  • cache-aware routing with load thresholds
  • session affinity with escape hatches
  • centralized or eventually consistent cache state
  • shared KV cache layers when supported
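
For a flavor of the first option, here is a minimal sketch of consistent hashing on a stable prefix, using rendezvous (highest-random-weight) hashing. All names are hypothetical; the point is that the key bakes in model, tokenizer, and tenant, so hashes are never compared across incompatible caches:

import hashlib

def prefix_key(model_id, tokenizer_id, tenant_id, prefix_token_ids):
    # Hash the stable part of the prompt, scoped to model/tokenizer/tenant.
    h = hashlib.sha256()
    h.update(f"{model_id}|{tokenizer_id}|{tenant_id}|".encode())
    h.update(b",".join(str(t).encode() for t in prefix_token_ids))
    return h.hexdigest()

def pick_backend(key, backends):
    # Rendezvous hashing: the same key always maps to the same backend,
    # as long as that backend is in the candidate set.
    return max(backends,
               key=lambda b: hashlib.sha256(f"{key}|{b}".encode()).digest())

key = prefix_key("model-x", "tok-v1", "tenant-42", [101, 2023, 2003])
print(pick_backend(key, ["gpu-0", "gpu-1", "gpu-2"]))

Rendezvous hashing is attractive here because removing a backend only remaps the keys that lived on it; warm caches on the surviving backends stay warm.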

The SGLang Model Gateway documentation recommends cache-aware routing by default and discusses session affinity for cases where cache efficiency matters. The Kubernetes Gateway API Inference Extension describes request scheduling that is KV-cache- and request-cost-aware. TensorRT-LLM's KV cache event API exists precisely so upstream applications can understand cache state.

The direction is clear: load balancers are becoming cache managers.

Implementation guardrails

The first version does not need to be fancy. It needs to be hard to fool:

  • Hash prefixes only after tokenizer and model identity are known.
  • Keep cache state freshness visible; stale maps should reduce confidence automatically.
  • Put tenant boundaries into the cache key unless sharing is explicitly allowed.
  • Cap the cache-affinity bonus so one hot prefix cannot overload a worker forever.
  • Record why a route was chosen, including cache score and queue penalty.
  • Fall back to load-based routing when cache confidence is low.

The last bullet matters most. KV-aware routing should improve the default path, not become a single point of creative failure.
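
A sketch of those guardrails in code, with invented constants: confidence decays as the cache map ages, the affinity bonus is capped, the decision is logged, and low confidence degrades to plain load-based routing:

import logging
from collections import namedtuple

log = logging.getLogger("router")

Backend = namedtuple("Backend",
                     "name cached_prefix_tokens queue_depth cache_map_age_s")

MAX_CACHE_BONUS = 16_000   # hypothetical cap, in "prefill tokens avoided"
HALF_LIFE_S = 0.5          # made-up confidence half-life for the cache map

def cache_confidence(map_age_s):
    # A stale cache map should automatically count for less.
    return 0.5 ** (map_age_s / HALF_LIFE_S)

def choose(backends):
    best, best_score = None, float("-inf")
    for b in backends:
        conf = cache_confidence(b.cache_map_age_s)
        # Cap the affinity bonus so one hot prefix cannot pin a worker forever.
        bonus = min(b.cached_prefix_tokens * conf, MAX_CACHE_BONUS)
        if conf < 0.25:
            bonus = 0.0  # low confidence: fall back to pure load-based routing
        score = bonus - 500.0 * b.queue_depth
        # Record why this route was (or was not) chosen.
        log.info("backend=%s cache_bonus=%.0f queue=%d score=%.0f",
                 b.name, bonus, b.queue_depth, score)
        if score > best_score:
            best, best_score = b, score
    return best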

Failure modes

KV-aware routing can go wrong in fun ways:

Stale cache map. The router thinks a backend has blocks that were evicted 200 milliseconds ago.

Hot prefix overload. One popular prefix attracts too much traffic to a small set of workers.

Memory pressure inversion. A cache hit lands on a backend that must evict more valuable blocks to accept the request.

Tokenizer mismatch. Prefix hashes are only meaningful for the same tokenizer and model variant.

Privacy boundaries. Shared cache must respect tenant isolation. A cache hit is not worth a compliance incident.

This is why the implementation needs conservative fallbacks. If cache state is uncertain, route by cost and load. If tenant boundaries are unclear, do not share. If the score is close, pick the backend with healthier SLO margin.

The NVIDIA stack

NVIDIA’s advantage here is that TensorRT-LLM and Dynamo expose the right hooks. TensorRT-LLM gives fine-grained KV cache behavior inside the engine. Dynamo adds distributed routing, KV transfer, and block management across a fleet. On NVLink-connected systems, the hardware fabric makes cache movement and multi-GPU coordination much more attractive than it would be on a weak interconnect.

That does not make vLLM or SGLang second-class. Both are excellent and both have serious cache stories. The bigger point is that routing is becoming a first-class part of inference. NVIDIA is leaning into that whole-stack view aggressively, and I think that is the correct bet.

Closing

The old load balancer wanted to spread requests evenly. The inference router wants to spend computation wisely.

Once prompts are long, conversations are multi-turn, and agents reuse tool schemas all day, cache locality becomes money. Round-robin does not know that. A KV-aware router does.

The future gateway will ask a better question before routing:

Where can this request be served with the least new work while staying inside SLO?

That is the question. Everything else is just moving packets with confidence.

Sources and receipts