KV-Aware Routing: How Cache Locality Changes Load Balancing for LLMs

Round-robin is a beautiful algorithm. Simple, predictable, easy to debug. It is also the wrong default for serious LLM inference.

The reason is cache locality.

In normal HTTP, two identical requests can usually go to different backends and nobody cries. In LLM serving, two related requests may share a giant prefix: system prompt, tool schema, conversation history, retrieved documents, repository context. If the second request lands on a backend with warm KV cache, it can skip expensive recomputation. If it lands somewhere else, congratulations, you bought the same prefix twice.

That is not load balancing. That is load laundering.

What KV-aware routing means

KV-aware routing sends a request to a backend based partly on whether that backend already has useful KV cache for the prompt prefix.

The router considers:

  • model identity
  • tokenizer compatibility
  • prompt prefix hash
  • tenant or session affinity
  • backend queue depth
  • GPU memory pressure
  • cache occupancy
  • expected input/output tokens
  • SLO class
[Figure: KV-aware routing. The router weighs prefix and load: Backend A holds a cold prefix, Backend B a warm one whose KV blocks can be reused, saving work. Round-robin asks: whose turn? KV-aware asks: who already paid the prefill bill? The best backend is not always the emptiest.]
Cache locality turns routing from queue balancing into cost avoidance.
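
One way to picture those inputs is as a single record per routing decision. A minimal sketch in Python; the field names are illustrative, not any particular router's schema:

from dataclasses import dataclass

@dataclass
class RouteSignals:
    # Identity: prefix hashes are only comparable within the same
    # model + tokenizer pair.
    model_id: str
    tokenizer_id: str
    prefix_hash: str            # hash of the tokenized prompt prefix
    tenant_id: str              # affinity and isolation boundary
    # Per-backend state the router samples.
    queue_depth: int
    gpu_mem_free_frac: float    # 0.0 = full, 1.0 = empty
    cached_prefix_tokens: int   # longest cached prefix match, in tokens
    # Request shape and priority.
    expected_input_tokens: int
    expected_output_tokens: int
    slo_class: str              # e.g. "interactive" vs "batch"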

The key is that cache hit value must be balanced against load. A backend with a perfect cache hit but a terrible queue may still be the wrong choice. A backend with no cache but lots of headroom may win for a tiny prompt. Routing becomes a scoring problem.

Why prefixes matter so much

Many modern applications repeat large prefixes:

  • customer support system prompts
  • tool schemas
  • agent instructions
  • coding repository context
  • RAG documents
  • legal or financial policy packs
  • multi-turn conversation history

The repeated prefix can be thousands of tokens. Without cache reuse, every turn pays prefill again. With reuse, the next request only processes the new suffix. This is why prefix caching in vLLM, RadixAttention in SGLang, and KV reuse controls in TensorRT-LLM are not small runtime details. They are economics features.
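
The arithmetic is worth making explicit. A toy calculation, with numbers invented purely for illustration:

# One turn in a long-running chat; all numbers are made up.
shared_prefix_tokens = 8_000   # system prompt + tools + history
new_suffix_tokens = 200        # the user's new message

cold_prefill = shared_prefix_tokens + new_suffix_tokens  # cache miss: 8,200
warm_prefill = new_suffix_tokens                         # cache hit: 200

print(f"prefill work saved: {1 - warm_prefill / cold_prefill:.1%}")
# -> prefill work saved: 97.6%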

SGLang’s RadixAttention is especially intuitive: it organizes reusable prefixes so complex multi-call programs can share KV cache across branches. TensorRT-LLM’s KV cache event API gives upstream systems visibility into cache state. Dynamo’s KV-aware router and KV block manager move the idea from one engine into a distributed serving stack.

This is the slightly boring, very important future: the gateway and runtime need to share cache state.

What a router should score

A practical scoring function can start simple:

score =
  cache_hit_value
  - queue_penalty
  - memory_pressure_penalty
  - topology_penalty
  - SLO_risk

Cache hit value is not binary. A 500-token shared prefix and a 20,000-token shared prefix are not equal. A cache hit on a system prompt might be useful across thousands of users. A cache hit on a one-off pasted log file might be worthless ten seconds later.
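
Making that concrete, here is a sketch of a scorer where cache value scales with prefix length and expected reuse. The weights, the reuse estimate, and the 10% free-memory threshold are placeholder assumptions, not tuned values:

from collections import namedtuple

Backend = namedtuple("Backend",
                     "name cached_prefix_tokens queue_depth gpu_mem_free_frac")

def cache_hit_value(cached_prefix_tokens, expected_reuse):
    # Not binary: value grows with the prefill the hit avoids (prefix
    # length) and with how likely the prefix is to be requested again.
    return cached_prefix_tokens * expected_reuse

def route_score(b, expected_reuse, w_queue=500.0, w_mem=20_000.0):
    # Hypothetical weights, denominated in "tokens of prefill avoided".
    score = cache_hit_value(b.cached_prefix_tokens, expected_reuse)
    score -= w_queue * b.queue_depth                       # queue penalty
    score -= w_mem * max(0.0, 0.10 - b.gpu_mem_free_frac)  # memory pressure
    return score

backends = [
    Backend("warm", cached_prefix_tokens=20_000, queue_depth=40, gpu_mem_free_frac=0.05),
    Backend("cold", cached_prefix_tokens=0,      queue_depth=2,  gpu_mem_free_frac=0.40),
]
print(max(backends, key=lambda b: route_score(b, expected_reuse=0.8)).name)
# -> "cold": a perfect cache hit loses to a long queue and full memory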

[Figure: KV-aware routing score. Cache hit value (prefix size × reuse) is weighed against queue, memory, topology, and SLO penalties; the router picks the highest safe score. Cache is useful, but not absolute.]
Cache affinity is a signal. It should not become a religion.

Session affinity is the blunt instrument

Sticky sessions help because multi-turn chat often reuses recent context. They are also blunt. If a backend becomes overloaded, sticky routing can make one GPU sad while its neighbors enjoy a spa day.

Better options:

  • consistent hashing on a stable prefix (sketched after this list)
  • cache-aware routing with load thresholds
  • session affinity with escape hatches
  • centralized or eventually consistent cache state
  • shared KV cache layers when supported
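
For a flavor of the first option, here is a minimal sketch of consistent hashing on a stable prefix, using rendezvous (highest-random-weight) hashing. All names are hypothetical; the point is that the key bakes in model, tokenizer, and tenant, so hashes are never compared across incompatible caches:

import hashlib

def prefix_key(model_id, tokenizer_id, tenant_id, prefix_token_ids):
    # Hash the stable part of the prompt, scoped to model/tokenizer/tenant.
    h = hashlib.sha256()
    h.update(f"{model_id}|{tokenizer_id}|{tenant_id}|".encode())
    h.update(b",".join(str(t).encode() for t in prefix_token_ids))
    return h.hexdigest()

def pick_backend(key, backends):
    # Rendezvous hashing: the same key always maps to the same backend,
    # as long as that backend is in the candidate set.
    return max(backends,
               key=lambda b: hashlib.sha256(f"{key}|{b}".encode()).digest())

key = prefix_key("model-x", "tok-v1", "tenant-42", [101, 2023, 2003])
print(pick_backend(key, ["gpu-0", "gpu-1", "gpu-2"]))

Rendezvous hashing is attractive here because removing a backend only remaps the keys that lived on it; warm caches on the surviving backends stay warm.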

The SGLang Model Gateway documentation recommends cache-aware routing by default and discusses session affinity for cases where cache efficiency matters. The Kubernetes Gateway API Inference Extension describes request scheduling that is KV-cache- and request-cost-aware. TensorRT-LLM's KV cache event API exists precisely so upstream applications can understand cache state.

The direction is clear: load balancers are becoming cache managers.

Implementation guardrails

The first version does not need to be fancy. It needs to be hard to fool:

  • Hash prefixes only after tokenizer and model identity are known.
  • Keep cache state freshness visible; stale maps should reduce confidence automatically.
  • Put tenant boundaries into the cache key unless sharing is explicitly allowed.
  • Cap the cache-affinity bonus so one hot prefix cannot overload a worker forever.
  • Record why a route was chosen, including cache score and queue penalty.
  • Fall back to load-based routing when cache confidence is low.

The last bullet matters most. KV-aware routing should improve the default path, not become a single point of creative failure.
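
A sketch of those guardrails in code, with invented constants: confidence decays as the cache map ages, the affinity bonus is capped, the decision is logged, and low confidence degrades to plain load-based routing:

import logging
from collections import namedtuple

log = logging.getLogger("router")

Backend = namedtuple("Backend",
                     "name cached_prefix_tokens queue_depth cache_map_age_s")

MAX_CACHE_BONUS = 16_000   # hypothetical cap, in "prefill tokens avoided"
HALF_LIFE_S = 0.5          # made-up confidence half-life for the cache map

def cache_confidence(map_age_s):
    # A stale cache map should automatically count for less.
    return 0.5 ** (map_age_s / HALF_LIFE_S)

def choose(backends):
    best, best_score = None, float("-inf")
    for b in backends:
        conf = cache_confidence(b.cache_map_age_s)
        # Cap the affinity bonus so one hot prefix cannot pin a worker forever.
        bonus = min(b.cached_prefix_tokens * conf, MAX_CACHE_BONUS)
        if conf < 0.25:
            bonus = 0.0  # low confidence: fall back to pure load-based routing
        score = bonus - 500.0 * b.queue_depth
        # Record why this route was (or was not) chosen.
        log.info("backend=%s cache_bonus=%.0f queue=%d score=%.0f",
                 b.name, bonus, b.queue_depth, score)
        if score > best_score:
            best, best_score = b, score
    return best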

Failure modes

KV-aware routing can go wrong in fun ways:

Stale cache map. The router thinks a backend has blocks that were evicted 200 milliseconds ago.

Hot prefix overload. One popular prefix attracts too much traffic to a small set of workers.

Memory pressure inversion. A cache hit lands on a backend that must evict more valuable blocks to accept the request.

Tokenizer mismatch. Prefix hashes are only meaningful for the same tokenizer and model variant.

Privacy boundaries. Shared cache must respect tenant isolation. A cache hit is not worth a compliance incident.

This is why the implementation needs conservative fallbacks. If cache state is uncertain, route by cost and load. If tenant boundaries are unclear, do not share. If the score is close, pick the backend with healthier SLO margin.

The NVIDIA stack

NVIDIA’s advantage here is that TensorRT-LLM and Dynamo expose the right hooks. TensorRT-LLM gives fine-grained KV cache behavior inside the engine. Dynamo adds distributed routing, KV transfer, and block management across a fleet. On NVLink-connected systems, the hardware fabric makes cache movement and multi-GPU coordination much more attractive than it would be on a weak interconnect.

That does not make vLLM or SGLang second-class. Both are excellent and both have serious cache stories. The bigger point is that routing is becoming a first-class part of inference. NVIDIA is leaning into that whole-stack view aggressively, and I think that is the correct bet.

Closing

The old load balancer wanted to spread requests evenly. The inference router wants to spend computation wisely.

Once prompts are long, conversations are multi-turn, and agents reuse tool schemas all day, cache locality becomes money. Round-robin does not know that. A KV-aware router does.

The future gateway will ask a better question before routing:

Where can this request be served with the least new work while staying inside SLO?

That is the question. Everything else is just moving packets with confidence.

Sources and receipts