Why Round-Robin Dies in LLM Serving: KV-Aware Routing Explained
Round-robin is a wonderful baseline.
It is easy to explain, easy to implement, and hard to misconfigure in funny ways. Every request gets its turn. Every backend feels included. The dashboard looks fair. Everyone goes home.
Then you put an LLM behind it.
Suddenly, fairness by request count is not fairness by cost. One request may contain a short prompt and emit ten tokens. Another may carry a 40,000-token history, a system prompt, tool schemas, and retrieved documents. A third may be the next turn of a session whose prefix is already warm on one worker.
Treating those three requests as equal is not simplicity. It is accounting fraud with better branding.
KV-aware routing exists because the expensive work in LLM serving is not distributed evenly across requests. Some work has already been paid for. A good router should notice.
The core problem: prefill is expensive and repeatable
LLM generation has two broad phases.
Prefill processes the input prompt and creates KV cache. Decode uses that cache to generate output tokens one step at a time.
If a later request shares a large prefix with an earlier request, it may be able to reuse KV cache. That reuse can reduce prefill work and improve time to first token. The catch is simple: the request has to land where the reusable state exists, or the system needs an efficient way to find and move that state.
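To make the stakes concrete, here is the arithmetic in a few lines of Python. This is a back-of-the-envelope sketch, not anything from Dynamo, and the numbers are illustrative.

```python
def prefill_tokens_needed(prompt_tokens: int, cached_prefix_tokens: int) -> int:
    """Prefill work remaining if a worker already holds a shared prefix.
    A warm worker processes only the uncached suffix; a cold worker
    processes the whole prompt."""
    return max(prompt_tokens - cached_prefix_tokens, 0)

# The same request, two destinations (illustrative numbers):
print(prefill_tokens_needed(12_000, 11_000))  # warm worker: 1,000 tokens
print(prefill_tokens_needed(12_000, 0))       # cold worker: 12,000 tokens
```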
That is the moment where classic load balancing starts to wobble.
Round-robin sees workers.
KV-aware routing sees workers plus memory of prior computation.
What Dynamo’s router actually considers
The Dynamo router guide describes the KV Router as evaluating computational costs across workers. It considers decoding costs from active blocks and prefill costs from newly computed blocks, using KV cache overlap to reduce redundant computation.
That last phrase matters: the router is not merely sticky sessions with a nicer name.
Sticky sessions say, “Send this user back to the same worker.”
KV-aware routing says, “Estimate the cost of sending this prompt to each worker, accounting for cache overlap and load, then choose the lowest-cost safe option.”
That is a better abstraction. It can handle repeated system prompts across many users, tool schemas reused across sessions, agent turns with shared histories, and workers whose cache state changes over time.
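One common shape for that overlap estimate, consistent with the prefix hashes mentioned later in this post, is block-level prefix hashing: hash the prompt's token blocks in a chain, then count how many leading blocks a worker already advertises. The sketch below describes the general technique, not Dynamo's actual implementation; `BLOCK_SIZE` and both function names are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 64  # tokens per KV block; engine-dependent and illustrative

def prefix_block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes of fixed-size token blocks: each hash commits to
    the entire prefix up to and including that block, so matching
    hashes imply matching prefixes (for the same tokenizer)."""
    hashes = []
    running = hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        running.update(repr(token_ids[i:i + BLOCK_SIZE]).encode())
        hashes.append(running.copy().hexdigest())
    return hashes

def overlap_blocks(prompt_hashes: list[str], worker_blocks: set[str]) -> int:
    """Count leading prompt blocks a worker already holds."""
    count = 0
    for h in prompt_hashes:
        if h not in worker_blocks:
            break
        count += 1
    return count
```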
The docs list several routing modes, including round-robin, random, KV, least-loaded, device-aware weighted, direct, and standalone router configurations. KV mode is the interesting production path for cache reuse because it evaluates cache overlap and load. There are also multiple KV event transport modes: NATS Core, JetStream, ZMQ, and an approximate no-events mode where the router predicts cache state from its own decisions with TTL expiration.
That design is nicely pragmatic. Real systems do not always get perfect state. Dynamo gives you options along the freshness-versus-operational-complexity spectrum.
A routing score is not a religion
The most important guardrail: cache hit value is a signal, not a commandment.
A worker with a warm prefix may still be the wrong destination if it is overloaded, memory pressured, or likely to violate the request’s SLO. A cold worker with lots of headroom may be better for a tiny prompt. A warm worker may be perfect for a long repeated prompt. A router that always chases cache locality can accidentally create a hot spot.
The useful mental model is a scorecard.
A simplified score might look like this:
route_score =
    cache_overlap_value
    - estimated_prefill_cost
    - decode_queue_penalty
    - memory_pressure_penalty
    - topology_penalty
    - SLO_risk

Do not copy that into production and blame me when a GPU develops opinions. The point is conceptual: routing is a cost model, not a turn-taking game.
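For the curious, here is that scorecard as a runnable Python sketch. Every field name, weight, and threshold below is a placeholder assumption, not a recommendation; Dynamo exposes its own tuning knobs, discussed later.

```python
from dataclasses import dataclass

@dataclass
class WorkerView:
    overlap_tokens: int      # estimated reusable prefix on this worker
    prompt_tokens: int       # total prompt length of the request
    decode_queue_depth: int  # requests decoding or queued
    kv_occupancy: float      # fraction of KV memory in use, 0..1
    topology_penalty: float  # e.g. cross-node KV transfer cost
    slo_risk: float          # 0 (safe) .. 1 (likely to miss SLO)

def route_score(w: WorkerView,
                overlap_weight: float = 1.0,
                queue_weight: float = 50.0,
                memory_weight: float = 5000.0,
                slo_weight: float = 1000.0) -> float:
    """Higher is better. The structure mirrors the scorecard above;
    every weight is a made-up placeholder."""
    estimated_prefill_cost = max(w.prompt_tokens - w.overlap_tokens, 0)
    return (overlap_weight * w.overlap_tokens
            - estimated_prefill_cost
            - queue_weight * w.decode_queue_depth
            - memory_weight * max(w.kv_occupancy - 0.9, 0.0)
            - w.topology_penalty
            - slo_weight * w.slo_risk)

def pick_worker(views: dict[str, WorkerView]) -> str:
    """Choose the lowest-cost (highest-score) worker."""
    return max(views, key=lambda name: route_score(views[name]))
```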
Why prompt overlap is more valuable than request count
Imagine an enterprise assistant with a 6,000-token system prompt and 12,000 tokens of tools, examples, policy, and retrieved context. The next user turn adds only 300 new tokens.
If the request lands on a worker with the 18,000-token prefix already cached, the system may only need to process the new 300-token suffix, roughly 1.6 percent of the prefill work. If it lands cold, the system may redo all 18,300 tokens. From the load balancer’s point of view, both paths are “one request.” From the GPU’s point of view, those paths are not even cousins.
This is why request count is a weak load signal for LLMs.
Better signals include:
- input sequence length
- output sequence length
- prefix overlap
- active decode blocks
- prefill tokens in flight
- KV cache occupancy
- expected cache reuse
- queue depth per latency class
- tenant or model boundary
Dynamo’s router guide exposes several related knobs, including --router-mode kv, --router-kv-overlap-score-weight, --router-track-prefill-tokens, queue thresholds, and queue policies. The exact tuning depends on workload, but the direction is clear: count tokens and state, not just requests.
The funny failure modes
KV-aware routing can fail in creative ways if the platform gets too confident.
Stale cache map. The router believes worker A has blocks that were evicted milliseconds ago. The route looks clever until prefill starts from scratch.
Hot prefix overload. A popular prefix attracts too much traffic to a small set of workers. Cache wins locally, queue loses globally.
Tenant boundary mistake. A cache hit is not useful if sharing would violate isolation rules. The cache key must include security boundaries unless sharing is explicitly allowed.
Tokenizer mismatch. Prefix hashes are meaningful only for compatible model and tokenizer behavior. Similar text is not enough.
Memory pressure inversion. A request lands on a warm worker, but accepting it evicts more valuable cache needed by active sessions.
The answer is not to give up on KV-aware routing. The answer is to make confidence visible and conservative. If cache state is stale, reduce its score. If two routes are close, choose the healthier worker. If tenant boundaries are unclear, do not share. If the hot prefix is cooking one worker, cap the affinity bonus.
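Those rules translate naturally into guardrails applied to the overlap signal before it influences the score. A minimal sketch under assumed thresholds; the staleness window, affinity cap, and tenant rule here are illustrative choices, not Dynamo defaults.

```python
def guarded_overlap_bonus(overlap_bonus: float,
                          cache_state_age_s: float,
                          same_tenant: bool,
                          affinity_cap: float = 2000.0,
                          stale_after_s: float = 5.0) -> float:
    """Make cache confidence visible and conservative before it
    influences a routing score. All thresholds are illustrative."""
    if not same_tenant:
        # A cache hit across an isolation boundary is not a hit at all.
        return 0.0
    # Cap the affinity bonus so a hot prefix cannot cook one worker.
    bonus = min(overlap_bonus, affinity_cap)
    if cache_state_age_s > stale_after_s:
        # Stale cache map: trust the overlap signal less, not more.
        bonus *= 0.5
    return bonus
```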
Aggregated and disaggregated routing are different games
Dynamo supports aggregated and disaggregated topologies.
In aggregated mode, a single worker pool handles the full request lifecycle. KV-aware routing can pick a worker based on cache overlap and current load.
In disaggregated mode, prefill and decode live in separate pools. The frontend routes to a prefill worker first, then to a decode worker. That makes the routing problem richer because the platform must reason about prompt work, decode load, KV handoff, and topology.
The basic principle remains the same:
Send work where total cost is lowest, not where the next turn in a rotation happens to point.
But the cost model needs more inputs. Is the prefill pool saturated? Is the decode pool waiting on KV transfer? Are prefill and decode workers physically close enough for the split to pay off? Is the prefix already warm somewhere that can serve decode efficiently?
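In cost-model terms, the decision becomes a joint choice over a prefill worker and a decode worker, with KV handoff as an explicit term. A sketch of the shape, with hypothetical names and the simplifying assumption that pairwise transfer costs are known up front.

```python
from itertools import product

def disagg_route(prefill_cost: dict[str, float],
                 decode_cost: dict[str, float],
                 kv_transfer_cost: dict[tuple[str, str], float]) -> tuple[str, str]:
    """Pick the (prefill worker, decode worker) pair with the lowest
    total estimated cost; pairs without a known transfer path are
    ruled out via an infinite cost."""
    return min(
        product(prefill_cost, decode_cost),
        key=lambda pair: (prefill_cost[pair[0]]
                          + kv_transfer_cost.get(pair, float("inf"))
                          + decode_cost[pair[1]]),
    )
```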
This is where Dynamo’s architecture looks less like a load balancer and more like an inference operating layer.
A practical rollout pattern
If I were introducing KV-aware routing in a production environment, I would not start with maximum cleverness.
I would start with four milestones.
1. Establish the baseline. Measure round-robin or least-loaded behavior by prompt length, TTFT, TPOT, cache hit rate, and tail latency. Do not use average latency alone. Averages are where tails go to hide.
2. Enable KV-aware routing for a controlled workload. Pick a model and traffic class with obvious prefix reuse: multi-turn chat, tool-heavy agents, repeated system prompts, or coding assistants.
3. Observe route explanations. Record why a route won: overlap, estimated prefill savings, queue penalty, memory pressure, and fallback reason (a sketch of such a record follows this list). A router that cannot explain itself is an incident report waiting to happen.
4. Add guardrails. Bound cache affinity, respect tenant boundaries, expose stale-state confidence, and define fallbacks when event streams are unhealthy.
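To make step 3 concrete, here is one hypothetical shape for a route-decision record. The fields come from the list above; everything else, including the function name and the logging format, is an assumption about how you might implement it.

```python
import json
import time

def explain_route(chosen: str, candidates: dict[str, dict],
                  fallback_reason: str | None = None) -> str:
    """Emit a structured record of why a route won, so routing
    decisions can be audited after the fact. Illustrative shape only."""
    return json.dumps({
        "ts": time.time(),
        "chosen_worker": chosen,
        "fallback_reason": fallback_reason,  # set when event streams are unhealthy
        "candidates": candidates,
    })

print(explain_route(
    "worker-a",
    {
        "worker-a": {"overlap_tokens": 18000, "prefill_savings": 18000,
                     "queue_penalty": 2, "memory_pressure": 0.41},
        "worker-b": {"overlap_tokens": 0, "prefill_savings": 0,
                     "queue_penalty": 0, "memory_pressure": 0.12},
    },
))
```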
The success metric is not “KV-aware routing is enabled.” The success metric is that users get faster first tokens and the fleet spends less GPU work recomputing prefixes.
Where NVIDIA’s stack has a useful advantage
The strong part of Dynamo’s approach is not that it invented cache reuse. Engines and frameworks already have excellent cache stories.
The strong part is that Dynamo tries to lift cache awareness into the distributed serving layer. The router can make cache-aware decisions. KV events make worker state visible. KVBM gives a path for memory-tier-aware KV management. NIXL gives the broader stack a data movement foundation.
That combination matters because a single-worker cache optimization is valuable, but a fleet-level cache optimization changes the economics of the service.
This is also where the NVIDIA ecosystem has a clean platform story. The same company building optimized GPU inference engines is also building the routing, memory, transfer, and scaling pieces around them. You can still use other engines, and the docs explicitly discuss backend-agnostic support. But the full-stack direction is obvious and, frankly, sensible.
The rule of thumb
Round-robin is fine when requests are equal, stateless, and cheap to repeat.
LLM requests are rarely all three.
Once prompts get long, sessions get sticky, agents reuse tools, and KV cache becomes meaningful, routing has to become cost-aware. The router should know which worker already paid the prefill bill, whether that worker is still healthy enough to serve the request, and when the cache win is not worth the queue risk.
That is why round-robin dies in serious LLM serving.
Not because it is bad software.
Because it is asking a web-era question in a token-era system.
Sources and receipts
- Dynamo routing modes, KV routing, and event transport options: KV Cache Aware Routing.
- Dynamo architecture notes on KV-aware routing and cache state: Overall Architecture.
- Backend support caveats: Feature Matrix.
