Disaggregated Inference on Kubernetes: Routing, Scheduling, and Scaling Beyond One GPU
At small scale, LLM serving is charmingly direct: run a model server, put an API in front of it, send traffic. At larger scale, the charm exits quietly and leaves you with topology, queues, cache transfer, device placement, and a dashboard that suddenly has opinions.
Disaggregated inference is what happens when we admit that prefill and decode are different jobs and stop forcing them onto the same worker. Kubernetes is what happens when we try to make that admission operationally survivable.
This post is the practical architecture: what runs where, what the gateway needs to know, and why plain round-robin starts looking very 2016.
The disaggregated shape
A disaggregated serving path normally has:
- A gateway or router that accepts OpenAI-compatible traffic.
- Prefill workers that process input prompts and build KV cache.
- Decode workers that generate output tokens.
- A KV transfer path between phases.
- A scheduler or planner that keeps the ratio of prefill to decode capacity sane.
The gateway cannot just pick a pod. It needs to ask:
- Which model or adapter is requested?
- How many input tokens are coming?
- Is this an interactive request or a batch request?
- Is there existing KV cache affinity?
- Which decode pool has tokens-per-second headroom?
- Which placement avoids expensive cross-rack KV movement?
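A minimal sketch of what acting on those questions can look like inside a router. Everything here (the DecodePool fields, the weights, the rack heuristic) is an illustrative assumption, not any particular gateway's API:

```python
from dataclasses import dataclass

@dataclass
class DecodePool:
    name: str
    model: str
    tokens_per_sec_headroom: float  # measured decode throughput headroom
    kv_prefix_hit: bool             # pool already holds KV for a matching prefix
    same_rack_as_prefill: bool      # placement relative to the chosen prefill worker

def score_pool(pool: DecodePool, model: str, input_tokens: int) -> float:
    """Higher is better; the weights are invented for illustration."""
    if pool.model != model:
        return float("-inf")          # wrong model or adapter: never route here
    score = pool.tokens_per_sec_headroom
    if pool.kv_prefix_hit:
        score += 50.0                 # cache reuse beats re-prefilling the prompt
    if not pool.same_rack_as_prefill:
        score -= 0.01 * input_tokens  # longer prompts mean more KV bytes to move
    return score

def pick_pool(pools: list[DecodePool], model: str, input_tokens: int) -> DecodePool:
    return max(pools, key=lambda p: score_pool(p, model, input_tokens))
```

Note what round-robin cannot express here: the right answer changes with the prompt length.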
That is why Kubernetes SIG Network created the Gateway API Inference Extension. It adds inference-specific routing concepts on top of the Gateway API model. The project documentation explicitly calls out model-aware routing, cost-aware scheduling, and KV-cache-aware behavior. Translation: the platform community is acknowledging that a Service and round-robin are not enough.
Scheduling is now part of serving
Classic Kubernetes scheduling mostly decides where a pod should start. LLM inference needs more. You need placement decisions that understand accelerators, topology, model shape, and runtime state.
Kubernetes Dynamic Resource Allocation (DRA) is an important step because it gives the platform a richer way to request and match devices such as GPUs. That does not solve inference by itself, but it gives higher-level systems a better substrate. A pod asking for “some GPU” is different from a workload asking for devices with a specific capability, topology, or sharing behavior.
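To see why the substrate matters, compare the two request shapes below, written as Python dicts standing in for manifests. The legacy form is the familiar counted resource; the DRA form is a structured claim. Treat the exact keys, API version, device class name, and CEL expression as assumptions to verify against your cluster's DRA documentation:

```python
# The old shape: "give me some GPU", an opaque counted resource.
legacy_request = {
    "resources": {"limits": {"nvidia.com/gpu": 1}}
}

# The DRA shape: a structured claim that can express capability and
# selection constraints. Keys and the CEL expression are illustrative.
dra_claim = {
    "apiVersion": "resource.k8s.io/v1beta1",  # version varies by release
    "kind": "ResourceClaim",
    "metadata": {"name": "decode-gpu-claim"},
    "spec": {
        "devices": {
            "requests": [{
                "name": "decode-gpu",
                "deviceClassName": "gpu.example.com",  # hypothetical class
                "selectors": [{
                    "cel": {  # e.g. only devices with at least 80Gi of memory
                        "expression": 'device.capacity["memory"].compareTo(quantity("80Gi")) >= 0'
                    }
                }],
            }]
        }
    },
}
```

The point is not the syntax; it is that the second request is something a topology-aware scheduler can actually reason about.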
NVIDIA’s recent work around GPU DRA and the broader Kubernetes ecosystem direction matter here because inference clusters are becoming less like fleets of generic nodes and more like carefully wired compute fabrics. On a 72-GPU NVLink domain, topology is not decoration. It is the shape of your performance envelope.
The KV transfer bill
Disaggregation moves KV cache between prefill and decode. That transfer is the tax you pay for separating the phases. Sometimes the tax is worth it. Sometimes it eats the benefit.
The right question is not “Should we disaggregate?” It is:
benefit from independent scaling
− cost of KV transfer
− operational complexity
= whether this is a good idea

This is where high-bandwidth interconnects and runtime-level KV movement matter. NVIDIA Dynamo includes NIXL for low-latency inference data transfer and KV-cache management across tiers. llm-d integrates vLLM with Kubernetes-native routing and distributed serving ideas. The ecosystem is converging on the same answer: inference needs a control plane that understands tokens and cache, not just HTTP.
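A toy version of the transfer side of that ledger, with invented sizes and link speeds (measure your own model and fabric before trusting any of it):

```python
def kv_transfer_seconds(input_tokens: int, bytes_per_token: int, link_gb_per_s: float) -> float:
    """Rough time to move one request's KV cache between pools."""
    return input_tokens * bytes_per_token / (link_gb_per_s * 1e9)

# Toy numbers, roughly a 7B-class fp16 model:
# 32 layers * 2 (K and V) * 4096 hidden dim * 2 bytes = 512 KiB per token.
BYTES_PER_TOKEN = 32 * 2 * 4096 * 2

print(kv_transfer_seconds(8_192, BYTES_PER_TOKEN, 50.0))   # fast scale-up fabric: ~0.09 s
print(kv_transfer_seconds(8_192, BYTES_PER_TOKEN, 12.5))   # 100GbE-class link:   ~0.34 s
```

If that transfer time is a rounding error next to the prefill work you reclaimed, disaggregation can pay. If it rivals your time-to-first-token (TTFT) budget, it will not.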
Autoscaling gets weird
Horizontal Pod Autoscaling based on CPU is mostly irrelevant for GPU inference. Even GPU utilization alone can lie. A decode worker may show decent utilization while queueing causes bad TPOT (time per output token). A prefill pool may look idle and then get flattened by a burst of long prompts.
Useful scaling signals include:
- queued requests by phase
- prompt token backlog
- decode token backlog
- TTFT and TPOT percentiles
- KV cache occupancy
- cache hit rate
- per-model and per-tenant demand
- accelerator memory pressure
The autoscaler should not merely add replicas. It should adjust the prefill/decode ratio. If output lengths grow, decode needs more capacity. If input contexts grow, prefill and KV transfer become the pressure points.
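A hedged sketch of what ratio-aware scaling could look like. The signal names, thresholds, and constants are all assumptions; the point is that prefill and decode scale on their own backlogs and SLOs, not on a shared utilization number:

```python
from dataclasses import dataclass

@dataclass
class PhaseSignals:
    token_backlog: int     # prompt tokens queued (prefill) or output tokens pending (decode)
    p95_latency_ms: float  # TTFT for the prefill pool, TPOT for the decode pool
    replicas: int

def desired_replicas(sig: PhaseSignals, slo_ms: float, tokens_per_replica_per_s: float) -> int:
    """Scale each phase on backlog and SLO breach. Scale-down is omitted;
    every constant here is invented for illustration."""
    # Enough replicas to drain the current backlog in roughly 10 seconds.
    by_backlog = sig.token_backlog / (tokens_per_replica_per_s * 10)
    # If the p95 is over budget, grow proportionally to the breach.
    by_slo = sig.replicas * (sig.p95_latency_ms / slo_ms) if sig.p95_latency_ms > slo_ms else 0
    return max(1, sig.replicas, round(by_backlog), round(by_slo))

# Independent targets per phase, which is the whole point:
prefill = PhaseSignals(token_backlog=900_000, p95_latency_ms=1_800, replicas=4)
decode = PhaseSignals(token_backlog=250_000, p95_latency_ms=45, replicas=8)
print(desired_replicas(prefill, slo_ms=1_500, tokens_per_replica_per_s=20_000))  # 5: grow prefill
print(desired_replicas(decode, slo_ms=50, tokens_per_replica_per_s=4_000))       # 8: hold decode
```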
This is why I like the direction of Dynamo’s SLO planner and Kubernetes-native inference gateways. They move the conversation from “pods are up” to “the service objective is being met.” That is the right level of abstraction.
What can go wrong
Disaggregation fails when teams split the architecture before they can observe the phases. The common traps are:
- KV transfer becomes the new bottleneck. The diagram looks distributed; the trace looks like waiting.
- Autoscaling chases the wrong pool. Prefill grows while decode is the pressure point, or the reverse.
- Topology is ignored. A request crosses a slow boundary because the scheduler saw available GPUs, not the fabric between them.
- Cache ownership is unclear. Nobody knows whether the prefill worker, decode worker, or router owns eviction decisions.
- Fallback paths are missing. When the KV transfer path is unhealthy, the service should degrade to aggregated serving or reject gracefully, not improvise.
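For that last trap, a minimal sketch of an explicit degradation ladder, assuming a hypothetical health probe for the transfer path (the names and the 0.2 headroom floor are invented):

```python
import enum

class ServingMode(enum.Enum):
    DISAGGREGATED = "disaggregated"  # separate prefill/decode pools, KV moves between them
    AGGREGATED = "aggregated"        # both phases on one worker, no KV transfer
    SHED = "shed"                    # reject with backpressure instead of improvising

def choose_mode(transfer_path_healthy: bool, aggregated_headroom: float) -> ServingMode:
    if transfer_path_healthy:
        return ServingMode.DISAGGREGATED
    if aggregated_headroom > 0.2:
        return ServingMode.AGGREGATED  # degrade: slower, but correct
    return ServingMode.SHED            # protect the cluster; return 429s upstream
```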
The test I like: turn off one decode pool, double the long-context traffic, and watch whether the planner explains itself. If the dashboards only say “pods restarted,” the platform is not ready for this architecture yet.
A pragmatic adoption path
Do not jump straight to the fanciest architecture because a diagram looked handsome.
Start here:
1. Run aggregated serving with strong metrics.
2. Add prefix caching and sensible batching.
3. Enable chunked prefill if long prompts hurt streaming.
4. Add an inference-aware gateway when routing by model and request cost matters.
5. Move to disaggregated serving when prefill/decode interference is proven.
6. Add topology-aware placement when KV transfer becomes material.
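Steps 2 and 3 are often just engine configuration rather than new architecture. A sketch using vLLM's offline API; these options exist in recent vLLM releases, but check your version's docs before leaning on them:

```python
from vllm import LLM, SamplingParams

# Prefix caching reuses KV cache across shared prompt prefixes; chunked
# prefill interleaves long-prompt prefill with decode so streaming stays smooth.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any model you already serve
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
)
out = llm.generate(["Explain KV cache in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```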
NVIDIA’s stack is especially compelling at steps 4-6 because the pieces line up: optimized engines, GPU-aware deployment, cache-aware distributed serving, and hardware designed for high-bandwidth scale-up. But the architecture principle is portable. Measure the phases. Route with context. Schedule for SLOs.
Kubernetes is not magically “good at AI” just because a pod has a GPU. Kubernetes becomes good at AI when the APIs above it understand what AI workloads actually need.
That is the work now. Less YAML poetry, more token-aware control loops.
Sources and receipts
- Kubernetes Gateway API Inference Extension: Kubernetes announcement and project repository.
- Kubernetes Dynamic Resource Allocation: official docs.
- NVIDIA Dynamo: Dynamo overview and Dynamo architecture.
- NVIDIA disaggregated inference on Kubernetes: technical blog.
- llm-d: project site and GitHub repository.
- DistServe: research paper.