Dynamo Is Not an Inference Engine. It Is the Control Plane for Tokens

The easiest way to misunderstand Dynamo is to put it in the wrong box.

If the box says “inference engine,” the conversation immediately becomes a fake horse race:

  • Dynamo versus vLLM
  • Dynamo versus SGLang
  • Dynamo versus TensorRT-LLM

That framing is tidy, clickable, and mostly wrong.

Dynamo is better understood as the distributed runtime around the engines. The engines still do the local token math: batching, attention kernels, decode scheduling, memory layout, and GPU execution. Dynamo handles the fleet-shaped problems that appear once a model service becomes real: routing, cache visibility, prefill and decode placement, autoscaling, KV transfer, failure handling, and observability.

The engine makes tokens.

Dynamo tries to make token production reliable, cheap, and boring at service scale. Boring is a compliment here. In production, boring means fewer 2 a.m. mysteries.

[Figure: the Dynamo control plane stack, with clients, frontend, router, and inference engines across the request, control, and state planes. Dynamo coordinates the serving system around engines such as vLLM, SGLang, and TensorRT-LLM.]

The engine is not the whole service

An inference engine is already a serious piece of software. It decides how requests are batched, how KV cache is allocated inside a worker, how attention runs, how tokens are streamed, and how the GPU stays fed. That work is difficult. No serious production team should trivialize it.

But a distributed LLM service has a second layer of difficulty.

The request that enters the platform is not just “run model.” It has a tenant, an SLO, a prompt length, a potential cache hit, a session history, a cancellation path, a stream consumer, and maybe an agent loop waiting on tool calls. The service has multiple workers, changing queues, GPU memory pressure, cache events, model replicas, prefill-heavy requests, decode-heavy requests, and failures that arrive with excellent timing.

That is where the control plane matters.

The Dynamo documentation makes this split explicit. It describes Dynamo as a distributed inference runtime for generative AI systems, backend-agnostic across SGLang, TensorRT-LLM, vLLM, and others. It also describes three cooperating concerns: a fast request path, a responsive control path, and a resilient state path. That is the language of a distributed system, not just a kernel optimizer.

The three planes are the mental model

I like to think about Dynamo through three planes.

Plane                     | What it owns                                  | Why it matters
Request plane             | Frontend, router, workers, token streaming    | Keeps the user-facing path fast and predictable
Control plane             | Planner, operator, discovery, scaling targets | Aligns capacity with demand instead of hoping static sizing survives traffic
Storage and events plane  | KV events, KVBM, NIXL transfer                | Makes cache reuse and cross-worker state movement possible

This split is useful because it stops one component from being asked to do everything.

The frontend should not be a hidden autoscaler. The router should not guess cache state forever without events. The engine should not be responsible for cluster-wide placement. The autoscaler should not scale solely from a generic machine metric and call it inference-aware.

Good systems get boring by giving each part a crisp job.

Why “above the engine” is the right place

There is a practical reason this layer belongs above the engine: production teams do not all standardize on the same backend.

Some teams like vLLM because it is fast-moving, widely adopted, and friendly to experimentation. Some teams like SGLang because its runtime model is strong for structured and high-throughput serving patterns. Some teams want TensorRT-LLM because they care deeply about optimized performance on NVIDIA platforms. Many serious teams run more than one backend over time.

If routing, scaling, KV movement, and observability are welded into one engine-specific stack, every backend change becomes a platform migration.

Dynamo’s more interesting bet is that the fleet layer should be modular. You can adopt the router, planner, KVBM, or NIXL-oriented pieces as needed, then compose more of the system over time. The official docs call out modular adoption directly, including the idea of starting with a single component such as the router.

That is a very practical design choice. It meets teams where they are, which is usually somewhere between “we have a working model server” and “we have accidentally invented a small distributed operating system around it.”
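
To make that concrete, here is a minimal sketch of what "adopt the router first" can look like: a thin proxy in front of existing OpenAI-compatible engine endpoints, where the placement policy is one swappable function. The worker URLs, `pick_worker`, and `route_completion` are illustrative names for this sketch, not Dynamo's API surface.

```python
# Minimal sketch of "adopt the router first": a thin proxy that picks an
# engine worker and forwards an OpenAI-style completion request. Worker
# URLs and function names are illustrative, not Dynamo's API surface.
import random

import httpx  # assumed HTTP client dependency

WORKERS = ["http://engine-0:8000", "http://engine-1:8000"]  # hypothetical endpoints

def pick_worker(prompt: str) -> str:
    # Day one: plain random balancing. Later, swap this one function for a
    # KV-aware policy without touching the engines or the clients.
    return random.choice(WORKERS)

async def route_completion(payload: dict) -> dict:
    base = pick_worker(payload.get("prompt", ""))
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{base}/v1/completions", json=payload, timeout=None)
        resp.raise_for_status()
        return resp.json()
```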

[Figure: the boundary between inference engine responsibilities and Dynamo's distributed runtime responsibilities. The engine optimizes local token generation; the runtime optimizes the service around many workers.]

The request path is only half the story

The request path has to be fast. No argument there. When a user is waiting for a first token, nobody wants a philosopher-router writing a dissertation about cache locality.

But LLM serving also needs control loops:

  • a serving loop that keeps frontend, router, prefill, and decode workers moving
  • a planning loop that watches metrics and computes prefill/decode targets
  • a resilience loop that drains unhealthy endpoints, sheds load, and avoids cascading failure

That last word, “loop,” is important. A production inference service is not configured once. It is constantly adapting.

Traffic changes. Prompt length changes. Output length changes. A new product feature adds a giant tool schema. A batch of agent requests starts doing long-context research. A model update changes KV cache pressure. A GPU node disappears at the most theatrical possible moment.

Static architecture diagrams do not survive contact with Tuesday afternoon.

Dynamo’s Planner, Router, KVBM, and event paths are interesting because they form feedback channels. The service sees more of itself, and then it can make better decisions.
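
As a hedged illustration of what such a feedback channel does, here is a deliberately naive planning loop: observe workload shape, derive separate prefill and decode replica targets, apply them, and repeat. The metric names and the sizing rule are assumptions for the sketch, not Dynamo's Planner logic.

```python
# A deliberately naive sketch of the planning-loop idea: observe workload
# shape, derive separate prefill and decode replica targets, apply, repeat.
# Metric names and the sizing rule are assumptions, not Dynamo's Planner.
import asyncio
from dataclasses import dataclass

@dataclass
class Snapshot:
    prefill_tokens_per_s: float          # incoming prompt tokens across the fleet
    decode_tokens_per_s: float           # emitted output tokens across the fleet
    prefill_capacity: float = 50_000.0   # assumed per-replica prefill throughput
    decode_capacity: float = 5_000.0     # assumed per-replica decode throughput

def targets(s: Snapshot) -> tuple[int, int]:
    # Separate targets per phase: prefill-heavy traffic should not force
    # decode scale-up, and vice versa.
    prefill = max(1, round(s.prefill_tokens_per_s / s.prefill_capacity))
    decode = max(1, round(s.decode_tokens_per_s / s.decode_capacity))
    return prefill, decode

async def planning_loop(observe, apply, interval_s: float = 30.0) -> None:
    while True:  # the word "loop" is the point: this never runs just once
        prefill, decode = targets(observe())
        apply(prefill, decode)
        await asyncio.sleep(interval_s)
```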

KV cache turns routing into state management

Traditional load balancers are comfortable with statelessness. Pick a backend. Move on. Maybe add least-connections. Maybe add weights. It has worked for years because web requests are usually not carrying a massive hidden computation debt from previous turns.

LLM requests are different.

If a worker already has KV cache for a prompt prefix, routing the next related request there can skip expensive prefill work. If the router sends the request somewhere cold, the platform recomputes a prefix it may have already paid for.

That is why cache state becomes a first-class part of serving. Dynamo’s router can use KV cache overlap and load. KVBM can manage KV blocks across memory tiers. NIXL provides a data movement layer for KV and other transfers. KV events make cache lifecycle visible to the rest of the system.
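
A toy version of that placement decision makes the trade-off visible: score each worker by how many leading prompt blocks it already holds, discounted by its current load. The block size, hashing scheme, and `load_weight` here are illustrative assumptions, not Dynamo's actual cost function.

```python
# Sketch of a KV-aware placement decision: prefer the worker whose cached
# prefix overlaps the request most, discounted by current load. The block
# size, hashing scheme, and load weight are illustrative assumptions.
from dataclasses import dataclass

BLOCK = 64  # tokens per KV block (assumed)

@dataclass
class Worker:
    name: str
    cached_blocks: set[int]   # hashes of prefix blocks this worker holds
    active_requests: int

def prefix_block_hashes(token_ids: list[int]) -> list[int]:
    # Hash each full block of the prompt; matching leading blocks on a
    # worker mean that much prefill can be skipped.
    return [hash(tuple(token_ids[i:i + BLOCK]))
            for i in range(0, len(token_ids) - BLOCK + 1, BLOCK)]

def choose(workers: list[Worker], token_ids: list[int], load_weight: float = 2.0) -> Worker:
    prefix = prefix_block_hashes(token_ids)
    def score(w: Worker) -> float:
        overlap = 0
        for h in prefix:  # overlap only counts while the prefix is unbroken
            if h in w.cached_blocks:
                overlap += 1
            else:
                break
        return overlap - load_weight * w.active_requests
    return max(workers, key=score)
```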

This is the part of inference that feels less like old HTTP and more like a distributed database with GPUs attached. The data is not rows. It is reusable model state.

Disaggregation needs a runtime, not only a flag

Disaggregated serving separates prefill and decode. That sounds simple until you try to operate it.

Prefill wants to process long prompts quickly. Decode wants to emit one token at a time with tight latency. Those phases stress the system differently, so splitting them can improve utilization and scaling. But the split creates new obligations:

  • pick a prefill worker
  • compute KV state
  • move or expose that state to a decode worker
  • pick a decode worker
  • stream tokens back
  • update cache visibility for future turns
  • handle the failure cases without turning the service into soup

Dynamo’s architecture documentation describes this disaggregated request narrative directly: frontend, router, prefill, transfer metadata, decode, token stream, KV events, and possible KVBM offload or recall.
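
Read as pseudocode, that narrative is an orchestration problem. The sketch below walks the same steps with placeholder objects; every name in it (`router`, `transport`, `events`, `kv_handle`) is hypothetical, standing in for real runtime machinery such as a NIXL-backed transfer path.

```python
# Hedged sketch of the disaggregated request narrative described above.
# Every object here (router, transport, events, kv_handle) is a placeholder
# standing in for real runtime machinery, e.g. a NIXL-backed transfer path.
from typing import AsyncIterator

async def serve_disaggregated(request, router, transport, events) -> AsyncIterator[str]:
    prefill_worker = router.pick_prefill(request)        # pick a prefill worker
    kv_handle = await prefill_worker.prefill(request)    # compute KV state

    decode_worker = router.pick_decode(request)          # pick the decode worker first,
    await transport.move(kv_handle, to=decode_worker)    # because the move needs a destination

    try:
        async for token in decode_worker.decode(kv_handle):
            yield token                                  # stream tokens back
    finally:
        # Runs on success, cancellation, or failure: cache visibility updates
        # must not depend on the happy path, or future routing goes stale.
        await events.publish_kv_update(decode_worker, kv_handle)
```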

The point is not that every workload must be disaggregated. The point is that once you do disaggregate, the coordination layer becomes the product.

Compatibility matters, and details matter more

Dynamo’s docs are careful about backend support, and the blog should be careful too.

The feature matrix says disaggregated serving, KV-aware routing, and SLA-based Planner support exist across SGLang, TensorRT-LLM, and vLLM. It also notes that feature depth varies. For example, KVBM support is listed for TensorRT-LLM and vLLM in the quick comparison, while the detailed notes give backend-specific caveats around multimodal routing, request migration, LoRA, and speculative decoding.

That is exactly how production platforms usually evolve. The architecture may be clean, but support is still shaped by backend integration work.

So the honest takeaway is not “everything works everywhere the same way.” The honest takeaway is better:

Dynamo is trying to create a common distributed serving layer while allowing engines to keep their own strengths.

That is the right ambition. It is also the hard one.

What I would look for in a real deployment

If I were reviewing a Dynamo deployment, I would not start by asking whether the YAML looks advanced. I would ask operational questions:

  • What are the TTFT and TPOT targets? (Both metrics are computed in the sketch after this list.)
  • Which requests are prefill-heavy versus decode-heavy?
  • How often does prefix reuse actually happen?
  • Does the router have fresh enough cache state?
  • Are tenant boundaries part of the cache key?
  • What happens when a worker fails mid-request?
  • Is scale-down safe for in-flight work?
  • Are Planner targets based on the right workload shape?
  • Can the team explain why a request went to a specific worker?
  • Are backend-specific feature gaps documented before rollout?
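
The first question on that list deserves precision, so here is how the two numbers fall out of raw request timestamps. The trace fields are illustrative names; the definitions are the standard ones: TTFT captures queueing plus prefill, TPOT captures steady-state decode pace.

```python
# TTFT and TPOT from raw timestamps -- the two latency numbers the first
# checklist question is about. A minimal sketch; field names are illustrative.
from dataclasses import dataclass

@dataclass
class RequestTrace:
    t_request: float        # request accepted (seconds, monotonic clock)
    t_first_token: float    # first output token sent
    t_last_token: float     # final output token sent
    n_output_tokens: int

def ttft(tr: RequestTrace) -> float:
    # Time To First Token: dominated by queueing and prefill.
    return tr.t_first_token - tr.t_request

def tpot(tr: RequestTrace) -> float:
    # Time Per Output Token: average inter-token gap after the first token.
    return (tr.t_last_token - tr.t_first_token) / max(1, tr.n_output_tokens - 1)
```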

The best inference systems are explainable to operators. If the only answer to “why did this request go there?” is a shrug and a log line from three services ago, the control plane is not done yet.

The slightly opinionated take

Dynamo is interesting because it treats inference as a systems problem.

That sounds obvious, but many stacks still treat model serving like a web service with a large GPU-shaped dependency. Add an endpoint, add replicas, add a load balancer, add dashboards, hope for the best.

Hope is not a scheduling algorithm.

LLM serving needs to understand tokens, KV cache, prompt length, decode pressure, topology, and SLOs. Agentic serving adds priority, speculative prefill, tool gaps, and cache lifecycle hints. At that point the platform needs a real control plane.

This is where Dynamo’s design lands well. It does not ask the engine to stop being excellent at local inference. It asks the surrounding system to stop being naive.

That is the right shape for the AI cloud: powerful engines below, a token-aware runtime around them, and enough visibility that operators can keep the whole thing boring.

Sources and receipts