From Prefill to Decode: Disaggregated Inference as a Distributed Systems Problem

Disaggregated inference sounds wonderfully clean when said quickly.

“Separate prefill and decode.”

There. Done. Architecture solved. Someone order the celebratory coffee.

Except the hard part is not drawing two boxes. The hard part is making those boxes behave like one reliable service while KV state moves between them, queues change shape, workers fail, and users still expect first tokens to arrive before they finish wondering if the app froze.

Prefill and decode are different workloads. Splitting them can be powerful. It can also add enough coordination cost to eat the benefit if the platform is not designed for it.

That is why disaggregated inference is not just an engine feature. It is a distributed systems problem with GPUs in the middle.

[Figure: disaggregated inference flow — request, prefill pool, KV transfer path, decode pool, streaming response.]
Disaggregation separates prompt processing from token generation, then pays the price of moving state.

Prefill and decode stress the system differently

Prefill processes the input prompt. It is bursty, parallelizable, and often dominated by the size of the input sequence. Long context, big tool schemas, large retrieved documents, and multi-turn conversation history all make prefill heavier.

Decode generates output tokens one at a time. It is latency-sensitive, sequential, and shaped by active sessions. It cares about time per output token, fairness across streams, and keeping GPU work scheduled efficiently while many requests are mid-generation.
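A back-of-envelope model makes the asymmetry concrete. The constants below are invented for illustration; the shape of the formulas is the point. Prefill scales with input length and parallelizes. Decode scales with output length and does not:

```python
# Toy latency model for the prefill/decode asymmetry.
# All constants are illustrative assumptions, not measurements.

def prefill_seconds(input_tokens: int, tokens_per_sec: float = 20_000.0) -> float:
    """Prefill is roughly compute-bound: the whole prompt is processed in parallel."""
    return input_tokens / tokens_per_sec

def decode_seconds(output_tokens: int, sec_per_token: float = 0.03) -> float:
    """Decode is sequential: one forward pass per generated token."""
    return output_tokens * sec_per_token

# RAG-style request: ~0.8s of prefill before any token, then ~6s of streaming.
print(prefill_seconds(16_000), decode_seconds(200))

# Chat-style request: near-instant prefill, then ~45s of streaming.
print(prefill_seconds(500), decode_seconds(1_500))
```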

Putting both phases in one worker pool is simple. Simplicity is valuable. But simple can become expensive when one phase interferes with the other.

For example:

  • long prompts can delay decode work for users already waiting on streamed tokens
  • decode-heavy traffic can occupy memory and scheduler attention while prefill waits
  • batch shape can become awkward because prefill and decode have different compute profiles
  • scaling one phase means scaling both, even when only one is constrained

Disaggregation says: let prefill and decode have separate capacity, separate tuning, and separate scaling.

That is a strong idea. It is also the beginning of the problem, not the end.

The request narrative has more steps now

Dynamo’s architecture docs describe a disaggregated request flow roughly like this:

  1. Client sends the request to the frontend.
  2. Frontend validates and forwards to the router.
  3. Router chooses a prefill worker.
  4. Prefill computes KV and returns transfer metadata.
  5. Router chooses a decode worker.
  6. Decode receives KV state, typically through the NIXL transfer path.
  7. Decode streams tokens back through the frontend.
  8. KV events update cache visibility.
  9. KVBM may offload or recall KV blocks based on pressure and reuse potential.
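To see how many moving parts one request now touches, here is a minimal structural sketch of that flow. Every class and function name is a stand-in, not Dynamo's API:

```python
# Structural sketch of the disaggregated request path above.
# All names are hypothetical; they mirror the steps, not Dynamo's interfaces.
import asyncio
from dataclasses import dataclass

@dataclass
class KVHandle:
    """Transfer metadata returned by prefill (step 4)."""
    worker_id: str
    block_ids: list

class PrefillWorker:
    async def prefill(self, prompt: str) -> KVHandle:
        # Step 4: compute KV for the whole prompt, return where the blocks live.
        return KVHandle("prefill-0", block_ids=list(range(len(prompt) // 16 + 1)))

class DecodeWorker:
    async def receive_kv(self, handle: KVHandle) -> None:
        await asyncio.sleep(0)  # stand-in for the NIXL transfer path (step 6)

    async def generate(self):
        for tok in ("Dis", "aggregation", " works."):
            yield tok  # step 7: tokens stream back through the frontend

async def handle_request(prompt: str) -> None:
    prefill = PrefillWorker()               # step 3: router picks a prefill worker
    handle = await prefill.prefill(prompt)  # step 4
    decode = DecodeWorker()                 # step 5: router picks a decode worker
    await decode.receive_kv(handle)         # step 6
    async for token in decode.generate():
        print(token, end="")                # step 7
    print()
    # Step 8 would publish KV events here so future routing can reuse blocks.

asyncio.run(handle_request("Explain disaggregated inference"))
```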

That list is the entire blog post hiding in plain sight.

Once you split phases, every request carries a state handoff. The platform has to decide where prefill happens, where decode happens, how KV state moves, how cache events are published, and how future requests learn from what just happened.

In other words: the runtime becomes a logistics system for tokens.

KV transfer is the tax

Disaggregation creates a tradeoff.

You gain phase-specific scaling and better utilization opportunities. You pay with transfer cost, topology sensitivity, and more operational moving parts.

The tradeoff is workload-dependent. A long prefill followed by substantial decode may benefit. A tiny prompt with a tiny output may not. A cluster with fast fabric and close worker placement has more room to win than a topology where prefill and decode are far apart. A system with fresh cache events and good routing can make better decisions than one guessing from stale state.

[Figure: the disaggregation tradeoff — phase-specific scaling benefits versus KV transfer and topology costs.]
The split wins only when the measured benefit beats the state movement cost.
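One way to keep yourself honest about that tradeoff is a break-even check: the split pays off only when the interference you remove is worth more than the transfer you add. A toy estimate, with every constant an assumption:

```python
# Toy break-even check for disaggregation. KV size per token and fabric
# bandwidth are illustrative assumptions, not measured values.

def kv_transfer_ms(input_tokens: int, kv_bytes_per_token: int = 160_000,
                   fabric_gbps: float = 400.0) -> float:
    """Time to move the prompt's KV state across the fabric."""
    bits = input_tokens * kv_bytes_per_token * 8
    return bits / (fabric_gbps * 1e9) * 1e3

def worth_splitting(input_tokens: int, interference_saved_ms: float) -> bool:
    """Split only if the saved interference beats the transfer tax."""
    return interference_saved_ms > kv_transfer_ms(input_tokens)

print(kv_transfer_ms(32_000))          # ~102 ms of transfer for a long prompt
print(worth_splitting(32_000, 250.0))  # True: interference savings dominate
print(worth_splitting(512, 1.0))       # False: tiny prompt, tiny benefit
```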

This is why the data movement layer matters. Dynamo’s docs position NIXL as a low-latency data transfer layer used to enable KV movement between workers and memory domains. KVBM adds a block-oriented view of KV memory and can work across tiers such as GPU memory, pinned host memory, remote RDMA-accessible memory, SSD, and object storage.

Those are not decorative details. They are the mechanics that make disaggregation possible without turning every request into a manual state migration.
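To make the tiering idea concrete, here is a deliberately naive placement policy across the tiers the KVBM docs name. The policy itself is invented for illustration; it is not how KVBM actually decides:

```python
# Illustrative tiered placement for KV blocks, ordered fastest to slowest.
# The tiers mirror those named in the KVBM docs; the scoring policy is invented.
from enum import Enum

class Tier(Enum):
    GPU = 0           # active decode state
    PINNED_HOST = 1   # fast recall over PCIe
    REMOTE_RDMA = 2   # another node's memory domain
    SSD = 3
    OBJECT_STORE = 4  # cold, cheap, slow

def place_block(reuse_score: float, gpu_pressure: float) -> Tier:
    """Keep hot blocks close to compute; demote cold ones under pressure."""
    if gpu_pressure < 0.8 or reuse_score > 0.9:
        return Tier.GPU
    if reuse_score > 0.6:
        return Tier.PINNED_HOST
    if reuse_score > 0.3:
        return Tier.REMOTE_RDMA
    return Tier.SSD if reuse_score > 0.1 else Tier.OBJECT_STORE

print(place_block(reuse_score=0.95, gpu_pressure=0.9))  # Tier.GPU
print(place_block(reuse_score=0.05, gpu_pressure=0.9))  # Tier.OBJECT_STORE
```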

Aggregated serving is not the enemy

One trap in infrastructure writing is treating a technique like a moral upgrade.

Aggregated serving is not bad. It is simpler, easier to reason about, and often a good default. If your prompts are modest, outputs are predictable, traffic is not extreme, and operations need to stay lean, one worker pool can be the right answer.

Disaggregated serving is better when the workload shape justifies it.

Look for signals like:

  • large input sequence length compared with output length
  • latency targets that separate TTFT and TPOT pressure
  • bursty prompt-heavy traffic
  • long-running decode streams interfering with new prompt processing
  • high concurrency with uneven request shapes
  • enough fabric performance to move KV state cheaply
  • operators who can debug a multi-pool system under pressure

That last bullet is not a joke. Every clever architecture becomes a support burden if the team cannot explain it during an incident.

Dynamo’s disaggregated serving guide leans into measurement through AIConfigurator. The tool evaluates aggregated versus disaggregated options, estimates prefill and decode worker configurations, and uses workload inputs such as input sequence length, output sequence length, TTFT, TPOT, backend, and system type. That is the right posture: let the workload shape drive the architecture.
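Those inputs map naturally onto a small workload spec. The field names below are mine, not AIConfigurator's schema; the point is that the decision is driven by shape and targets, not vibes:

```python
# Illustration of the workload inputs named in the disaggregated serving
# guide. Field names are hypothetical; AIConfigurator's real schema may differ.
from dataclasses import dataclass

@dataclass
class WorkloadSpec:
    input_seq_len: int      # ISL: tokens of prompt, tools, history
    output_seq_len: int     # OSL: tokens generated
    ttft_target_ms: float   # time-to-first-token SLO
    tpot_target_ms: float   # time-per-output-token SLO
    backend: str            # e.g. "trtllm" or "vllm"
    system: str             # GPU/system type being evaluated

rag = WorkloadSpec(input_seq_len=12_000, output_seq_len=300,
                   ttft_target_ms=800, tpot_target_ms=40,
                   backend="trtllm", system="h200")
# The evaluation then compares aggregated vs disaggregated configurations
# against this shape rather than against generic utilization.
```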

TTFT and TPOT are the scoreboards

Disaggregation is mostly about controlling two user-visible metrics.

Time To First Token (TTFT) is the waiting room. It captures how long a user waits before generation starts. Long prefill, routing delays, cache misses, and queueing can all hurt TTFT.

Time Per Output Token (TPOT), sometimes discussed alongside inter-token latency, is the streaming experience. It captures how smoothly tokens continue once output begins. Decode pressure, batching choices, memory pressure, and scheduling fairness all show up here.
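Both metrics fall straight out of per-token arrival timestamps, so they are cheap to instrument at the frontend. A minimal sketch:

```python
# Computing TTFT and TPOT from per-token arrival timestamps (seconds).

def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        # Average gap between consecutive output tokens.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot

# 1.2s to first token, then a steady ~50ms per token.
print(ttft_and_tpot(0.0, [1.20, 1.25, 1.30, 1.35]))  # (1.2, 0.05)
```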

Scaling by generic utilization hides these differences. A GPU may look busy while TTFT is awful because the prefill side is constrained. Another may look reasonably utilized while decode queues make streaming feel sticky. Users do not experience utilization. They experience waiting and streaming.

The cleanest disaggregated systems make those metrics visible per phase.

If TTFT is failing, ask whether prefill capacity, cache misses, or routing are the bottleneck. If TPOT is failing, ask whether decode capacity, active blocks, batch shape, or memory pressure is the bottleneck. If both fail, congratulations, at least the diagnosis is not subtle.

KVBM changes the memory conversation

KV cache is not just “some memory.” It is reusable computation state with a lifecycle.

Dynamo’s KVBM documentation describes a scalable runtime component for allocation, management, and remote sharing of KV blocks across heterogeneous and distributed environments. It frames KVBM as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM, with integration into NIXL for memory registration, sharing, and access.

The key idea is block management.

Instead of treating KV cache as an opaque pile inside a worker, the system can reason about blocks: allocate, register, match, reuse, evict, offload, and recall. That matters in disaggregated serving because state may need to cross worker boundaries and memory tiers.

This is also where mature platforms start to separate themselves. Anyone can say “use cache.” The hard part is deciding which cache blocks are valuable, where they live, how they move, when they expire, and what to do under pressure.
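To make "which blocks are valuable" concrete: reuse usually hinges on prefix matching, because a KV block is only valid if everything before it is identical. Here is a sketch of chained block hashing, with the block size and hash scheme as assumptions rather than KVBM's actual mechanism:

```python
# Sketch of prefix-based block matching, the mechanism that makes KV blocks
# reusable across requests. Block size and hashing are illustrative assumptions.
import hashlib

BLOCK = 16  # tokens per KV block (illustrative)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chain-hash full blocks so equal hashes imply an identical prefix."""
    hashes, running = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, usable, BLOCK):
        running.update(str(token_ids[i:i + BLOCK]).encode())
        hashes.append(running.copy().hexdigest()[:12])
    return hashes

cache = set(block_hashes(list(range(64))))            # earlier request's blocks
new = block_hashes(list(range(64)) + [99] * 32)       # shares a 64-token prefix
reused = [h for h in new if h in cache]
print(f"{len(reused)}/{len(new)} blocks reusable")    # 4/6 blocks reusable
```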

Operational questions before deployment

Before I would call a disaggregated serving deployment production-ready, I would want answers to these questions:

  • What traffic shape was used to choose disaggregated over aggregated?
  • What are the target TTFT and TPOT values?
  • How are prefill and decode pools scaled independently?
  • What is the measured KV transfer cost?
  • Are prefill and decode workers placed with topology awareness?
  • What happens if a prefill worker fails after computing part of the KV state?
  • What happens if a decode worker waits on KV that never arrives?
  • How fresh are KV events?
  • How does scale-down handle in-flight requests?
  • Can operators correlate one user request across frontend, router, prefill, transfer, decode, and stream?

If those questions feel annoying, that is because they are the real architecture. The diagram is just the polite version.
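The correlation question in particular can be answered cheaply: one request ID, one structured event per hop. A minimal sketch, with illustrative field names:

```python
# Sketch of per-request correlation across stages: one request_id,
# one structured log event per hop. Field names are illustrative.
import json, time, uuid

def emit(request_id: str, stage: str, **fields) -> None:
    print(json.dumps({"request_id": request_id, "stage": stage,
                      "ts": time.time(), **fields}))

rid = str(uuid.uuid4())
emit(rid, "frontend.accept", prompt_tokens=12_000)
emit(rid, "router.prefill_choice", worker="prefill-3", cache_hit_blocks=512)
emit(rid, "prefill.done", kv_blocks=750)
emit(rid, "transfer.done", bytes_moved=1_200_000_000, ms=95)
emit(rid, "decode.first_token", ttft_ms=740)
emit(rid, "stream.close", output_tokens=300, tpot_ms=38)
# Grepping one request_id now reconstructs the whole handoff chain.
```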

Why Dynamo’s design is promising here

Dynamo has a useful advantage because it treats disaggregated serving as a system of cooperating pieces:

  • Router chooses workers using workload and cache signals.
  • NIXL provides a foundation for high-speed data movement.
  • KVBM gives KV blocks a managed lifecycle across memory tiers.
  • Planner can reason about prefill and decode targets.
  • Kubernetes integration can represent separate worker groups and scaling targets.
  • Feature matrix documentation keeps backend support explicit instead of hand-wavy.

That is a strong direction. It does not mean every workload should flip to disaggregated mode tomorrow morning. It means the platform has the right nouns for the problem.

When a technology gives you the right nouns, you can start having real design conversations.

The practical takeaway

Disaggregated inference is not “faster” in the abstract.

It is a trade:

phase-specific scaling + lower interference
minus
KV transfer cost + topology complexity + operational complexity

For the right workload, the trade is excellent. For the wrong workload, it is a very fancy way to move state around and feel productive.

The job of a serious inference platform is to measure the shape of traffic, split phases when the data says to, keep KV movement cheap, and make the control plane observable enough that humans can operate it.

Dynamo’s approach is compelling because it does not pretend disaggregation is just a flag. It gives the split a runtime, a router, a memory manager, a transfer layer, and a planner.

That is what this problem deserves.

Sources and receipts