19/20 - Prefill-Decode Disaggregation: Two Worker Pools, One Token Stream

#disaggregated-serving #prefill #decode #nvidia-dynamo #kv-transfer

Prefill and decode run the same model but behave like different workloads. Prefill favors large compute-heavy matrix operations. Decode repeatedly reads weights and KV state for small token steps. Disaggregation gives each phase its own worker pool and scaling plan.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A restaurant separates food preparation from table service. The prep kitchen handles large batches of ingredients; servers handle frequent small interactions. The handoff must be fast, or specialization makes the meal slower.

Follow the state and work from left to right.

Description: The flow starts at the Frontend, which routes the prompt to a prefill worker. The middle stage, KV transfer, moves or exposes the KV state. The final stage, the Decode pool, produces the observable result: the generated token stream. The arrows describe dependency order, not necessarily separate services.

What actually happens

A prefill worker processes the prompt and produces KV cache state. Transfer metadata identifies blocks and endpoints. A decode worker acquires that state, continues autoregressive generation, and streams results through the frontend.

Independent pools let operators scale for input-token load and output-token concurrency separately. Prefill and decode can also use different tensor-parallel degrees or hardware shapes, provided KV layouts can be transferred or transformed correctly.

The transfer path is critical. Direct GPU-to-GPU movement over NVLink, or over an RDMA fabric such as InfiniBand or RoCE, can keep the handoff cost below the compute time saved. TCP fallback, topology mistakes, mismatched block layouts, or serialized copies can erase the benefit.

A worked example

A retrieval workload sends 16,000-token prompts and generates 200-token answers. Prefill workers become compute-bound while decode workers need capacity for many active sequences. With a 2P:6D layout (two prefill workers, six decode workers), each pool can scale independently. If a KV handoff would take longer than prefilling locally, as it can for short prompts, conditional routing should keep those prompts local.
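To make the handoff cost in this example concrete, here is a minimal sketch of the KV size arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) and both link bandwidths are assumptions chosen for illustration; only the 16,000-token prompt comes from the example above.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def transfer_seconds(total_bytes: int, bandwidth_bytes_per_s: float) -> float:
    # Ignores per-transfer setup latency; bandwidth-bound estimate only.
    return total_bytes / bandwidth_bytes_per_s


# Hypothetical model shape (assumed, not taken from the article).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
prompt_kv = per_token * 16_000  # the 16,000-token prompt from the worked example

print("KV bytes per token:", per_token)
print("KV for the prompt (GiB):", prompt_kv / 2**30)
# Both bandwidths are assumptions: a fast direct path and a slow fallback path.
for name, bandwidth in {"fast link, 50 GB/s": 50e9, "slow fallback, 1 GB/s": 1e9}.items():
    print(name, "->", transfer_seconds(prompt_kv, bandwidth), "s")
```

Under these assumptions the prompt produces about 1.95 GiB of KV state, which moves in roughly 42 ms at 50 GB/s and about 2.1 s at 1 GB/s. The same request can therefore be a clear win or a clear loss depending only on the transfer path.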

The performance model

Goodput is the rate of requests completed while meeting both the time-to-first-token (TTFT) and time-per-output-token (TPOT) SLOs. Disaggregation removes phase interference and decouples the parallelism choices of the two phases, but adds routing, queueing, and KV transfer. Compare against an aggregated baseline under an identical workload.
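The goodput definition can be written down directly. This is a sketch, assuming per-request TTFT and mean TPOT have already been measured; the `RequestResult` type and the SLO values in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestResult:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token, in seconds


def goodput(results, window_s: float, ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that met BOTH SLOs during the measurement window."""
    good = sum(
        1 for r in results
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / window_s


# Hypothetical measurements: the third request misses TTFT, the fourth misses TPOT.
sample = [
    RequestResult(0.40, 0.030),
    RequestResult(0.90, 0.045),
    RequestResult(2.50, 0.030),
    RequestResult(0.50, 0.120),
]
print(goodput(sample, window_s=10.0, ttft_slo_s=1.0, tpot_slo_s=0.05))
```

A request that misses either SLO contributes nothing, so raw throughput can rise while goodput falls. Running the same function over the aggregated and disaggregated runs keeps the comparison honest.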

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how phase disaggregation changes prompt processing and TTFT; the right panel asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Disaggregation should often be conditional. A short uncached prompt may be faster on the decode worker; a long prompt belongs in the prefill pool. Prefix overlap, prefill queue age, decode load, and topology can drive the decision.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, Aggregated worker, characterized by colocated prefill and decode and no KV network handoff. The right panel applies Disaggregated pools, which changes the cost profile: P and D scale independently, but KV state must move. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Long prompts with strict decode latency
Different prefill and decode scaling pressure
Clusters with fast topology-aware KV transfer

Where it disappoints

Assuming disaggregation always wins
Benchmarking through local port forwarding
Silently falling back from RDMA to TCP
Mismatching model, dtype, TP, or block layout

Production checklist

Benchmark aggregated and disaggregated baselines
Validate transport inside the cluster
Route with topology and queue awareness
Check KV layout compatibility
Implement a graceful fallback to local prefill on decode workers

What to measure

Prefill queue and decode queue age
KV transfer latency and bandwidth
TTFT and TPOT SLO attainment
P-worker and D-worker utilization
Local versus remote prefill decisions

From one GPU to a production service

A demonstration starts one P and one D worker. Production needs discovery, compatible-pair selection, topology constraints, transfer authentication, draining, version skew protection, and independent autoscalers whose actions do not create queue oscillation.

Model rollout is a distributed transaction. P and D workers must agree on model revision, tokenizer, adapter state, KV dtype, head layout, block size, and transfer protocol. Route only within a compatibility generation.
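One way to enforce a compatibility generation is to hash the agreed fields into a fingerprint and route only between workers whose fingerprints match. The sketch below is an assumption about how that could look, not a description of any particular framework; the field names and the `KVCompat` type are hypothetical. Tensor-parallel degree is left out on purpose, since the phases may use different degrees when the layout can be transformed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KVCompat:
    model_revision: str
    tokenizer: str
    adapter: str
    kv_dtype: str
    num_kv_heads: int
    head_dim: int
    block_size: int
    transfer_protocol: str


def fingerprint(c: KVCompat) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal configs hash equally.
    payload = json.dumps(asdict(c), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compatible_decode_workers(prefill_fp: str, decode_workers: dict) -> list:
    """decode_workers maps worker id -> fingerprint; return ids in the same generation."""
    return [worker for worker, fp in decode_workers.items() if fp == prefill_fp]


old = KVCompat("rev-a", "tok-1", "none", "fp16", 8, 128, 16, "v1")
new = KVCompat("rev-b", "tok-1", "none", "fp16", 8, 128, 16, "v1")
pool = {"d0": fingerprint(old), "d1": fingerprint(new), "d2": fingerprint(old)}
print(compatible_decode_workers(fingerprint(old), pool))
```

During a rollout both generations can coexist, because a P worker on the old revision only ever sees D workers with the old fingerprint.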

Failure policy should be phase-aware. Before KV export, another P worker can retry. During transfer, idempotent block identifiers help. After decode emits tokens, replay may duplicate output, so the stream needs a clear terminal error or resumable protocol.
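The phase-aware policy above can be sketched as a small decision function. The phase and action names are hypothetical. Treating a decode failure before any token has been emitted as safely retryable is an assumption inferred from the paragraph, not something it states.

```python
from enum import Enum


class Phase(Enum):
    BEFORE_EXPORT = "before_export"
    TRANSFERRING = "transferring"
    DECODING = "decoding"


class Action(Enum):
    RETRY_PREFILL = "retry_prefill_on_another_worker"
    RESEND_BLOCKS = "resend_blocks_by_idempotent_id"
    RESUME = "resume_stream"
    TERMINAL_ERROR = "emit_terminal_error"


def on_failure(phase: Phase, tokens_emitted: int, resumable: bool) -> Action:
    if phase is Phase.BEFORE_EXPORT:
        # No state has left the P worker, so another P worker can start over.
        return Action.RETRY_PREFILL
    if phase is Phase.TRANSFERRING:
        # Idempotent block identifiers make resending a block harmless.
        return Action.RESEND_BLOCKS
    if tokens_emitted == 0:
        # Assumption: nothing reached the client yet, so a replay cannot duplicate output.
        return Action.RETRY_PREFILL
    # Tokens already streamed: replay would duplicate them.
    return Action.RESUME if resumable else Action.TERMINAL_ERROR


print(on_failure(Phase.DECODING, tokens_emitted=12, resumable=False).value)
```

Keeping the decision in one function makes the retry contract reviewable in isolation from the transport code.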

Design-review questions

What compatibility fingerprint binds P and D workers?
Is transfer authenticated and tenant-scoped?
When does conditional routing choose local prefill?
How do independent autoscalers avoid oscillation?
What is retryable at each handoff state?

How it connects to the rest of the series

KV caching creates the state being transferred. Tensor parallelism can differ by phase. Chunked prefill is the simpler aggregated alternative, and streaming generation begins after decode admission.

From equation to implementation

The break-even test compares remote prefill queue plus prefill compute plus KV transfer against local prefill interference and compute. Prefix hits reduce the uncached tokens and can move a request back below the remote threshold. Routing should use uncached prefill length, not raw prompt length.
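A minimal sketch of that break-even test follows. It makes several simplifying assumptions that a real router would need to revisit: prefill time is treated as linear in uncached tokens, local interference is a single additive estimate, and the matched prefix is assumed to be cached where decode runs, so only uncached KV has to move. The `Estimates` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimates:
    remote_queue_s: float              # current wait in the prefill pool
    remote_prefill_s_per_token: float  # assumed linear prefill cost on a P worker
    kv_bytes_per_token: int
    link_bytes_per_s: float
    local_prefill_s_per_token: float   # prefill cost on a busy decode worker
    local_interference_s: float        # estimated stall imposed on running decodes


def use_remote_prefill(prompt_tokens: int, prefix_match_tokens: int, e: Estimates) -> bool:
    # Route on uncached tokens, not raw prompt length.
    uncached = max(prompt_tokens - prefix_match_tokens, 0)
    remote = (
        e.remote_queue_s
        + uncached * e.remote_prefill_s_per_token
        + uncached * e.kv_bytes_per_token / e.link_bytes_per_s
    )
    local = e.local_interference_s + uncached * e.local_prefill_s_per_token
    return remote < local


est = Estimates(0.05, 0.00005, 131072, 50e9, 0.0001, 0.0)
print(use_remote_prefill(16_000, 0, est))       # long uncached prompt
print(use_remote_prefill(16_000, 15_500, est))  # same prompt with a large prefix hit
```

With these assumed numbers the fully uncached prompt goes remote, while the same prompt with 15,500 cached prefix tokens stays local because the fixed queue cost now dominates.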

KV transfer metadata must identify source blocks, destination layout, transport endpoint, and lifecycle. Different TP degrees may require reshaping KV heads or block layout. Transfer completion must be ordered before decode reads, while copies should overlap unrelated GPU work.

Implementation sketch

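async def handle_request(request, prefill_router, decode_router, queues, topology):
    # Route on uncached tokens: prefix hits shrink the prefill work.
    uncached = request.prompt_tokens - request.prefix_match_tokens
    # remote_prefill_benefit is the break-even test, defined elsewhere.
    if remote_prefill_benefit(uncached, queues, topology):
        # Remote path: prefill on a P worker, then hand KV state to a compatible D worker.
        p = prefill_router.select_worker(request.model, topology)
        transfer_state = p.prefill_and_export(request)
        d = decode_router.select_compatible_worker(transfer_state)
        # Decode must not read KV blocks until the import has completed.
        await d.import_kv_async(transfer_state)
    else:
        # Local path: the decode worker runs prefill itself.
        d = decode_router.select_worker(request.model)
        d.prefill_locally(request)
    return d.stream_decode(request)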

Capacity planning

Size P and D pools from input-token and output-token demand separately, then include transfer headroom. A prefill worker outage should not deadlock the service; decode workers need a bounded local-prefill fallback or explicit overload response.
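As a rough sizing sketch, each pool can be derived from its own token demand, with transfer headroom checked separately. The per-worker throughputs, the 70% target utilization, the one spare worker, and the traffic rate in the usage lines are all assumptions for illustration; measured values should replace them.

```python
import math


def pool_size(demand_tokens_per_s: float, worker_tokens_per_s: float,
              target_utilization: float = 0.7, spare: int = 1) -> int:
    """Workers needed to serve the demand at the target utilization, plus spares."""
    needed = demand_tokens_per_s / (worker_tokens_per_s * target_utilization)
    return math.ceil(needed) + spare


def transfer_link_utilization(requests_per_s: float, kv_bytes_per_request: float,
                              link_bytes_per_s: float) -> float:
    """Fraction of link bandwidth consumed by KV handoffs."""
    return requests_per_s * kv_bytes_per_request / link_bytes_per_s


# Hypothetical traffic: 5 requests/s, 16,000 uncached input tokens, 200 output tokens.
rps = 5
p_workers = pool_size(rps * 16_000, worker_tokens_per_s=60_000)
d_workers = pool_size(rps * 200, worker_tokens_per_s=400)
link_use = transfer_link_utilization(rps, 2_097_152_000, 50e9)
print(p_workers, d_workers, round(link_use, 3))
```

The two pool sizes come from different demand signals, which is the point of sizing them separately. The link utilization is a third, independent constraint that neither pool size reveals.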

Benchmarking without fooling yourself

Benchmark inside the cluster with the intended transport.
Sweep uncached prompt length and prefix-hit ratio.
Measure aggregated and disaggregated goodput under identical SLOs.
Fail P workers, transfer links, and discovery during active requests.

A production failure to design for

RDMA device resources disappear after a node upgrade and the backend falls back to TCP. Requests still succeed, but KV handoff rises from milliseconds to hundreds of milliseconds. Make transport mode and bandwidth part of readiness, not a debug log.
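A readiness probe along these lines could catch the silent fallback. This is a sketch only: the `TransportProbe` type, the mode strings, and the 10 GB/s floor are hypothetical, and how the mode and bandwidth are actually obtained depends on the transfer backend in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportProbe:
    mode: str                     # e.g. "rdma" or "tcp", as reported by the backend
    measured_gbytes_per_s: float  # result of a short in-cluster bandwidth probe


def ready(probe: TransportProbe, required_mode: str = "rdma",
          min_gbytes_per_s: float = 10.0):
    """Return (is_ready, reason). A degraded transport fails readiness."""
    if probe.mode != required_mode:
        return False, f"transport is {probe.mode}, expected {required_mode}"
    if probe.measured_gbytes_per_s < min_gbytes_per_s:
        return False, (
            f"bandwidth {probe.measured_gbytes_per_s:.1f} GB/s "
            f"below {min_gbytes_per_s:.1f} GB/s"
        )
    return True, "ok"


print(ready(TransportProbe("tcp", 1.0)))
print(ready(TransportProbe("rdma", 20.0)))
```

Failing readiness removes the worker from routing, so the degradation shows up as reduced capacity that operators notice rather than as slow requests that still succeed.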

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Classify to Route, then Transfer and Recover. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Disaggregation places compute-heavy prefill and bandwidth-sensitive decode on separate worker pools. A prefill worker processes the prompt and produces KV state; that state is transferred or published so a decode worker can continue token generation without repeating prefill.

Disaggregation inserts a distributed state-transfer protocol between two inference phases.

Description: Follow the state from Route prefill through Build KV and Transfer to Decode. Each box is an ownership or computation boundary. In particular, the handoff is complete only when decode can read every required block. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

The routing decision must include transfer time. A remote decode worker with an empty queue may lose to a colocated worker when the prompt produces gigabytes of KV. Estimate prefill queue + prefill compute + KV transfer + decode queue + decode service, then choose jointly rather than routing each phase independently.
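Joint selection can be sketched as a search over (P, D) pairs that scores each pair with the full sum. The input shapes and worker names below are hypothetical, and an exhaustive search is only reasonable for small pools; a real router would likely prune candidates first.

```python
from itertools import product


def pick_pair(prefill_workers: dict, decode_workers: dict,
              kv_bytes: float, link_bytes_per_s: dict):
    """Return (p_id, d_id, estimated_seconds) with the lowest end-to-end estimate.

    prefill_workers:  {id: (queue_s, compute_s)}
    decode_workers:   {id: (queue_s, service_s)}
    link_bytes_per_s: {(p_id, d_id): bandwidth}; pairs without an entry are skipped.
    """
    best = None
    for (p, (pq, pc)), (d, (dq, ds)) in product(
        prefill_workers.items(), decode_workers.items()
    ):
        bandwidth = link_bytes_per_s.get((p, d))
        if not bandwidth:
            continue  # no usable transfer path between this pair
        total = pq + pc + kv_bytes / bandwidth + dq + ds
        if best is None or total < best[2]:
            best = (p, d, total)
    return best


p_pool = {"p0": (0.0, 0.8)}
d_pool = {"d_near": (0.1, 2.0), "d_far": (0.0, 2.0)}  # d_far has the empty queue
links = {("p0", "d_near"): 50e9, ("p0", "d_far"): 1e9}
print(pick_pair(p_pool, d_pool, kv_bytes=2_097_152_000, link_bytes_per_s=links))
```

In this example the decode worker with the empty queue loses, because moving about 2 GB of KV over the slow link costs more than the short wait on the nearby worker.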

Disaggregation wins only when specialization repays transfer and queue costs.

Description: The bars compare Colocated worker with Disaggregated pools on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (“Prefill and decode scale on different hardware ratios”) remains larger than the risk (“KV movement and coordination can exceed compute savings”) under production traffic.

KV manifests carry model revision, tokenizer and adapter scope, layer layout, dtype, block order, positions, ownership, and integrity. Decode must reject incompatible state and fall back safely. Partial transfers remain private until an atomic readiness marker publishes the complete manifest.

Single-owner transitions prevent leaks and use of partial KV state.

Description: State advances from Producing to Sealed, Transferring, and finally Adopted. The labels below each state identify what becomes true at that boundary. The governing invariant: failure before adoption leaves prefill responsible for cleanup; after adoption, decode owns the state. Retries and cancellation must preserve the same transition rules.
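The four states and the ownership invariant can be sketched as a small state machine. The class and property names are hypothetical; the point is that ownership and decode readability are derived from the state, never stored separately.

```python
from enum import Enum


class KVState(Enum):
    PRODUCING = 1
    SEALED = 2
    TRANSFERRING = 3
    ADOPTED = 4


# Only forward, single-step transitions are legal.
_NEXT = {
    KVState.PRODUCING: KVState.SEALED,
    KVState.SEALED: KVState.TRANSFERRING,
    KVState.TRANSFERRING: KVState.ADOPTED,
}


class KVHandoff:
    def __init__(self) -> None:
        self.state = KVState.PRODUCING

    def advance(self) -> KVState:
        if self.state not in _NEXT:
            raise RuntimeError("handoff already adopted")
        self.state = _NEXT[self.state]
        return self.state

    @property
    def owner(self) -> str:
        # Exactly one owner at any time: prefill until adoption, decode afterwards.
        return "decode" if self.state is KVState.ADOPTED else "prefill"

    @property
    def readable_by_decode(self) -> bool:
        # Partial state stays private; decode reads only after adoption.
        return self.state is KVState.ADOPTED


h = KVHandoff()
h.advance()  # Sealed
h.advance()  # Transferring
print(h.owner, h.readable_by_decode)
h.advance()  # Adopted
print(h.owner, h.readable_by_decode)
```

On failure or cancellation, cleanup is whatever `owner` says at that moment, which keeps retry logic from having to guess who holds the blocks.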

Operationally, disaggregation is a distributed transaction on inference state.

Description: The four panels are independent review axes: Placement, Transfer, Compatibility, and Recovery. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Trace one request identity across both pools and the transfer service.

A healthy pair of GPU pools can still fail because the state plane is saturated.

Description: This is a causal chain, not four unrelated symptoms. “Transfers slow” triggers “Decode starves”, which creates “Memory pins”. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Disaggregation is not a topology diagram; it is an SLO decision. It wins when phase specialization is worth more than the KV handoff costs.
