19/20 - Prefill-Decode Disaggregation: Two Worker Pools, One Token Stream
Prefill and decode run the same model but behave like different workloads. Prefill favors large compute-heavy matrix operations. Decode repeatedly reads weights and KV state for small token steps. Disaggregation gives each phase its own worker pool and scaling plan.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A restaurant separates food preparation from table service. The prep kitchen handles large batches of ingredients; servers handle frequent small interactions. The handoff must be fast, or specialization makes the meal slower.
How to read this diagram: Start with Frontend, which routes the prompt to a prefill worker. The middle stage, KV transfer, moves or exposes the KV state. The final stage, Decode pool, shows the observable result: the generated token stream. The arrows describe dependency order, not necessarily separate services.
What actually happens
A prefill worker processes the prompt and produces KV cache state. Transfer metadata identifies blocks and endpoints. A decode worker acquires that state, continues autoregressive generation, and streams results through the frontend.
Independent pools let operators scale for input-token load and output-token concurrency separately. Prefill and decode can also use different tensor-parallel degrees or hardware shapes, provided KV layouts can be transferred or transformed correctly.
The transfer path is critical. Direct GPU-to-GPU movement over RDMA, NVLink, InfiniBand, or RoCE can keep the handoff cost below the compute time saved. TCP fallback, topology mistakes, mismatched block layouts, or serialized copies can erase the benefit.
A worked example
A retrieval workload sends 16,000-token prompts and generates 200-token answers. Prefill workers become compute-bound while decode workers need capacity for many active sequences. A 2P:6D split (two prefill workers, six decode workers) lets each pool scale independently. Where the KV handoff would take longer than local prefill, as it can for short prompts, conditional routing should keep those requests local.
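To make the handoff cost in this example concrete, the sketch below estimates the KV size of one 16,000-token prompt and the time needed to move it. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte KV dtype) and the two link bandwidths are illustrative assumptions, not figures from this article.

```python
def kv_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes):
    # Two tensors (K and V) per layer, one head_dim vector per KV head per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


def transfer_seconds(num_bytes, gbytes_per_second):
    # Ideal transfer time at a sustained bandwidth, ignoring setup latency.
    return num_bytes / (gbytes_per_second * 1e9)


# Assumed model shape; substitute the real values for your deployment.
size = kv_bytes(tokens=16_000, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)

fast_link_s = transfer_seconds(size, gbytes_per_second=50)  # assumed fast fabric
slow_link_s = transfer_seconds(size, gbytes_per_second=1)   # assumed slow fallback

print(f"KV size: {size / 1e9:.2f} GB")
print(f"fast link: {fast_link_s * 1e3:.0f} ms, slow link: {slow_link_s:.2f} s")
```

Under these assumptions each request carries roughly 2.1 GB of KV state, which moves in about 42 ms at 50 GB/s but takes about 2 s at 1 GB/s. That gap is why the transport path decides whether the handoff stays cheaper than local prefill.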
The performance model
Goodput means requests completed while meeting both TTFT and TPOT SLOs. Disaggregation removes phase interference and uncouples parallelism, but adds routing, queueing, and KV transfer. Compare against an aggregated baseline with identical workloads.
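The goodput definition above can be written down directly. This is a minimal sketch; the `RequestResult` record and its field names are hypothetical and stand in for whatever your benchmark harness records per request.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ttft_s: float    # time to first token, in seconds
    tpot_s: float    # mean time per output token, in seconds
    completed: bool  # False for errors, timeouts, or cancellations


def goodput(results, window_s, ttft_slo_s, tpot_slo_s):
    # Count only completed requests that meet BOTH SLOs, per second of window.
    good = sum(
        1
        for r in results
        if r.completed and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / window_s
```

Running the same function over aggregated and disaggregated runs of an identical workload gives a like-for-like comparison, because a request that misses either SLO contributes nothing in both cases.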
How to read this diagram: The left panel asks how phase disaggregation changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
Expert lens
Disaggregation should often be conditional. A short uncached prompt may be faster to prefill locally on the decode worker; a long prompt belongs in the prefill pool. Prefix overlap, prefill queue age, decode load, and topology can drive the decision.
How to read this diagram: The left panel is the baseline, Aggregated worker, where prefill and decode are colocated and there is no KV network handoff. The right panel shows Disaggregated pools, where P and D scale independently but KV state must move. Compare both under the same request shape and load; the disaggregated side is not automatically better for every workload.
Where it wins
- Long prompts with strict decode latency
- Different prefill and decode scaling pressure
- Clusters with fast topology-aware KV transfer
Where it disappoints
- Assuming disaggregation always wins
- Benchmarking through local port forwarding
- Silently falling back from RDMA to TCP
- Mismatching model, dtype, TP, or block layout
Production checklist
- Benchmark aggregated and disaggregated baselines
- Validate transport inside the cluster
- Route with topology and queue awareness
- Check KV layout compatibility
- Implement graceful decode-only fallback
What to measure
- Prefill queue and decode queue age
- KV transfer latency and bandwidth
- TTFT and TPOT SLO attainment
- P-worker and D-worker utilization
- Local versus remote prefill decisions
From one GPU to a production service
A demonstration starts one P and one D worker. Production needs discovery, compatible-pair selection, topology constraints, transfer authentication, draining, version skew protection, and independent autoscalers whose actions do not create queue oscillation.
Model rollout is a distributed transaction. P and D workers must agree on model revision, tokenizer, adapter state, KV dtype, head layout, block size, and transfer protocol. Route only within a compatibility generation.
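One way to express a compatibility generation is a fingerprint over the fields listed above. The sketch below is an assumption about how such a fingerprint could be built, not a description of any particular serving framework; the `WorkerCompat` type and its field values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class WorkerCompat:
    model_revision: str
    tokenizer: str
    adapter_state: str
    kv_dtype: str
    head_layout: str
    block_size: int
    transfer_protocol: str

    def fingerprint(self) -> str:
        # Sorted keys give a stable serialization, so equal fields hash equally.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def same_generation(p: WorkerCompat, d: WorkerCompat) -> bool:
    # Route a request only between workers whose fingerprints match.
    return p.fingerprint() == d.fingerprint()
```

Strict equality is the simplest rule. A deployment that deliberately runs different tensor-parallel degrees per phase would need to replace the `head_layout` comparison with a check that a known layout transform exists.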
Failure policy should be phase-aware. Before KV export, another P worker can retry. During transfer, idempotent block identifiers help. After decode emits tokens, replay may duplicate output, so the stream needs a clear terminal error or resumable protocol.
Design-review questions
- What compatibility fingerprint binds P and D workers?
- Is transfer authenticated and tenant-scoped?
- When does conditional routing choose local prefill?
- How do independent autoscalers avoid oscillation?
- What is retryable at each handoff state?
How it connects to the rest of the series
KV caching creates the state being transferred. Tensor parallelism can differ by phase. Chunked prefill is the simpler aggregated alternative, and streaming generation begins after decode admission.
From equation to implementation
The break-even test compares remote prefill queue plus prefill compute plus KV transfer against local prefill interference and compute. Prefix hits reduce the uncached tokens and can move a request back below the remote threshold. Routing should use uncached prefill length, not raw prompt length.
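The break-even test can be sketched as a comparison of two cost estimates. The linear per-token cost model and the `CostEstimate` fields below are simplifying assumptions for illustration; real estimates would come from measured queue ages and profiled prefill and transfer rates.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    remote_queue_s: float              # current wait in the prefill pool
    remote_prefill_s_per_token: float  # prefill cost on a P worker
    kv_transfer_s_per_token: float     # handoff cost for newly built KV
    local_interference_s: float        # delay imposed on active decodes
    local_prefill_s_per_token: float   # prefill cost on the D worker


def uncached_tokens(prompt_tokens: int, prefix_match_tokens: int) -> int:
    # Only tokens without a prefix-cache hit need prefill compute.
    return max(0, prompt_tokens - prefix_match_tokens)


def prefer_remote_prefill(uncached: int, est: CostEstimate) -> bool:
    remote = est.remote_queue_s + uncached * (
        est.remote_prefill_s_per_token + est.kv_transfer_s_per_token
    )
    local = est.local_interference_s + uncached * est.local_prefill_s_per_token
    return remote < local
```

With a fixed remote queue cost, long uncached prompts amortize the handoff and go remote, while a large prefix hit shrinks the uncached length and can tip the same prompt back to local prefill.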
KV transfer metadata must identify source blocks, destination layout, transport endpoint, and lifecycle. Different TP degrees may require reshaping KV heads or block layout. Transfer completion must be ordered before decode reads, while copies should overlap unrelated GPU work.
Implementation sketch
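# Sketch only: handle() and its arguments are illustrative names.
async def handle(request, prompt_tokens, prefix_match_tokens):
    # Route on uncached prefill length, not raw prompt length.
    uncached = prompt_tokens - prefix_match_tokens
    if remote_prefill_benefit(uncached, queues, topology):
        p = prefill_router.select_worker(model, topology)
        transfer_state = p.prefill_and_export(request)
        d = decode_router.select_compatible_worker(transfer_state)
        # The import must complete before decode reads the KV blocks.
        await d.import_kv_async(transfer_state)
    else:
        d = decode_router.select_worker(model)
        d.prefill_locally(request)
    return d.stream_decode(request)
Capacity planning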
Size P and D pools from input-token and output-token demand separately, then include transfer headroom. A prefill worker outage should not deadlock the service; decode workers need a bounded local-prefill fallback or explicit overload response.
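A first-pass sizing calculation might look like the sketch below. The per-worker throughput numbers and the single `headroom` multiplier (covering transfer overhead and bursts) are assumptions to be replaced with measured values.

```python
import math


def pool_sizes(
    input_tokens_per_s,
    output_tokens_per_s,
    prefill_tokens_per_s_per_worker,
    decode_tokens_per_s_per_worker,
    headroom=0.2,
):
    # Size each pool from its own demand; never return an empty pool.
    p = math.ceil(input_tokens_per_s * (1 + headroom) / prefill_tokens_per_s_per_worker)
    d = math.ceil(output_tokens_per_s * (1 + headroom) / decode_tokens_per_s_per_worker)
    return max(1, p), max(1, d)


# Illustrative numbers only.
p_workers, d_workers = pool_sizes(
    input_tokens_per_s=90_000,
    output_tokens_per_s=12_500,
    prefill_tokens_per_s_per_worker=60_000,
    decode_tokens_per_s_per_worker=2_600,
)
print(p_workers, d_workers)
```

The two results are independent on purpose: a change in prompt length moves only the P count, and a change in answer length or concurrency moves only the D count.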
Benchmarking without fooling yourself
- Benchmark inside the cluster with the intended transport.
- Sweep uncached prompt length and prefix-hit ratio.
- Measure aggregated and disaggregated goodput under identical SLOs.
- Fail P workers, transfer links, and discovery during active requests.
A production failure to design for
RDMA device resources disappear after a node upgrade and the backend falls back to TCP. Requests still succeed, but KV handoff rises from milliseconds to hundreds of milliseconds. Make transport mode and bandwidth part of readiness, not a debug log.
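A readiness gate for this failure could be as small as the sketch below. The `TransportProbe` record is hypothetical; how the active transport mode and bandwidth are actually measured depends on the transfer backend in use.

```python
from dataclasses import dataclass


@dataclass
class TransportProbe:
    mode: str                     # e.g. "rdma" or "tcp", as reported by the backend
    measured_gbytes_per_s: float  # result of a short in-cluster bandwidth probe


def transfer_ready(probe, required_mode="rdma", min_gbytes_per_s=10.0):
    # Fail readiness on a silent fallback or on degraded bandwidth.
    return probe.mode == required_mode and probe.measured_gbytes_per_s >= min_gbytes_per_s
```

Wiring this into the worker's readiness endpoint takes a node that has fallen back to TCP out of rotation, instead of leaving it to serve slow handoffs that still count as successes.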
How to read this diagram: The operating cycle moves from Classify to Route, then Transfer and Recover. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
Deeper engineering guide
Disaggregation places compute-heavy prefill and bandwidth-sensitive decode on separate worker pools. A prefill worker processes the prompt and produces KV state; that state is transferred or published so a decode worker can continue token generation without repeating prefill.
How to read this diagram: Follow the state from Route prefill through Build KV and Transfer to Decode. Each box is an ownership or computation boundary. In particular, the handoff is complete only when decode can read every required block. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.
The routing decision must include transfer time. A remote decode worker with an empty queue may lose to a colocated worker when the prompt produces gigabytes of KV. Estimate prefill queue + prefill compute + KV transfer + decode queue + decode service, then choose jointly rather than routing each phase independently.
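Joint selection can be sketched as a minimum over candidate pairs. The input dictionaries are hypothetical: each worker maps to its estimated queue plus service time, and each pair maps to an estimated transfer time.

```python
from itertools import product


def best_pair(prefill_cost_s, decode_cost_s, transfer_s):
    # prefill_cost_s: {p_name: prefill queue + prefill compute, in seconds}
    # decode_cost_s:  {d_name: decode queue + decode service, in seconds}
    # transfer_s:     {(p_name, d_name): KV transfer time, in seconds}
    def total(pair):
        p, d = pair
        return prefill_cost_s[p] + transfer_s[pair] + decode_cost_s[d]

    return min(product(prefill_cost_s, decode_cost_s), key=total)


# Illustrative numbers: the remote decode worker has the shorter queue,
# but the transfer to it dominates the end-to-end estimate.
pair = best_pair(
    prefill_cost_s={"p0": 1.0},
    decode_cost_s={"d_near": 0.5, "d_far": 0.1},
    transfer_s={("p0", "d_near"): 0.05, ("p0", "d_far"): 2.0},
)
print(pair)
```

Enumerating every pair is fine for small pools. Larger fleets would prune candidates by topology first and score only the survivors.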
How to read this diagram: The bars compare Colocated worker with Disaggregated pools on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (prefill and decode scale on different hardware ratios) remains larger than the risk (KV movement and coordination can exceed compute savings) under production traffic.
KV manifests carry model revision, tokenizer and adapter scope, layer layout, dtype, block order, positions, ownership, and integrity. Decode must reject incompatible state and fall back safely. Partial transfers remain private until an atomic readiness marker publishes the complete manifest.
How to read this diagram: State advances from Producing to Sealed, Transferring, and finally Adopted. The labels below each state identify what becomes true at that boundary. The governing invariant is: Failure before adoption leaves prefill responsible for cleanup; after adoption decode owns it. Retries and cancellation must preserve the same transition rules.
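The state progression and ownership rule can be captured in a few lines. This is a minimal in-memory sketch with hypothetical class and method names; a real implementation would persist state and coordinate across processes.

```python
from enum import Enum


class KVState(Enum):
    PRODUCING = 1
    SEALED = 2
    TRANSFERRING = 3
    ADOPTED = 4


_NEXT = {
    KVState.PRODUCING: KVState.SEALED,
    KVState.SEALED: KVState.TRANSFERRING,
    KVState.TRANSFERRING: KVState.ADOPTED,
}


class KVHandoff:
    def __init__(self, required_blocks):
        self.state = KVState.PRODUCING
        self.required = set(required_blocks)
        self.received = set()

    def receive(self, block_id):
        if self.state is not KVState.TRANSFERRING:
            raise ValueError("blocks may only arrive while transferring")
        if block_id not in self.required:
            raise ValueError(f"unexpected block {block_id!r}")
        self.received.add(block_id)  # set semantics make a resend idempotent

    def advance(self):
        nxt = _NEXT.get(self.state)
        if nxt is None:
            raise ValueError("already adopted")
        # Adoption is allowed only once decode can read every required block.
        if nxt is KVState.ADOPTED and self.received != self.required:
            raise ValueError("cannot adopt an incomplete transfer")
        self.state = nxt

    def cleanup_owner(self):
        # Before adoption prefill cleans up; after adoption decode owns the state.
        return "decode" if self.state is KVState.ADOPTED else "prefill"
```

Because adoption is the only transition that changes `cleanup_owner`, a retry or cancellation at any earlier point leaves responsibility unambiguous.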
How to read this diagram: The four panels are independent review axes: Placement, Transfer, Compatibility, and Recovery. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Trace one request identity across both pools and the transfer service.
How to read this diagram: This is a causal chain, not four unrelated symptoms. "Transfers slow" triggers "Decode starves", which in turn creates "Memory pins". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
The takeaway
Disaggregation is not a topology diagram; it is an SLO decision. It wins when phase specialization is worth more than the KV handoff costs.
