19/20 - Prefill-Decode Disaggregation: Two Worker Pools, One Token Stream

Prefill and decode run the same model but behave like different workloads. Prefill favors large compute-heavy matrix operations. Decode repeatedly reads weights and KV state for small token steps. Disaggregation gives each phase its own worker pool and scaling plan.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A restaurant separates food preparation from table service. The prep kitchen handles large batches of ingredients; servers handle frequent small interactions. The handoff must be fast, or specialization makes the meal slower.

[Diagram: mechanism flow for the prefill-decode disaggregation request path. (01) Frontend: route prompt to prefill, build KV blocks. (02) KV transfer: move or expose state, select decode worker. (03) Decode pool: generate token stream, return through frontend. Input → Transform → Outcome.]
Follow the state and work from left to right.

How to read this diagram: Start with the Frontend, which routes the prompt to a prefill worker. The middle stage, KV transfer, moves or exposes the resulting state. The final stage, the Decode pool, shows the observable result: a generated token stream. The arrows describe dependency order, not necessarily separate services.

What actually happens

A prefill worker processes the prompt and produces KV cache state. Transfer metadata identifies blocks and endpoints. A decode worker acquires that state, continues autoregressive generation, and streams results through the frontend.

Independent pools let operators scale for input-token load and output-token concurrency separately. Prefill and decode can also use different tensor-parallel degrees or hardware shapes, provided KV layouts can be transferred or transformed correctly.

The transfer path is critical. Direct GPU-to-GPU movement over RDMA, NVLink, InfiniBand, or RoCE can keep handoff time below the compute time that disaggregation saves. TCP fallback, topology mistakes, mismatched block layouts, or serialized copies can erase the benefit.

A worked example

A retrieval workload sends 16,000-token prompts and generates 200-token answers. Prefill workers become compute-bound, while decode workers need capacity for many active sequences. With a 2P:6D split (two prefill workers, six decode workers), each pool can scale independently. If a KV handoff would take longer than prefilling locally, which is typically the case for short prompts, conditional routing should keep that request local.
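
To put rough numbers on the handoff in this example, the sketch below estimates the KV cache size of one 16,000-token prompt and the ideal time to move it at two link speeds. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte KV dtype) and both bandwidth figures are illustrative assumptions, not values from this article or from any measurement.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV state for one sequence: keys and values for every layer."""
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V
    return tokens * per_token


def transfer_seconds(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Ideal transfer time, ignoring setup latency and contention."""
    return num_bytes / bandwidth_bytes_per_s


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(tokens=16_000, layers=32, kv_heads=8,
                      head_dim=128, dtype_bytes=2)
fast = transfer_seconds(size, 50e9)  # assumed fast fabric: 50 GB/s effective
slow = transfer_seconds(size, 1e9)   # assumed degraded path: 1 GB/s effective

print(f"KV size: {size / 1e9:.2f} GB")
print(f"fast link: {fast * 1e3:.0f} ms, slow link: {slow * 1e3:.0f} ms")
```

Under these assumptions one prompt carries about 2.1 GB of KV state, so the same handoff costs tens of milliseconds on the fast path and about two seconds on the slow one. Real transfers add setup latency and contention on top of this ideal figure.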

The performance model

Goodput means requests completed while meeting both TTFT and TPOT SLOs. Disaggregation removes phase interference and uncouples parallelism, but adds routing, queueing, and KV transfer. Compare against an aggregated baseline with identical workloads.
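
The goodput definition above can be sketched as a small counting function. The `RequestResult` record, its field names, and the SLO thresholds in the sample are hypothetical; the only rule taken from the text is that a request counts when it completes and meets both the TTFT and TPOT SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestResult:
    ttft_s: float    # time to first token
    tpot_s: float    # mean time per output token after the first
    completed: bool


def goodput(results: list[RequestResult], window_s: float,
            ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Requests per second that completed and met both SLOs."""
    good = sum(
        1 for r in results
        if r.completed and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / window_s


sample = [
    RequestResult(ttft_s=0.4, tpot_s=0.03, completed=True),   # meets both
    RequestResult(ttft_s=1.8, tpot_s=0.03, completed=True),   # misses TTFT
    RequestResult(ttft_s=0.5, tpot_s=0.09, completed=True),   # misses TPOT
    RequestResult(ttft_s=0.3, tpot_s=0.02, completed=False),  # did not finish
]
rate = goodput(sample, window_s=2.0, ttft_slo_s=1.0, tpot_slo_s=0.05)
```

Running the same function over aggregated and disaggregated runs of an identical workload gives a like-for-like comparison; throughput alone would count the three failing requests as successes.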

[Diagram: phase fit, where phase disaggregation changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; runs on compute-oriented workers. Decode: one new token per iteration; weight and KV bandwidth pressure; runs on bandwidth-oriented workers. Prove it with: TTFT, TPOT, handoff time, goodput. Deployment decision: place phases with transfer included.]
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how phase disaggregation changes prompt processing and TTFT; the right panel asks how it changes iterative generation and inter-token latency. The bottom row names the metrics that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Disaggregation should often be conditional. A short uncached prompt may be faster on the decode worker; a long prompt belongs in the prefill pool. Prefix overlap, prefill queue age, decode load, and topology can drive the decision.

[Diagram: trade-off map for prefill-decode disaggregation. Baseline, aggregated worker: prefill and decode colocated; no KV network handoff; coupled scaling; phase interference. Optimized, disaggregated pools: independent P and D scaling; KV state must move; phase-specific parallelism; more routing complexity. Measure both sides under the same workload.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, an aggregated worker, where prefill and decode are colocated and there is no KV network handoff. The right panel applies disaggregated pools, which changes the cost profile: the P and D pools scale independently, but KV state must move. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Long prompts with strict decode latency
  • Different prefill and decode scaling pressure
  • Clusters with fast topology-aware KV transfer

Where it disappoints

  • Assuming disaggregation always wins
  • Benchmarking through local port forwarding
  • Silently falling back from RDMA to TCP
  • Mismatching model, dtype, TP, or block layout

Production checklist

  • Benchmark aggregated and disaggregated baselines
  • Validate transport inside the cluster
  • Route with topology and queue awareness
  • Check KV layout compatibility
  • Implement a graceful fallback to local prefill on decode workers

What to measure

  • Prefill queue and decode queue age
  • KV transfer latency and bandwidth
  • TTFT and TPOT SLO attainment
  • P-worker and D-worker utilization
  • Local versus remote prefill decisions

From one GPU to a production service

A demonstration starts one P and one D worker. Production needs discovery, compatible-pair selection, topology constraints, transfer authentication, draining, version skew protection, and independent autoscalers whose actions do not create queue oscillation.

Model rollout is a distributed transaction. P and D workers must agree on model revision, tokenizer, adapter state, KV dtype, head layout, block size, and transfer protocol. Route only within a compatibility generation.
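
One way to make a "compatibility generation" concrete is a digest over the fields listed above. This is a minimal sketch, assuming strict equality on every field; the `WorkerCompat` type, its field values, and the use of SHA-256 over canonical JSON are illustrative choices, not a description of any particular serving framework.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class WorkerCompat:
    model_revision: str
    tokenizer: str
    adapter_state: str
    kv_dtype: str
    head_layout: str
    block_size: int
    transfer_protocol: str

    def fingerprint(self) -> str:
        """Stable digest over every field that must match for KV reuse."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_generation(p: WorkerCompat, d: WorkerCompat) -> bool:
    """Route a P/D pair only when both report the same fingerprint."""
    return p.fingerprint() == d.fingerprint()


# Hypothetical values for illustration.
p_worker = WorkerCompat("rev-a", "tok-1", "none", "fp16", "gqa-8x128", 16, "v1")
d_worker = replace(p_worker)                          # identical configuration
d_skewed = replace(p_worker, model_revision="rev-b")  # mid-rollout version skew
```

Deployments that run different tensor-parallel degrees per phase would need a compatibility relation that allows a known layout transform rather than strict equality on `head_layout`.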

Failure policy should be phase-aware. Before KV export, another P worker can retry. During transfer, idempotent block identifiers help. After decode emits tokens, replay may duplicate output, so the stream needs a clear terminal error or resumable protocol.
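
The phase-aware policy above can be written as a small decision table. The enum names and the `stream_is_resumable` flag are hypothetical; the mapping itself follows the three cases in the paragraph.

```python
from enum import Enum, auto


class HandoffPhase(Enum):
    BEFORE_KV_EXPORT = auto()
    DURING_TRANSFER = auto()
    AFTER_TOKENS_EMITTED = auto()


class Recovery(Enum):
    RETRY_ON_ANOTHER_PREFILL_WORKER = auto()
    RETRY_TRANSFER_WITH_SAME_BLOCK_IDS = auto()
    RESUME_STREAM = auto()
    TERMINAL_STREAM_ERROR = auto()


def recovery_for(phase: HandoffPhase, stream_is_resumable: bool) -> Recovery:
    """Pick a recovery action from the point at which the request failed."""
    if phase is HandoffPhase.BEFORE_KV_EXPORT:
        # Nothing has left the prefill pool yet, so a plain retry is safe.
        return Recovery.RETRY_ON_ANOTHER_PREFILL_WORKER
    if phase is HandoffPhase.DURING_TRANSFER:
        # Idempotent block identifiers make a repeated transfer harmless.
        return Recovery.RETRY_TRANSFER_WITH_SAME_BLOCK_IDS
    # Tokens already reached the client: a blind replay could duplicate output.
    if stream_is_resumable:
        return Recovery.RESUME_STREAM
    return Recovery.TERMINAL_STREAM_ERROR
```

Keeping the table explicit makes it easy to review which failures are silently retried and which must surface to the client.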

Design-review questions

  • What compatibility fingerprint binds P and D workers?
  • Is transfer authenticated and tenant-scoped?
  • When does conditional routing choose local prefill?
  • How do independent autoscalers avoid oscillation?
  • What is retryable at each handoff state?

How it connects to the rest of the series

KV caching creates the state being transferred. Tensor parallelism can differ by phase. Chunked prefill is the simpler aggregated alternative, and streaming generation begins after decode admission.

From equation to implementation

The break-even test compares remote prefill queue plus prefill compute plus KV transfer against local prefill interference and compute. Prefix hits reduce the uncached tokens and can move a request back below the remote threshold. Routing should use uncached prefill length, not raw prompt length.
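
The break-even comparison can be sketched with a toy linear cost model. Every constant in `CostModel` is a hypothetical placeholder, and treating KV transfer as proportional to uncached tokens is a simplifying assumption; a real router would use measured queue ages and transfer times.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical linear cost model; every constant is an assumption."""
    remote_queue_s: float = 0.05         # current wait in the prefill pool
    remote_tokens_per_s: float = 40_000  # dedicated prefill throughput
    local_tokens_per_s: float = 20_000   # prefill throughput on a decode worker
    interference_factor: float = 0.5     # decode stall per second of local prefill
    transfer_setup_s: float = 0.01
    kv_bytes_per_token: int = 131_072
    link_bytes_per_s: float = 50e9


def remote_cost_s(uncached: int, m: CostModel) -> float:
    """Remote prefill queue + prefill compute + KV transfer."""
    transfer = m.transfer_setup_s + uncached * m.kv_bytes_per_token / m.link_bytes_per_s
    return m.remote_queue_s + uncached / m.remote_tokens_per_s + transfer


def local_cost_s(uncached: int, m: CostModel) -> float:
    """Local prefill compute + the interference it imposes on decode."""
    compute = uncached / m.local_tokens_per_s
    return compute + m.interference_factor * compute


def prefer_remote(prompt_tokens: int, prefix_match_tokens: int, m: CostModel) -> bool:
    # Route on uncached prefill length, not raw prompt length.
    uncached = max(0, prompt_tokens - prefix_match_tokens)
    return remote_cost_s(uncached, m) < local_cost_s(uncached, m)


model = CostModel()
cold = prefer_remote(16_000, 0, model)       # no prefix hit
warm = prefer_remote(16_000, 15_000, model)  # most of the prompt is cached
```

With these placeholder constants the cold 16,000-token prompt goes remote, while the same prompt with a 15,000-token prefix hit stays local because the fixed queue and setup costs dominate the remaining 1,000 tokens.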

KV transfer metadata must identify source blocks, destination layout, transport endpoint, and lifecycle. Different TP degrees may require reshaping KV heads or block layout. Transfer completion must be ordered before decode reads, while copies should overlap unrelated GPU work.

Implementation sketch

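async def handle_request(request, prompt_tokens, prefix_match_tokens, model, queues, topology):
    # Route on uncached prefill work, not raw prompt length.
    uncached = prompt_tokens - prefix_match_tokens
    if remote_prefill_benefit(uncached, queues, topology):
        p = prefill_router.select_worker(model, topology)
        transfer_state = p.prefill_and_export(request)
        d = decode_router.select_compatible_worker(transfer_state)
        # The import must complete before decode reads any KV block.
        await d.import_kv_async(transfer_state)
    else:
        # Short or mostly cached prompts are prefilled on the decode worker.
        d = decode_router.select_worker(model)
        d.prefill_locally(request)
    return d.stream_decode(request)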

Capacity planning

Size P and D pools from input-token and output-token demand separately, then include transfer headroom. A prefill worker outage should not deadlock the service; decode workers need a bounded local-prefill fallback or explicit overload response.
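
A first-pass sizing calculation might look like the sketch below. The per-worker throughputs, the 70% target utilization, the KV bytes per token, and the fabric bandwidth are all hypothetical; the traffic shape reuses the 16,000-token prompt and 200-token answer from the worked example at an assumed five requests per second.

```python
import math


def pool_sizes(input_tokens_per_s: float, output_tokens_per_s: float,
               prefill_tokens_per_s_per_worker: float,
               decode_tokens_per_s_per_worker: float,
               target_utilization: float = 0.7) -> tuple[int, int]:
    """Size each pool from its own token demand, leaving utilization headroom."""
    p = math.ceil(input_tokens_per_s
                  / (prefill_tokens_per_s_per_worker * target_utilization))
    d = math.ceil(output_tokens_per_s
                  / (decode_tokens_per_s_per_worker * target_utilization))
    return p, d


def transfer_utilization(input_tokens_per_s: float, kv_bytes_per_token: int,
                         fabric_bytes_per_s: float) -> float:
    """Fraction of fabric bandwidth used if every prompt is prefilled remotely."""
    return input_tokens_per_s * kv_bytes_per_token / fabric_bytes_per_s


requests_per_s = 5  # assumed arrival rate
p_workers, d_workers = pool_sizes(
    input_tokens_per_s=requests_per_s * 16_000,
    output_tokens_per_s=requests_per_s * 200,
    prefill_tokens_per_s_per_worker=40_000,  # hypothetical
    decode_tokens_per_s_per_worker=250,      # hypothetical
)
fabric_load = transfer_utilization(requests_per_s * 16_000, 131_072, 50e9)
```

Here the result is three prefill workers and six decode workers, with the fabric about 21% utilized. The sizing ignores batching effects and burstiness, so it is a starting point for load tests rather than a final plan.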

Benchmarking without fooling yourself

  • Benchmark inside the cluster with the intended transport.
  • Sweep uncached prompt length and prefix-hit ratio.
  • Measure aggregated and disaggregated goodput under identical SLOs.
  • Fail P workers, transfer links, and discovery during active requests.

A production failure to design for

RDMA device resources disappear after a node upgrade and the backend falls back to TCP. Requests still succeed, but KV handoff rises from milliseconds to hundreds of milliseconds. Make transport mode and bandwidth part of readiness, not a debug log.
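
A readiness gate for this failure could be sketched as follows. The `TransportProbe` record and its fields are hypothetical; how the transport mode and bandwidth are actually probed depends on the transfer library in use and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportProbe:
    mode: str                    # e.g. "rdma" or "tcp", as reported by a probe
    measured_bytes_per_s: float  # result of a short in-cluster bandwidth test


def is_ready(probe: TransportProbe, required_mode: str,
             min_bytes_per_s: float) -> tuple[bool, str]:
    """Fail readiness when the transport silently degraded."""
    if probe.mode != required_mode:
        return False, f"transport is {probe.mode}, expected {required_mode}"
    if probe.measured_bytes_per_s < min_bytes_per_s:
        return False, (
            f"bandwidth {probe.measured_bytes_per_s:.3g} B/s "
            f"is below the required {min_bytes_per_s:.3g} B/s"
        )
    return True, "ok"


healthy = is_ready(TransportProbe("rdma", 40e9), "rdma", 20e9)
degraded = is_ready(TransportProbe("tcp", 1e9), "rdma", 20e9)
```

Wiring the result into the worker's readiness endpoint takes a degraded node out of routing instead of leaving the regression visible only in latency dashboards.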

[Diagram: operational loop. (1) Classify: uncached prefill work, SLO and queue state. (2) Route: topology-compatible P, compatible D layout. (3) Transfer: verify RDMA path, order completion. (4) Recover: local fallback, rebalance P and D. Measure → Learn → Repeat.]
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Classify to Route, then Transfer and Recover. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Disaggregation places compute-heavy prefill and bandwidth-sensitive decode on separate worker pools. A prefill worker processes the prompt and produces KV state; that state is transferred or published so a decode worker can continue token generation without repeating prefill.

[Diagram: a disaggregated request handoff. Route prefill: select compute worker, reserve decode target, attach request identity. Build KV: run prompt tokens, seal cache blocks, publish manifest. Transfer: move only required state, verify version and scope, respect deadline. Decode: adopt KV ownership, stream output tokens, release on finish. The handoff is complete only when decode can read every required block.]
Disaggregation inserts a distributed state-transfer protocol between two inference phases.

How to read this diagram: Follow the state from Route prefill through Build KV and Transfer to Decode. Each box is an ownership or computation boundary. In particular, the handoff is complete only when decode can read every required block. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

The routing decision must include transfer time. A remote decode worker with an empty queue may lose to a colocated worker when the prompt produces gigabytes of KV. Estimate prefill queue + prefill compute + KV transfer + decode queue + decode service, then choose jointly rather than routing each phase independently.
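
Joint selection can be sketched as a search over (prefill, decode) pairs that minimizes the summed estimate. The worker records, names, and timing values are hypothetical, and the exhaustive search is only practical for small candidate sets.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PWorker:
    name: str
    queue_s: float
    compute_s: float


@dataclass(frozen=True)
class DWorker:
    name: str
    queue_s: float
    service_s: float


def total_s(p: PWorker, d: DWorker, transfer_s: dict[tuple[str, str], float]) -> float:
    """Prefill queue + prefill compute + KV transfer + decode queue + decode service."""
    return p.queue_s + p.compute_s + transfer_s[(p.name, d.name)] + d.queue_s + d.service_s


def choose_pair(p_workers: list[PWorker], d_workers: list[DWorker],
                transfer_s: dict[tuple[str, str], float]) -> tuple[PWorker, DWorker]:
    """Pick the pair with the lowest end-to-end estimate."""
    return min(product(p_workers, d_workers),
               key=lambda pd: total_s(pd[0], pd[1], transfer_s))


p_pool = [PWorker("p0", queue_s=0.1, compute_s=0.4)]
d_pool = [
    DWorker("d_near", queue_s=0.2, service_s=4.0),  # busier, but on a fast link
    DWorker("d_far", queue_s=0.0, service_s=4.0),   # idle, but on a slow link
]
links = {("p0", "d_near"): 0.04, ("p0", "d_far"): 2.1}
best_p, best_d = choose_pair(p_pool, d_pool, links)
```

Routing each phase independently would pick the idle `d_far`; including the transfer term selects `d_near`, whose estimated total is roughly 4.7 s against 6.6 s.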

[Diagram: specialized pools gain efficiency but add a handoff. Colocated worker: no KV transfer. Disaggregated pools: phase specialization. Handoff tax: KV movement and coordination can exceed compute savings. Fleet gain: prefill and decode scale on different hardware ratios.]
Disaggregation wins only when specialization repays transfer and queue costs.

How to read this diagram: The bars compare a colocated worker with disaggregated pools on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (prefill and decode scale on different hardware ratios) remains larger than the risk (KV movement and coordination can exceed compute savings) under production traffic.

KV manifests carry model revision, tokenizer and adapter scope, layer layout, dtype, block order, positions, ownership, and integrity. Decode must reject incompatible state and fall back safely. Partial transfers remain private until an atomic readiness marker publishes the complete manifest.
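
The "private until published" rule can be illustrated with an in-memory staging area on the decode side. This is a simplified sketch: `KVManifest` carries only a subset of the fields listed above, the `Inbox` class is hypothetical, and real KV blocks would live in GPU memory rather than Python bytes.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class KVManifest:
    request_id: str
    compat_fingerprint: str          # model, tokenizer, layout, dtype scope
    block_ids: tuple[str, ...]       # in block order
    block_digests: tuple[str, ...]   # SHA-256 per block, same order


class Inbox:
    """Decode-side staging area; state stays private until publish succeeds."""

    def __init__(self, local_fingerprint: str):
        self._fingerprint = local_fingerprint
        self._staged = {}  # request_id -> {block_id: payload}
        self._ready = {}   # request_id -> manifest

    def stage_block(self, request_id: str, block_id: str, payload: bytes) -> None:
        self._staged.setdefault(request_id, {})[block_id] = payload

    def publish(self, manifest: KVManifest) -> bool:
        """Verify scope, completeness, and integrity, then mark ready."""
        if manifest.compat_fingerprint != self._fingerprint:
            return False  # incompatible state: reject so the caller can fall back
        blocks = self._staged.get(manifest.request_id, {})
        for block_id, digest in zip(manifest.block_ids, manifest.block_digests):
            payload = blocks.get(block_id)
            if payload is None or hashlib.sha256(payload).hexdigest() != digest:
                return False  # partial or corrupted transfer stays private
        self._ready[manifest.request_id] = manifest  # the readiness marker
        return True

    def is_ready(self, request_id: str) -> bool:
        return request_id in self._ready


payloads = {"b0": b"layer-block-0", "b1": b"layer-block-1"}
manifest = KVManifest(
    request_id="req-1",
    compat_fingerprint="gen-1",
    block_ids=("b0", "b1"),
    block_digests=tuple(hashlib.sha256(payloads[b]).hexdigest() for b in ("b0", "b1")),
)
inbox = Inbox(local_fingerprint="gen-1")
inbox.stage_block("req-1", "b0", payloads["b0"])
partial = inbox.publish(manifest)   # second block has not arrived yet
inbox.stage_block("req-1", "b1", payloads["b1"])
complete = inbox.publish(manifest)
```

Decode consults only `is_ready`, so it can never observe a half-transferred request.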

[Diagram: KV handoff state machine. Producing (prefill owns blocks) → Sealed (manifest immutable) → Transferring (decode not ready) → Adopted (decode owns state). Failure before adoption leaves prefill responsible for cleanup; after adoption decode owns it.]
Single-owner transitions prevent leaks and use of partial KV state.

How to read this diagram: State advances from Producing to Sealed, Transferring, and finally Adopted. The labels below each state identify what becomes true at that boundary. The governing invariant is: Failure before adoption leaves prefill responsible for cleanup; after adoption decode owns it. Retries and cancellation must preserve the same transition rules.
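
The transition rules can be enforced with a small state machine. The `FAILED` state and the `cleanup_owner` helper are additions for illustration; the four main states and the ownership invariant come from the description above.

```python
from enum import Enum, auto


class KVState(Enum):
    PRODUCING = auto()
    SEALED = auto()
    TRANSFERRING = auto()
    ADOPTED = auto()
    FAILED = auto()  # illustrative terminal state for pre-adoption failures


ALLOWED = {
    KVState.PRODUCING: {KVState.SEALED, KVState.FAILED},
    KVState.SEALED: {KVState.TRANSFERRING, KVState.FAILED},
    KVState.TRANSFERRING: {KVState.ADOPTED, KVState.FAILED},
    KVState.ADOPTED: set(),
    KVState.FAILED: set(),
}


class Handoff:
    def __init__(self) -> None:
        self.state = KVState.PRODUCING

    def advance(self, new_state: KVState) -> None:
        """Apply a transition, rejecting anything outside the allowed table."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def cleanup_owner(self) -> str:
        """Prefill owns the blocks until decode has adopted them."""
        return "decode" if self.state is KVState.ADOPTED else "prefill"


h = Handoff()
h.advance(KVState.SEALED)
h.advance(KVState.TRANSFERRING)
owner_mid_transfer = h.cleanup_owner()
h.advance(KVState.ADOPTED)
owner_after_adoption = h.cleanup_owner()
```

Because retries and cancellations go through the same `advance` call, a duplicate adoption or a skipped seal raises an error instead of leaking blocks.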

[Diagram: four disaggregation control planes. Placement: joint prefill/decode route, capacity reservations. Transfer: bandwidth and topology, backpressure. Compatibility: model and KV layout, security scope. Recovery: timeout and fallback, idempotent cleanup. Note: trace one request identity across both pools and the transfer service.]
Operationally, disaggregation is a distributed transaction on inference state.

How to read this diagram: The four panels are independent review axes: Placement, Transfer, Compatibility, and Recovery. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Trace one request identity across both pools and the transfer service.

[Diagram: a slow KV fabric strands both pools. Transfers slow: prefill finishes, blocks await movement. Decode starves: workers sit idle, TTFT rises. Memory pins: prefill cannot reclaim, admission closes. Control: bound transfer queue, route colocated fallback. Note: monitor bytes waiting and handoff age, not only GPU utilization.]
A healthy pair of GPU pools can still fail because the state plane is saturated.

How to read this diagram: This is a causal chain, not four unrelated symptoms. Slow transfers starve the decode pool, and the blocks still awaiting movement pin memory that prefill cannot reclaim. The Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
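
The control can be sketched as an admission check that bounds bytes waiting and handoff age, and routes to a colocated fallback when either limit is exceeded. The `TransferAdmission` class and its thresholds are hypothetical; timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import deque


class TransferAdmission:
    """Bound the transfer queue by bytes waiting and by age of the oldest handoff."""

    def __init__(self, max_bytes_waiting: int, max_handoff_age_s: float):
        self.max_bytes_waiting = max_bytes_waiting
        self.max_handoff_age_s = max_handoff_age_s
        self.bytes_waiting = 0
        self._pending = deque()  # (enqueued_at, kv_bytes), oldest first

    def handoff_age_s(self, now: float) -> float:
        """Age of the oldest handoff still waiting, or 0.0 when idle."""
        return now - self._pending[0][0] if self._pending else 0.0

    def route(self, kv_bytes: int, now: float) -> str:
        """Return "remote" to enqueue a transfer, or "colocated" to fall back."""
        over_bytes = self.bytes_waiting + kv_bytes > self.max_bytes_waiting
        over_age = self.handoff_age_s(now) > self.max_handoff_age_s
        if over_bytes or over_age:
            return "colocated"
        self._pending.append((now, kv_bytes))
        self.bytes_waiting += kv_bytes
        return "remote"

    def complete_oldest(self) -> None:
        """Record that the oldest pending transfer finished."""
        _, size = self._pending.popleft()
        self.bytes_waiting -= size


gb = 1_000_000_000
q = TransferAdmission(max_bytes_waiting=4 * gb, max_handoff_age_s=0.5)
first = q.route(2 * gb, now=0.0)
second = q.route(2 * gb, now=0.1)
third = q.route(1 * gb, now=0.2)   # would exceed the byte bound
q.complete_oldest()
fourth = q.route(1 * gb, now=1.0)  # oldest pending handoff is now 0.9 s old
```

The same two signals, bytes waiting and handoff age, are the ones the diagram recommends monitoring, so the thresholds can be tuned directly from production metrics.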

The takeaway

Disaggregation is not a topology diagram; it is an SLO decision. It wins when phase specialization is worth more than the KV handoff costs.