Prefill-Decode Disaggregation: Two Worker Pools, One Token Stream

#disaggregated-serving #prefill #decode #nvidia-dynamo #kv-transfer

Prefill and decode run the same model but behave like different workloads. Prefill favors large, compute-heavy matrix operations. Decode rereads weights and KV state on every step to produce one token per sequence, so it tends to be limited by memory bandwidth rather than compute. Disaggregation gives each phase its own worker pool and scaling plan.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A restaurant separates food preparation from table service. The prep kitchen handles large batches of ingredients; servers handle frequent small interactions. The handoff must be fast, or specialization makes the meal slower.

What actually happens

A prefill worker processes the prompt and produces KV cache state. Transfer metadata identifies blocks and endpoints. A decode worker acquires that state, continues autoregressive generation, and streams results through the frontend.

Independent pools let operators scale for input-token load and output-token concurrency separately. Prefill and decode can also use different tensor-parallel degrees or hardware shapes, provided KV layouts can be transferred or transformed correctly.

The transfer path is critical. Direct GPU-to-GPU movement over NVLink, or over RDMA on InfiniBand or RoCE, can keep the handoff cost below the compute time saved. TCP fallback, topology mistakes, mismatched block layouts, or serialized copies can erase the benefit.

A worked example

A retrieval workload sends 16,000-token prompts and generates 200-token answers. Prefill workers become compute-bound while decode workers need capacity for many active sequences. A split of two prefill workers and six decode workers (2P:6D) lets each pool scale on its own. Conditional routing should still keep a prompt local whenever its KV handoff would take longer than prefilling it on the decode worker, which is usually the case for short prompts.
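The handoff cost in this example can be estimated from the KV cache size and the link bandwidth. The sketch below is only a back-of-envelope calculation: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and both bandwidth figures are hypothetical values chosen for illustration, not numbers from this article.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV state for one sequence."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(num_bytes, gigabytes_per_second):
    """Ideal transfer time, ignoring setup latency and protocol overhead."""
    return num_bytes / (gigabytes_per_second * 1e9)


# Hypothetical model shape; substitute the real configuration.
size = kv_cache_bytes(tokens=16_000, layers=32, kv_heads=8,
                      head_dim=128, bytes_per_elem=2)

fast_link_s = transfer_seconds(size, 50.0)  # assumed fast fabric: about 42 ms
slow_link_s = transfer_seconds(size, 1.0)   # assumed slow fallback: about 2.1 s
```

With these assumed numbers the same 16,000-token handoff differs by a factor of 50 between the two links, which is why the transport has to be part of the routing decision.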

The performance model

Goodput means requests completed while meeting both the time-to-first-token (TTFT) and time-per-output-token (TPOT) SLOs. Disaggregation removes phase interference and uncouples parallelism, but adds routing, queueing, and KV transfer. Compare against an aggregated baseline under identical workloads and SLOs.
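Goodput as defined here is simple to compute from per-request measurements. This is a minimal sketch; the `RequestResult` record and its field names are assumptions made for the example, not the schema of any particular benchmark tool.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    # Hypothetical per-request record.
    ttft_s: float      # time to first token, seconds
    tpot_s: float      # mean time per output token, seconds
    completed: bool    # False for errors and timeouts


def goodput(results, window_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that completed AND met both SLOs."""
    good = sum(
        1 for r in results
        if r.completed and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / window_s
```

A request that misses either SLO counts as zero, so a configuration with higher raw throughput can still show lower goodput.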

Expert lens

Disaggregation should often be conditional. A short uncached prompt may be faster on the decode worker; a long prompt belongs in the prefill pool. Prefix overlap, prefill queue age, decode load, and topology can drive the decision.

In every case the optimization only moves cost: it changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

Long prompts with strict decode latency
Different prefill and decode scaling pressure
Clusters with fast topology-aware KV transfer

Where it disappoints

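Deployments that assume disaggregation always wins and skip the aggregated baseline
Benchmarks run through local port forwarding instead of the cluster fabric
Transports that silently fall back from RDMA to TCP
P and D workers whose model, dtype, TP, or block layout differ in ways the transfer path cannot reconcile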

Production checklist

Benchmark aggregated and disaggregated baselines
Validate transport inside the cluster
Route with topology and queue awareness
Check KV layout compatibility
Implement a graceful fallback to local prefill on decode workers

What to measure

Prefill queue and decode queue age
KV transfer latency and bandwidth
TTFT and TPOT SLO attainment
P-worker and D-worker utilization
Local versus remote prefill decisions

From one GPU to a production service

A demonstration starts one P and one D worker. Production needs discovery, compatible-pair selection, topology constraints, transfer authentication, draining, version skew protection, and independent autoscalers whose actions do not create queue oscillation.

Model rollout is a distributed transaction. P and D workers must agree on model revision, tokenizer, adapter state, KV dtype, head layout, block size, and transfer protocol. Route only within a compatibility generation.
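One way to express a compatibility generation is a fingerprint derived from the fields listed above. The following is a minimal sketch under that assumption; the `WorkerConfig` class, its field names, and the choice of SHA-256 over canonical JSON are illustrative and not taken from any specific serving framework.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class WorkerConfig:
    # Hypothetical field set mirroring the list in the text.
    model_revision: str
    tokenizer: str
    adapter_state: str
    kv_dtype: str
    head_layout: str
    block_size: int
    transfer_protocol: str


def compatibility_fingerprint(cfg):
    """Stable hash of every field that P and D workers must agree on."""
    # sort_keys gives a canonical encoding, so equal configs hash equally.
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


p_cfg = WorkerConfig("rev-1", "tok-1", "none", "fp16", "8x128", 16, "v1")
d_cfg = replace(p_cfg)                 # identical decode-side config
d_new = replace(p_cfg, block_size=32)  # decode worker from a newer rollout

same_generation = compatibility_fingerprint(p_cfg) == compatibility_fingerprint(d_cfg)
skewed = compatibility_fingerprint(p_cfg) != compatibility_fingerprint(d_new)
```

A router could then pair only workers that advertise the same fingerprint, which keeps a partial rollout from mixing generations.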

Failure policy should be phase-aware. Before KV export, another P worker can retry. During transfer, idempotent block identifiers help. After decode emits tokens, replay may duplicate output, so the stream needs a clear terminal error or resumable protocol.
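The three phases above map naturally onto a small decision table. This sketch uses hypothetical enum names and a hypothetical `resumable` flag; it shows the shape of a phase-aware policy, not the behavior of a specific runtime.

```python
from enum import Enum, auto


class Phase(Enum):
    BEFORE_KV_EXPORT = auto()
    DURING_TRANSFER = auto()
    AFTER_TOKENS_EMITTED = auto()


class Action(Enum):
    RETRY_ON_ANOTHER_P_WORKER = auto()
    RESEND_BLOCKS = auto()
    RESUME_STREAM = auto()
    TERMINAL_ERROR = auto()


def on_failure(phase, resumable=False):
    """Pick a recovery action based on how far the request has progressed."""
    if phase is Phase.BEFORE_KV_EXPORT:
        # Nothing has left the prefill worker yet, so a retry is safe.
        return Action.RETRY_ON_ANOTHER_P_WORKER
    if phase is Phase.DURING_TRANSFER:
        # Idempotent block identifiers make a resend harmless.
        return Action.RESEND_BLOCKS
    # Tokens already reached the client: a replay could duplicate output.
    return Action.RESUME_STREAM if resumable else Action.TERMINAL_ERROR
```

The important property is that the last branch never retries silently: the client either resumes from a known position or receives an explicit terminal error.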

Design-review questions

What compatibility fingerprint binds P and D workers?
Is transfer authenticated and tenant-scoped?
When does conditional routing choose local prefill?
How do independent autoscalers avoid oscillation?
What is retryable at each handoff state?

How it connects to the rest of the series

KV caching creates the state being transferred. Tensor parallelism can differ by phase. Chunked prefill is the simpler aggregated alternative, and streaming generation begins after decode admission.

From equation to implementation

The break-even test compares remote prefill queue plus prefill compute plus KV transfer against local prefill interference and compute. Prefix hits reduce the uncached tokens and can move a request back below the remote threshold. Routing should use uncached prefill length, not raw prompt length.
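The break-even test can be written down directly. In this sketch every cost parameter is a hypothetical, linear approximation (seconds per token plus fixed overheads); real systems would measure these values rather than assume them.

```python
def prefill_route(prompt_tokens, prefix_match_tokens, *,
                  remote_queue_s, kv_transfer_s, remote_s_per_token,
                  local_s_per_token, interference_s_per_token):
    """Return "remote" or "local" for one request."""
    # Route on uncached tokens, not raw prompt length.
    uncached = max(0, prompt_tokens - prefix_match_tokens)
    remote_cost = remote_queue_s + uncached * remote_s_per_token + kv_transfer_s
    local_cost = uncached * (local_s_per_token + interference_s_per_token)
    return "remote" if remote_cost < local_cost else "local"


# Hypothetical cost figures, for illustration only.
COSTS = dict(remote_queue_s=0.05, kv_transfer_s=0.04,
             remote_s_per_token=0.0001, local_s_per_token=0.0001,
             interference_s_per_token=0.00002)

cold = prefill_route(16_000, 0, **COSTS)       # nothing cached
warm = prefill_route(16_000, 15_000, **COSTS)  # 15,000 tokens hit the prefix cache
```

With these assumed costs the cold request goes remote and the warm one stays local, even though both have the same raw prompt length.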

KV transfer metadata must identify source blocks, destination layout, transport endpoint, and lifecycle. Different TP degrees may require reshaping KV heads or block layout. Transfer completion must be ordered before decode reads, while copies should overlap unrelated GPU work.
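A transfer descriptor can carry these fields and enforce the ordering rule in one place. The class below is a sketch with hypothetical names; the real descriptor format depends on the transfer library in use.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TransferState(Enum):
    PENDING = auto()
    COMPLETE = auto()
    RELEASED = auto()


@dataclass
class KVTransferDescriptor:
    # Hypothetical fields mirroring the text: blocks, layout, endpoint, lifecycle.
    request_id: str
    source_block_ids: tuple
    dest_layout: str
    endpoint: str
    state: TransferState = TransferState.PENDING

    def mark_complete(self):
        """Called once the transport reports that all blocks have landed."""
        self.state = TransferState.COMPLETE

    def blocks_for_decode(self):
        """Decode may read only after completion has been recorded."""
        if self.state is not TransferState.COMPLETE:
            raise RuntimeError("KV transfer not complete; decode must not read yet")
        return self.source_block_ids

    def release(self):
        """Source blocks can be freed once decode owns its copy."""
        self.state = TransferState.RELEASED
```

Keeping the completion check inside the descriptor means a scheduler bug surfaces as an explicit error rather than as decode reading partially written blocks.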

Implementation sketch

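async def route_request(request, model, prefill_router, decode_router, queues, topology):
    # Route on uncached tokens, not raw prompt length.
    uncached = request.prompt_tokens - request.prefix_match_tokens
    # remote_prefill_benefit is the break-even test described above.
    if remote_prefill_benefit(uncached, queues, topology):
        p = prefill_router.select_worker(model, topology)
        transfer_state = await p.prefill_and_export(request)
        d = decode_router.select_compatible_worker(transfer_state)
        # The import must finish before decode reads the KV blocks.
        await d.import_kv_async(transfer_state)
    else:
        d = decode_router.select_worker(model)
        await d.prefill_locally(request)
    return d.stream_decode(request)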

Capacity planning

Size P and D pools from input-token and output-token demand separately, then include transfer headroom. A prefill worker outage should not deadlock the service; decode workers need a bounded local-prefill fallback or explicit overload response.
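Sizing the two pools separately reduces to two independent divisions. The sketch below applies a single headroom factor to both pools as a simplification; the per-worker throughput figures and the request rate in the usage lines are hypothetical.

```python
import math


def pool_sizes(input_tokens_per_s, output_tokens_per_s,
               prefill_tokens_per_s_per_worker,
               decode_tokens_per_s_per_worker,
               headroom=0.2):
    """Return (prefill_workers, decode_workers) for the given demand."""
    scale = 1.0 + headroom
    p = math.ceil(input_tokens_per_s * scale / prefill_tokens_per_s_per_worker)
    d = math.ceil(output_tokens_per_s * scale / decode_tokens_per_s_per_worker)
    # Never plan for an empty pool.
    return max(p, 1), max(d, 1)


# Hypothetical load: 10 requests/s, 16,000 input and 200 output tokens each.
p_workers, d_workers = pool_sizes(
    input_tokens_per_s=10 * 16_000,
    output_tokens_per_s=10 * 200,
    prefill_tokens_per_s_per_worker=100_000,  # assumed
    decode_tokens_per_s_per_worker=450,       # assumed
)
```

Under these assumed throughputs the result is a 2P:6D split. Changing either per-worker figure moves only the corresponding pool, which is the point of sizing them independently.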

Benchmarking without fooling yourself

Benchmark inside the cluster with the intended transport.
Sweep uncached prompt length and prefix-hit ratio.
Measure aggregated and disaggregated goodput under identical SLOs.
Fail P workers, transfer links, and discovery during active requests.

A production failure to design for

RDMA device resources disappear after a node upgrade and the backend falls back to TCP. Requests still succeed, but KV handoff rises from milliseconds to hundreds of milliseconds. Make transport mode and bandwidth part of readiness, not a debug log.
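A readiness probe that includes the transport could look like the following. This is a sketch: how a worker discovers its active transport and measured bandwidth is backend-specific, so both are passed in as plain values, and the threshold is a hypothetical default.

```python
def transfer_ready(active_transport, measured_gbytes_per_s, *,
                   required_transport="rdma", min_gbytes_per_s=10.0):
    """Return (ready, reason) for the KV transfer path of one worker."""
    if active_transport != required_transport:
        return False, f"transport is {active_transport}, expected {required_transport}"
    if measured_gbytes_per_s < min_gbytes_per_s:
        return False, f"bandwidth {measured_gbytes_per_s} GB/s below {min_gbytes_per_s} GB/s"
    return True, "ok"
```

A worker that has fallen back to TCP then fails readiness and leaves the routing pool, instead of continuing to accept requests at degraded handoff latency.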

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Disaggregation is not a topology diagram; it is an SLO decision. It wins when phase specialization is worth more than the KV handoff costs.
