Prefill vs Decode: The Hidden Split That Shapes Every LLM Serving Architecture
LLM inference looks simple from the API boundary: prompt in -> tokens out.

That is the polite version. Inside the engine, the request actually goes through two very different jobs.
First, the model reads the prompt. That is prefill. Then it generates one token at a time. That is decode. They live in the same request, but they behave like two coworkers who share a calendar and absolutely should not share a desk.
Prefill likes big matrix operations. Decode likes memory bandwidth and low-latency repetition. Prefill can arrive as a giant 32K-token prompt and spike the system. Decode is usually smaller per step, but it repeats over and over while holding the KV cache. If you batch them carelessly, one phase annoys the other. In production, “annoys” is spelled P99.
The two phases
During prefill, the engine processes all input tokens and builds the KV cache. This phase is usually compute-heavy and benefits from batching large chunks of work.
During decode, the engine uses the cache to generate the next token, appends new KV entries, and repeats until it hits stop conditions. Decode is latency-sensitive because users experience it as token streaming. Nobody complains that the matrix multiply was elegant. They complain that the cursor stopped moving.
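To make the split concrete, here is a toy single-layer attention sketch in numpy. The function names, weights, and shapes are illustrative assumptions, not any engine's real API; the causal mask is omitted for brevity, and real engines fuse and page all of this.

```python
import numpy as np

def attention(q, k, v):
    # q: (Tq, d); k, v: (Tk, d). Softmax over keys, then weighted sum.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def prefill(x, wq, wk, wv):
    # One compute-heavy pass over ALL prompt tokens at once: big
    # (T, d) x (d, d) matmuls. The K/V projections become the KV cache.
    k, v = x @ wk, x @ wv
    out = attention(x @ wq, k, v)
    return out[-1], (k, v)          # last hidden state + KV cache

def decode_step(x_t, kv, wq, wk, wv):
    # One tiny pass per output token: a single query row reads the
    # whole cache (bandwidth-bound), then appends its own K/V entry.
    k = np.vstack([kv[0], x_t @ wk])
    v = np.vstack([kv[1], x_t @ wv])
    out = attention((x_t @ wq)[None, :], k, v)[0]
    return out, (k, v)
```

Prefill does the same algebra as decode, but over the whole prompt in one shot. That is the entire story: one phase is a matmul party, the other is a loop that rereads a growing cache once per token.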
This split explains why MLCommons calls out TTFT and TPOT for LLM inference. Time to first token is shaped heavily by prefill and queueing. Time per output token is the decode heartbeat.
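Both metrics are cheap to measure from the client side. A minimal sketch, assuming `stream` is any iterator that yields tokens as they arrive and produces at least one token:

```python
import time

def measure(stream):
    start = time.perf_counter()
    first = last = None
    n = 0
    for _ in stream:
        last = time.perf_counter()
        if first is None:
            first = last               # everything before this is TTFT
        n += 1
    ttft = first - start               # prefill + queueing, as felt by the user
    tpot = (last - first) / max(n - 1, 1)  # mean gap between later tokens
    return ttft, tpot
```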
Why mixing them hurts
Imagine a short chat request already decoding smoothly. Now a giant retrieval prompt lands in the same batch. The prefill job wants a big compute meal. The decode job wants tiny, frequent snacks. If the scheduler does not separate or carefully interleave them, decode tokens get delayed. Users see the stream freeze and start wondering if your product is thinking deep thoughts or just stuck.
Early serving systems colocated prefill and decode on the same workers because it is simpler. Simpler is good until the workload gets popular. Then the interaction between phases becomes the bottleneck.
Research systems like DistServe formalized this: disaggregating prefill and decode can reduce interference and let each phase use different resource allocation and parallelism strategies. The paper reports much higher goodput under latency constraints compared with colocated approaches across evaluated workloads. As always, paper numbers are workload-specific, but the architectural lesson is durable: the phases want different scheduling.
SARATHI explored a different angle: chunk prefill into smaller pieces and piggyback decode work with prefill chunks. That is a nice reminder that the answer is not always “split everything.” Sometimes the answer is “stop feeding the scheduler one enormous sandwich.”
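A minimal sketch of that idea, assuming a fixed per-iteration token budget and simple FIFO queues. The names and the budget are illustrative, not SARATHI's actual implementation:

```python
from collections import deque

TOKEN_BUDGET = 512  # max tokens processed per engine iteration (illustrative)

def build_batch(prefill_queue: deque, decode_queue: deque):
    batch, budget = [], TOKEN_BUDGET
    # Decode steps first: each costs one token and keeps streams moving.
    # (A real scheduler would also cap decode's share so prefill cannot starve.)
    for req in list(decode_queue):
        if budget == 0:
            break
        batch.append(("decode", req, 1))
        budget -= 1
    # Spend what is left on a chunk of the oldest pending prefill.
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining_prompt_tokens"])
        batch.append(("prefill", req, chunk))
        req["remaining_prompt_tokens"] -= chunk
        if req["remaining_prompt_tokens"] == 0:
            prefill_queue.popleft()   # prompt fully ingested;
            decode_queue.append(req)  # request moves to the decode phase
    return batch
```

The 32K-token sandwich becomes sixty-four bites of 512, and decode gets a snack between every bite.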
Aggregated vs disaggregated serving
There are three common shapes:
- Aggregated: the same worker handles prefill and decode.
- Chunked prefill: the same worker handles both, but the prefill work is sliced so decode does not starve.
- Disaggregated: separate prefill and decode workers, with KV cache transfer between them (sketched after this list).
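The third shape is easiest to see in code. In this hedged sketch, `Request`, `run_prefill`, `run_decode_step`, and the Python queue standing in for KV transport are all illustrative; real systems ship KV over NVLink or RDMA, not a `queue.Queue`:

```python
import queue
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int = 4
    output: list = field(default_factory=list)

def run_prefill(prompt: str) -> dict:
    return {"tokens": prompt.split()}   # stand-in for a real KV cache

def run_decode_step(kv: dict) -> str:
    return f"tok{len(kv['tokens'])}"    # stand-in for a real sampler

def prefill_worker(in_q: queue.Queue, kv_q: queue.Queue):
    # Provisioned for compute: big prompts are fine here, because no
    # decode stream is waiting behind them.
    while not in_q.empty():
        req = in_q.get()
        kv_q.put((req, run_prefill(req.prompt)))  # ship KV to decode side

def decode_worker(kv_q: queue.Queue):
    # Provisioned for memory bandwidth and latency: never pauses for a
    # 32K-token prompt that belongs to someone else.
    while not kv_q.empty():
        req, kv = kv_q.get()
        for _ in range(req.max_new_tokens):
            tok = run_decode_step(kv)
            kv["tokens"].append(tok)    # decode appends new KV entries
            req.output.append(tok)
```

The price of this shape is the middle arrow: the KV cache has to physically move, and the transfer has to be faster than just redoing the prefill.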
What NVIDIA Dynamo gets right
Dynamo is interesting because it treats this as a distributed systems problem, not just a kernel problem. It includes KV-aware routing, disaggregated serving support, an SLO planner, and data movement components for KV cache transfer and offload. That is the right vocabulary.
The unglamorous truth is that prefill/decode optimization is a whole-stack problem:
- The engine needs efficient attention kernels and batching.
- The runtime needs to know whether a request is prefill-heavy or decode-heavy.
- The router needs cache affinity (see the routing sketch after this list).
- The scheduler needs SLOs.
- The platform needs GPU placement and fast data movement.
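The router piece is small enough to sketch. Here is a toy prefix-affinity policy, assuming each worker reports its cached tokens and load; real KV-aware routers, Dynamo's included, track cached KV blocks rather than raw token lists:

```python
def common_prefix_len(a: list, b: list) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(prompt_tokens: list, workers: list):
    # workers: dicts with "cached_tokens" and "load" (illustrative schema).
    # Prefer the longest cached prefix; break ties on lightest load.
    def score(w):
        return (common_prefix_len(prompt_tokens, w["cached_tokens"]), -w["load"])
    return max(workers, key=score)
```

Every token of prefix hit is a token of prefill the winning worker does not have to redo.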
This is where NVIDIA has an unusually coherent story. TensorRT-LLM handles optimized execution. NIM packages deployment. Dynamo coordinates distributed inference. NVLink and high-bandwidth GPU memory give the software room to work. None of these pieces are magic alone. Together, they form a stack that understands the shape of the workload.
A quick diagnosis table
If the system is already in production, the symptoms usually tell you which phase is complaining:
| Symptom | Likely pressure point | First thing to inspect |
|---|---|---|
| Time to first token jumps on long prompts | Prefill queueing | input token histogram and batch policy |
| Stream starts fast then stutters | Decode scheduling | TPOT p95/p99 and batch interleaving |
| GPU memory fills while utilization looks modest | KV cache | active sequences and cache eviction |
| Small chats suffer during document uploads | Phase interference | chunked prefill or prefill isolation |
| Multi-turn sessions get expensive | Prefix reuse | cache hit rate and routing affinity |
This table is not a debugger. It is a way to stop blaming the entire stack when one phase is waving a tiny flag.
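For the KV cache row, a back-of-envelope size check is often enough to confirm the diagnosis. The parameters below are illustrative, roughly Llama-2-70B-shaped with grouped-query attention:

```python
def kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                   tokens=32_768, dtype_bytes=2):
    # 2 accounts for storing both K and V per layer.
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

gb = kv_cache_bytes() / 2**30
print(f"{gb:.1f} GiB per 32K-token sequence")  # 10.0 GiB at fp16
```

Ten gibibytes per long sequence, before the weights, is exactly how memory fills while the compute utilization chart looks innocent.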
Design advice
If you are building an inference service, start with these questions:
- What are your input and output token distributions? (A measurement sketch follows this list.)
- Is TTFT or TPOT more important for your product?
- Do you serve long prompts and short answers, or short prompts and long answers?
- Are multi-turn sessions common?
- Can your engine reuse prefix cache?
- Can your router observe cache locality?
- What happens to decode latency when a large prefill arrives?
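The first question is the cheapest to answer, and it informs all the others. A minimal sketch over request logs, with illustrative field names:

```python
import statistics

def summarize(lengths):
    qs = statistics.quantiles(sorted(lengths), n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98], "max": max(lengths)}

# Stand-in log records; in practice, read these from your gateway metrics.
requests = [{"input_tokens": 820, "output_tokens": 64},
            {"input_tokens": 31_500, "output_tokens": 12},
            {"input_tokens": 140, "output_tokens": 900}]

print("input ", summarize([r["input_tokens"] for r in requests]))
print("output", summarize([r["output_tokens"] for r in requests]))
```

Long inputs with short outputs means prefill dominates; short inputs with long outputs means decode does. The histogram decides which phase you are actually operating.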
Do not begin with “How many GPUs do we need?” Begin with “Which phase is hurting us?”
For many teams, aggregated vLLM or TensorRT-LLM is plenty. Add chunked prefill when long prompts start hurting streams. Move toward disaggregated serving when prefill/decode interference becomes the dominant limit and you have enough traffic to justify the operational complexity.
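As one concrete example, recent vLLM versions expose chunked prefill as an engine option. Flag names and defaults have shifted across releases, so treat this as a sketch and check your version's docs:

```python
from vllm import LLM

# enable_chunked_prefill slices long prompts; max_num_batched_tokens
# is the per-step token budget. Values here are illustrative starting
# points, not tuned recommendations.
llm = LLM(model="meta-llama/Llama-2-7b-hf",
          enable_chunked_prefill=True,
          max_num_batched_tokens=2048)
```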
The punchline is simple: prefill and decode are both inference, but they are not the same workload. Treating them as one thing is fine for the demo. Production will eventually notice, and production has a very dry sense of humor.
Sources and receipts
- DistServe paper: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving.
- SARATHI paper: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.
- NVIDIA Dynamo: Dynamo technical blog and architecture docs.
- MLCommons latency terminology: Llama 2 70B MLPerf Inference benchmark note.
- vLLM internals: Inside vLLM.