Prefill vs Decode: The Hidden Split That Shapes Every LLM Serving Architecture
LLM inference looks simple from the API boundary: prompt in -> tokens out.

That is the polite version. Inside the engine, the request actually goes through two very different jobs.
First, the model reads the prompt. That is prefill. Then it generates one token at a time. That is decode. They live in the same request, but they behave like two coworkers who share a calendar and absolutely should not share a desk.
Prefill likes big matrix operations. Decode likes memory bandwidth and low-latency repetition. Prefill can arrive as a giant 32K-token prompt and spike the system. Decode is usually smaller per step, but it repeats over and over while holding the KV cache. If you batch them carelessly, one phase annoys the other. In production, “annoys” is spelled P99.
The two phases
During prefill, the engine processes all input tokens and builds the KV cache. This phase is usually compute-heavy and benefits from batching large chunks of work.
During decode, the engine uses the cache to generate the next token, appends new KV entries, and repeats until it hits stop conditions. Decode is latency-sensitive because users experience it as token streaming. Nobody complains that the matrix multiply was elegant. They complain that the cursor stopped moving.
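To make the split concrete, here is a toy single-layer attention sketch in numpy. The function names, weights, and shapes are illustrative assumptions, not any engine's real API; the causal mask is omitted for brevity, and real engines fuse and page all of this.

```python
import numpy as np

def attention(q, k, v):
    # q: (Tq, d); k, v: (Tk, d). Softmax over keys, then weighted sum.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def prefill(x, wq, wk, wv):
    # One compute-heavy pass over ALL prompt tokens at once: big
    # (T, d) x (d, d) matmuls. The K/V projections become the KV cache.
    k, v = x @ wk, x @ wv
    out = attention(x @ wq, k, v)
    return out[-1], (k, v)          # last hidden state + KV cache

def decode_step(x_t, kv, wq, wk, wv):
    # One tiny pass per output token: a single query row reads the
    # whole cache (bandwidth-bound), then appends its own K/V entry.
    k = np.vstack([kv[0], x_t @ wk])
    v = np.vstack([kv[1], x_t @ wv])
    out = attention((x_t @ wq)[None, :], k, v)[0]
    return out, (k, v)
```

Prefill does the same algebra as decode, but over the whole prompt in one shot. That is the entire story: one phase is a matmul party, the other is a loop that rereads a growing cache once per token.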
This split explains why MLCommons calls out TTFT and TPOT for LLM inference. Time to first token is shaped heavily by prefill and queueing. Time per output token is the decode heartbeat.
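Both metrics are cheap to measure from the client side. A minimal sketch, assuming `stream` is any iterator that yields tokens as they arrive and produces at least one token:

```python
import time

def measure(stream):
    start = time.perf_counter()
    first = last = None
    n = 0
    for _ in stream:
        last = time.perf_counter()
        if first is None:
            first = last               # everything before this is TTFT
        n += 1
    ttft = first - start               # prefill + queueing, as felt by the user
    tpot = (last - first) / max(n - 1, 1)  # mean gap between later tokens
    return ttft, tpot
```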
Why mixing them hurts
Imagine a short chat request already decoding smoothly. Now a giant retrieval prompt lands in the same batch. The prefill job wants a big compute meal. The decode job wants tiny, frequent snacks. If the scheduler does not separate or carefully interleave them, decode tokens get delayed. Users see the stream freeze and start wondering if your product is thinking deep thoughts or just stuck.
Early serving systems colocated prefill and decode on the same workers because it is simpler. Simpler is good until the workload gets popular. Then the interaction between phases becomes the bottleneck.
Research systems like DistServe formalized this: disaggregating prefill and decode can reduce interference and let each phase use different resource allocation and parallelism strategies. The paper reports much higher goodput under latency constraints compared with colocated approaches across evaluated workloads. As always, paper numbers are workload-specific, but the architectural lesson is durable: the phases want different scheduling.
SARATHI explored a different angle: chunk prefill into smaller pieces and piggyback decode work with prefill chunks. That is a nice reminder that the answer is not always “split everything.” Sometimes the answer is “stop feeding the scheduler one enormous sandwich.”
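A minimal sketch of that idea, assuming a fixed per-iteration token budget and simple FIFO queues. The names and the budget are illustrative, not SARATHI's actual implementation:

```python
from collections import deque

TOKEN_BUDGET = 512  # max tokens processed per engine iteration (illustrative)

def build_batch(prefill_queue: deque, decode_queue: deque):
    batch, budget = [], TOKEN_BUDGET
    # Decode steps first: each costs one token and keeps streams moving.
    # (A real scheduler would also cap decode's share so prefill cannot starve.)
    for req in list(decode_queue):
        if budget == 0:
            break
        batch.append(("decode", req, 1))
        budget -= 1
    # Spend what is left on a chunk of the oldest pending prefill.
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req["remaining_prompt_tokens"])
        batch.append(("prefill", req, chunk))
        req["remaining_prompt_tokens"] -= chunk
        if req["remaining_prompt_tokens"] == 0:
            prefill_queue.popleft()   # prompt fully ingested;
            decode_queue.append(req)  # request moves to the decode phase
    return batch
```

The 32K-token sandwich becomes sixty-four bites of 512, and decode gets a snack between every bite.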
Aggregated vs disaggregated serving
There are three common shapes:
- Aggregated: the same worker handles prefill and decode.
- Chunked prefill: the same worker handles both, but the prefill work is sliced so decode does not starve.
- Disaggregated: separate prefill and decode workers, with KV cache transfer between them (sketched after this list).
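The third shape is easiest to see in code. In this hedged sketch, `Request`, `run_prefill`, `run_decode_step`, and the Python queue standing in for KV transport are all illustrative; real systems ship KV over NVLink or RDMA, not a `queue.Queue`:

```python
import queue
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int = 4
    output: list = field(default_factory=list)

def run_prefill(prompt: str) -> dict:
    return {"tokens": prompt.split()}   # stand-in for a real KV cache

def run_decode_step(kv: dict) -> str:
    return f"tok{len(kv['tokens'])}"    # stand-in for a real sampler

def prefill_worker(in_q: queue.Queue, kv_q: queue.Queue):
    # Provisioned for compute: big prompts are fine here, because no
    # decode stream is waiting behind them.
    while not in_q.empty():
        req = in_q.get()
        kv_q.put((req, run_prefill(req.prompt)))  # ship KV to decode side

def decode_worker(kv_q: queue.Queue):
    # Provisioned for memory bandwidth and latency: never pauses for a
    # 32K-token prompt that belongs to someone else.
    while not kv_q.empty():
        req, kv = kv_q.get()
        for _ in range(req.max_new_tokens):
            tok = run_decode_step(kv)
            kv["tokens"].append(tok)    # decode appends new KV entries
            req.output.append(tok)
```

The price of this shape is the middle arrow: the KV cache has to physically move, and the transfer has to be faster than just redoing the prefill.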
What NVIDIA Dynamo gets right
Dynamo is interesting because it treats this as a distributed systems problem, not just a kernel problem. It includes KV-aware routing, disaggregated serving support, an SLO planner, and data movement components for KV cache transfer and offload. That is the right vocabulary.
The unglamorous truth is that prefill/decode optimization is a whole-stack problem:
- The engine needs efficient attention kernels and batching.
- The runtime needs to know whether a request is prefill-heavy or decode-heavy.
- The router needs cache affinity (see the routing sketch after this list).
- The scheduler needs SLOs.
- The platform needs GPU placement and fast data movement.
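The router piece is small enough to sketch. Here is a toy prefix-affinity policy, assuming each worker reports its cached tokens and load; real KV-aware routers, Dynamo's included, track cached KV blocks rather than raw token lists:

```python
def common_prefix_len(a: list, b: list) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(prompt_tokens: list, workers: list):
    # workers: dicts with "cached_tokens" and "load" (illustrative schema).
    # Prefer the longest cached prefix; break ties on lightest load.
    def score(w):
        return (common_prefix_len(prompt_tokens, w["cached_tokens"]), -w["load"])
    return max(workers, key=score)
```

Every token of prefix hit is a token of prefill the winning worker does not have to redo.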
This is where NVIDIA has an unusually coherent story. TensorRT-LLM handles optimized execution. NIM packages deployment. Dynamo coordinates distributed inference. NVLink and high-bandwidth GPU memory give the software room to work. None of these pieces are magic alone. Together, they form a stack that understands the shape of the workload.
A quick diagnosis table
If the system is already in production, the symptoms usually tell you which phase is complaining:
| Symptom | Likely pressure point | First thing to inspect |
|---|---|---|
| Time to first token jumps on long prompts | Prefill queueing | input token histogram and batch policy |
| Stream starts fast then stutters | Decode scheduling | TPOT p95/p99 and batch interleaving |
| GPU memory fills while utilization looks modest | KV cache | active sequences and cache eviction |
| Small chats suffer during document uploads | Phase interference | chunked prefill or prefill isolation |
| Multi-turn sessions get expensive | Prefix reuse | cache hit rate and routing affinity |
This table is not a debugger. It is a way to stop blaming the entire stack when one phase is waving a tiny flag.
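For the KV cache row, a back-of-envelope size check is often enough to confirm the diagnosis. The parameters below are illustrative, roughly Llama-2-70B-shaped with grouped-query attention:

```python
def kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                   tokens=32_768, dtype_bytes=2):
    # 2 accounts for storing both K and V per layer.
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

gb = kv_cache_bytes() / 2**30
print(f"{gb:.1f} GiB per 32K-token sequence")  # 10.0 GiB at fp16
```

Ten gibibytes per long sequence, before the weights, is exactly how memory fills while the compute utilization chart looks innocent.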
Design advice
If you are building an inference service, start with these questions:
- What are your input and output token distributions? (A measurement sketch follows this list.)
- Is TTFT or TPOT more important for your product?
- Do you serve long prompts and short answers, or short prompts and long answers?
- Are multi-turn sessions common?
- Can your engine reuse prefix cache?
- Can your router observe cache locality?
- What happens to decode latency when a large prefill arrives?
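The first question is the cheapest to answer, and it informs all the others. A minimal sketch over request logs, with illustrative field names:

```python
import statistics

def summarize(lengths):
    qs = statistics.quantiles(sorted(lengths), n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98], "max": max(lengths)}

# Stand-in log records; in practice, read these from your gateway metrics.
requests = [{"input_tokens": 820, "output_tokens": 64},
            {"input_tokens": 31_500, "output_tokens": 12},
            {"input_tokens": 140, "output_tokens": 900}]

print("input ", summarize([r["input_tokens"] for r in requests]))
print("output", summarize([r["output_tokens"] for r in requests]))
```

Long inputs with short outputs means prefill dominates; short inputs with long outputs means decode does. The histogram decides which phase you are actually operating.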
Do not begin with “How many GPUs do we need?” Begin with “Which phase is hurting us?”
For many teams, aggregated vLLM or TensorRT-LLM is plenty. Add chunked prefill when long prompts start hurting streams. Move toward disaggregated serving when prefill/decode interference becomes the dominant limit and you have enough traffic to justify the operational complexity.
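As one concrete example, recent vLLM versions expose chunked prefill as an engine option. Flag names and defaults have shifted across releases, so treat this as a sketch and check your version's docs:

```python
from vllm import LLM

# enable_chunked_prefill slices long prompts; max_num_batched_tokens
# is the per-step token budget. Values here are illustrative starting
# points, not tuned recommendations.
llm = LLM(model="meta-llama/Llama-2-7b-hf",
          enable_chunked_prefill=True,
          max_num_batched_tokens=2048)
```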
The punchline is simple: prefill and decode are both inference, but they are not the same workload. Treating them as one thing is fine for the demo. Production will eventually notice, and production has a very dry sense of humor.
Sources and receipts
- DistServe paper: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving.
- SARATHI paper: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills.
- NVIDIA Dynamo: Dynamo technical blog and architecture docs.
- MLCommons latency terminology: Llama 2 70B MLPerf Inference benchmark note.
- vLLM internals: Inside vLLM.