Chunked Prefill: How to Stop One Long Prompt from Freezing Everyone Else
A 100,000-token prompt can occupy a GPU long enough to make every active stream stutter. Chunked prefill breaks that prompt into bounded pieces so decode iterations can keep making progress between chunks.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A delivery truck carrying one enormous order should not block the only loading dock all afternoon. Split the order into pallets and let urgent small shipments pass between them.
What actually happens
Prefill runs the forward pass over the prompt tokens and builds their KV state. It is typically compute-bound, because many tokens are processed at once in large matrix multiplications. Decode processes one new token per sequence per step, so it is usually memory-bandwidth-bound and latency-sensitive. Co-scheduling an unbounded prefill with active decodes can therefore disrupt inter-token latency.
Chunked prefill limits the prompt tokens admitted in one iteration. The scheduler typically fills a per-iteration token budget with decode work first (or in whatever order its policy dictates), then uses the remaining capacity for one or more prefill chunks.
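The budget split described above can be sketched as follows. This is a minimal illustration, not any engine's scheduler: `PrefillRequest` and `plan_iteration` are hypothetical names, and the sketch assumes a decode-first policy in which each active sequence costs exactly one token.

```python
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    """Hypothetical record for a prompt that is still being prefilled."""
    request_id: str
    remaining_tokens: int


def plan_iteration(num_active_decodes, waiting, token_budget):
    """Return (decode_tokens, [(request_id, chunk_tokens), ...]) for one iteration.

    Assumption: decode is served first and costs one token per active sequence.
    """
    decode_tokens = min(num_active_decodes, token_budget)
    room = token_budget - decode_tokens
    chunks = []
    for req in waiting:
        if room == 0:
            break
        take = min(room, req.remaining_tokens)
        if take > 0:
            chunks.append((req.request_id, take))
            room -= take
    return decode_tokens, chunks


decode_tokens, chunks = plan_iteration(
    num_active_decodes=256,
    waiting=[PrefillRequest("a", 3000), PrefillRequest("b", 20000)],
    token_budget=4096,
)
print(decode_tokens, chunks)  # 256 [('a', 3000), ('b', 840)]
```

The function only plans an iteration; a real scheduler would also update each request's remaining count and apply its own ordering policy to the waiting list.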
Partial prompt state must remain correct across chunks. Causal positions, attention metadata, KV block allocation, and prefix-cache matches must advance consistently. Cancellation must free the partially built cache.
A worked example
Assume a scheduler budget of 4,096 tokens per iteration, of which active decodes use 256. That leaves 3,840 tokens of room, so a 20,000-token prompt enters as chunks of about 3,840 tokens (six iterations if decode load stays constant) instead of monopolizing one giant iteration. Existing streams continue to receive a decode step in every one of those iterations.
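The arithmetic of this example can be checked with a few lines of code. `chunk_sizes` is a hypothetical helper, and it assumes the decode load stays fixed at 256 tokens for the whole prefill, which a live system would not guarantee.

```python
def chunk_sizes(prompt_tokens, token_budget, decode_tokens):
    """Split a prompt into per-iteration chunks, assuming constant decode load."""
    room = token_budget - decode_tokens
    if room <= 0:
        raise ValueError("decode load leaves no room for prefill")
    sizes = []
    remaining = prompt_tokens
    while remaining > 0:
        take = min(room, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


sizes = chunk_sizes(prompt_tokens=20_000, token_budget=4_096, decode_tokens=256)
print(sizes)       # [3840, 3840, 3840, 3840, 3840, 800]
print(len(sizes))  # 6
```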
The performance model
Smaller chunks protect time per output token (TPOT) for active streams, but they add scheduling rounds and may reduce prefill GEMM efficiency. Larger chunks improve prefill throughput but make each iteration longer, so active streams stall for longer. Tune against both time to first token (TTFT) for new prompts and TPOT for active streams.
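A toy cost model makes the trade-off concrete. Everything here is an assumption for illustration: the model charges a fixed overhead per iteration plus a linear per-token cost, and the constants (50 µs per token, 2,000 µs overhead) are made up rather than measured. It ignores both the GEMM efficiency loss of small chunks and the growth of attention cost with context length.

```python
import math


def prefill_tradeoff(prompt_tokens, chunk_tokens, per_token_us, overhead_us):
    """Toy model: return (rounds, worst_stall_us, total_prefill_us)."""
    rounds = math.ceil(prompt_tokens / chunk_tokens)
    # Longest single iteration that active decode streams must wait through.
    worst_stall_us = overhead_us + min(chunk_tokens, prompt_tokens) * per_token_us
    # Time until the whole prompt has been prefilled.
    total_prefill_us = rounds * overhead_us + prompt_tokens * per_token_us
    return rounds, worst_stall_us, total_prefill_us


for chunk in (512, 2048, 8192):
    print(chunk, prefill_tradeoff(20_000, chunk, per_token_us=50, overhead_us=2_000))
# 512 (40, 27600, 1080000)
# 2048 (10, 104400, 1020000)
# 8192 (3, 411600, 1006000)
```

Under these assumptions, going from 8,192 to 512 tokens per chunk cuts the worst stall by roughly 15x while adding about 7% to total prefill time. Real ratios depend on the hardware and model and have to be measured.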
Expert lens
A fixed chunk size is not always optimal. The scheduler can adapt to active decode load, prompt length, prefix hits, and SLO class. When decode load is low, larger chunks are efficient; under pressure, shrink them.
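One simple way to adapt is to interpolate between a large and a small chunk limit as decode load rises. The sketch below is an assumption, not a recommended policy: `adaptive_chunk_limit`, its default thresholds, and the linear interpolation are all hypothetical, and it leaves out prefix hits and SLO class for brevity.

```python
def adaptive_chunk_limit(active_decodes, uncached_prompt_tokens,
                         min_chunk=512, max_chunk=8192, busy_decodes=256):
    """Shrink the chunk limit linearly as decode load approaches `busy_decodes`."""
    load = min(active_decodes / busy_decodes, 1.0)  # 0.0 = idle, 1.0 = saturated
    limit = int(max_chunk - load * (max_chunk - min_chunk))
    # Never plan a chunk larger than the work that is actually left.
    return min(limit, uncached_prompt_tokens)


print(adaptive_chunk_limit(0, 100_000))    # 8192
print(adaptive_chunk_limit(128, 100_000))  # 4352
print(adaptive_chunk_limit(256, 100_000))  # 512
print(adaptive_chunk_limit(0, 300))        # 300
```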
Where it wins
- Mixed long-prompt and streaming traffic
- Continuous-batching engines
- Services with strict TPOT SLOs
Where it disappoints
- When chunk size is chosen from a single benchmark
- When partial-KV cleanup on cancellation is forgotten
- When chunks are allowed to violate priority policy
- When prefill throughput is reported without measuring decode interference
Production checklist
- Set an explicit per-iteration token budget
- Benchmark multiple prompt and decode mixes
- Validate position and KV state across chunks
- Prioritize decode within the SLO policy
- Export chunk size and wait telemetry
What to measure
- Chunk-size distribution
- TTFT by prompt length
- TPOT during concurrent prefill
- Prefill iterations per request
- Cancelled partial-KV bytes
From one GPU to a production service
A local benchmark has one token budget. A multi-class service may reserve decode tokens for interactive users while allowing batch prefills to consume leftovers. The policy should be explicit enough that operators can predict who slows down under pressure.
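One possible explicit policy, sketched under assumptions: a fixed slice of the budget is held back for interactive decode whether or not it is used, and batch prefill gets whatever is left. `split_budget` and its parameters are hypothetical names, and holding back unused reserve is a design choice that trades some throughput for a bounded iteration size.

```python
def split_budget(token_budget, interactive_decodes, reserved_decode_tokens):
    """Return how one iteration's tokens are split between the two classes."""
    interactive = min(interactive_decodes, token_budget)
    # Batch prefill may never eat into the reserve, even when it is idle.
    held_back = max(interactive, min(reserved_decode_tokens, token_budget))
    return {
        "interactive_decode": interactive,
        "batch_prefill": token_budget - held_back,
    }


print(split_budget(4096, interactive_decodes=100, reserved_decode_tokens=512))
# {'interactive_decode': 100, 'batch_prefill': 3584}
print(split_budget(4096, interactive_decodes=1000, reserved_decode_tokens=512))
# {'interactive_decode': 1000, 'batch_prefill': 3096}
```

With a rule this small, an operator can read off who slows down: batch prefill shrinks first, and interactive decode is only affected once it alone exceeds the whole budget.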
Prefix caching changes chunk work. A 50,000-token prompt with a 45,000-token hit needs only 5,000 uncached tokens; chunking based on raw length wastes scheduling rounds. Compute chunks from the uncached suffix.
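The difference is easy to quantify. The helper below is a hypothetical sketch that assumes a fixed chunk size of 3,840 tokens and counts iterations for the raw prompt versus the uncached suffix.

```python
import math


def prefill_rounds(prompt_tokens, cached_prefix_tokens, chunk_tokens):
    """Return (uncached_tokens, iterations needed to prefill them)."""
    uncached = max(0, prompt_tokens - cached_prefix_tokens)
    return uncached, math.ceil(uncached / chunk_tokens)


print(prefill_rounds(50_000, 45_000, 3_840))  # (5000, 2)
print(prefill_rounds(50_000, 0, 3_840))       # (50000, 14)
```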
Distributed replicas may choose different chunk sizes based on TP degree and GPU generation. Advertise the supported scheduling profile so the router does not compare queue lengths without understanding service rate.
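A router that knows each replica's service rate can compare estimated wait time instead of raw queue length. The sketch below is illustrative only: `ReplicaProfile`, its fields, and the throughput numbers are hypothetical, and a real profile would likely carry more than a single rate.

```python
from dataclasses import dataclass


@dataclass
class ReplicaProfile:
    """Hypothetical scheduling profile a replica advertises to the router."""
    name: str
    queued_prefill_tokens: int
    prefill_tokens_per_second: int


def estimated_wait_seconds(replica):
    """Queue depth divided by service rate, not queue depth alone."""
    return replica.queued_prefill_tokens / replica.prefill_tokens_per_second


def pick_replica(replicas):
    return min(replicas, key=estimated_wait_seconds)


replicas = [
    ReplicaProfile("older-gpu", queued_prefill_tokens=40_000, prefill_tokens_per_second=10_000),
    ReplicaProfile("newer-gpu", queued_prefill_tokens=60_000, prefill_tokens_per_second=30_000),
]
print(pick_replica(replicas).name)  # newer-gpu
```

The replica with the longer queue wins here because it drains three times faster, which is exactly the case a queue-length comparison gets wrong.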
Design-review questions
- Is chunk size based on uncached or total prompt tokens?
- Which class owns decode-first priority?
- How does chunk size change with active decode load?
- Are partial KV blocks safe across cancellation?
- Do replica profiles expose different prefill service rates?
How it connects to the rest of the series
Continuous batching provides the iteration loop. PagedAttention holds partial KV state. Prefill-decode disaggregation can remove prefill interference entirely when transfer cost is acceptable.
From equation to implementation
The scheduler’s token budget combines unlike work: one decode token for each active sequence and a chunk of many prefill tokens. Their compute cost is not identical, but a token budget is a tractable approximation. Advanced policies weight prefill and decode tokens differently.
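A weighted budget can be sketched by expressing the budget in abstract cost units rather than raw tokens. The function name and the weights below are assumptions for illustration; real weights would have to come from profiling.

```python
def max_prefill_tokens(budget_units, decode_tokens, decode_weight, prefill_weight):
    """Prefill tokens that fit after decode work is charged at its own weight."""
    room = budget_units - decode_tokens * decode_weight
    if room <= 0:
        return 0
    return int(room // prefill_weight)


# Equal weights reproduce the plain token budget.
print(max_prefill_tokens(4096, 256, decode_weight=1.0, prefill_weight=1.0))  # 3840
# Charging decode tokens double leaves less room for prefill.
print(max_prefill_tokens(4096, 256, decode_weight=2.0, prefill_weight=1.0))  # 3584
```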
Chunk boundaries should align with KV block allocation where practical. If a chunk ends inside a cache block, partial-block handling must be correct and may affect prefix reuse. Position encodings and multimodal embeddings must advance exactly as in monolithic prefill.
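Block alignment can be handled when the chunk size is chosen. The sketch below rounds a chunk's end position down to a block boundary unless the chunk finishes the prompt; `aligned_chunk` is a hypothetical helper and the block size of 16 tokens is an assumed example value.

```python
def aligned_chunk(tokens_done, remaining, limit, block_size=16):
    """Pick a chunk that ends on a KV block boundary where that is possible."""
    take = min(limit, remaining)
    if take == remaining:
        return take  # final chunk: the prompt may legitimately end mid-block
    end = tokens_done + take
    aligned_end = (end // block_size) * block_size
    if aligned_end <= tokens_done:
        return take  # limit is smaller than one block, so alignment is impossible
    return aligned_end - tokens_done


print(aligned_chunk(0, 20_000, 3_850))   # 3840
print(aligned_chunk(10, 1_000, 100))     # 86
print(aligned_chunk(3_840, 100, 3_850))  # 100
```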
Implementation sketch
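    remaining = prompt_tokens          # token count still to prefill
    while remaining > 0:
        decode_cost = tokens_for_active_decodes()
        room = max(0, iteration_budget - decode_cost)
        # chunk is a token count; it may be 0 when decode fills the budget
        chunk = min(remaining, room, adaptive_chunk_limit)
        if chunk > 0:
            reserve_kv(chunk)          # allocate KV blocks before running
        run_iteration(active_decodes, chunk)   # decodes advance even if chunk == 0
        if chunk > 0:
            commit_partial_prefill_state(chunk)
        remaining -= chunk
    admit_to_decode(request)
Capacity planning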
Choose a maximum chunk from the largest TPOT stall the service can tolerate, then verify that the resulting GEMM is still efficient. Keep a smaller emergency chunk for high decode pressure and a larger mode for idle periods.
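Working backwards from a stall budget can be sketched as below. It assumes a linear cost model (fixed overhead plus a per-token cost) with made-up constants; `max_chunk_for_stall` and `chunk_for_mode` are hypothetical names, and the resulting chunk still has to be checked for GEMM efficiency on real hardware.

```python
def max_chunk_for_stall(stall_budget_us, per_token_us, overhead_us):
    """Largest chunk whose iteration fits inside the tolerated stall."""
    usable = stall_budget_us - overhead_us
    return max(0, usable // per_token_us)


def chunk_for_mode(decode_pressure_high, max_chunk, emergency_chunk):
    """Use the smaller emergency chunk only while decode pressure is high."""
    if decode_pressure_high:
        return min(emergency_chunk, max_chunk)
    return max_chunk


limit = max_chunk_for_stall(stall_budget_us=100_000, per_token_us=50, overhead_us=2_000)
print(limit)                              # 1960
print(chunk_for_mode(True, limit, 512))   # 512
print(chunk_for_mode(False, limit, 512))  # 1960
```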
Benchmarking without fooling yourself
- Run long prefills concurrently with established streams.
- Sweep chunk size and active decode count together.
- Compare TTFT, TPOT, and total prompt throughput.
- Test prefix-hit and multimodal prompt boundaries.
A production failure to design for
A request is cancelled after seven chunks, but only the final chunk’s blocks are released. Repeated cancellations leak most of the KV pool and eventually trigger preemption storms. Track all partial blocks under one request owner.
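Tracking ownership per request can be sketched with a toy block pool. `KVBlockPool` is a hypothetical class for illustration, not an engine API; the point is that cancellation releases everything recorded under the request, not just the most recent chunk's blocks.

```python
class KVBlockPool:
    """Toy block pool that records which request owns each allocated block."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owned = {}  # request_id -> list of block ids

    def reserve(self, request_id, num_blocks):
        if num_blocks > len(self.free):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.pop() for _ in range(num_blocks)]
        self.owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def release_all(self, request_id):
        """Free every block the request ever reserved; safe to call twice."""
        blocks = self.owned.pop(request_id, [])
        self.free.extend(blocks)
        return len(blocks)


pool = KVBlockPool(num_blocks=100)
for _ in range(7):  # seven chunks, ten blocks each
    pool.reserve("req-1", 10)
released = pool.release_all("req-1")  # cancellation path
print(released, len(pool.free))  # 70 100
```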
The takeaway
Chunked prefill is traffic shaping inside the model server. It trades a little prefill efficiency for a much more predictable shared service.
