18/20 - Chunked Prefill: How to Stop One Long Prompt from Freezing Everyone Else

#chunked-prefill #prefill #scheduler #llm-serving #latency

A 100,000-token prompt can occupy a GPU long enough to make every active stream stutter. Chunked prefill breaks that prompt into bounded pieces so decode iterations can keep making progress between chunks.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A delivery truck carrying one enormous order should not block the only loading dock all afternoon. Split the order into pallets and let urgent small shipments pass between them.

Follow the state and work from left to right.

Description: Start with Long prompt, which is split into token chunks. The middle stage, Mixed scheduler, fits decode work plus one chunk into each iteration. The final stage, Completed prefill, shows the observable result: the request joins the decode set. The arrows describe dependency order, not necessarily separate services.

What actually happens

Prefill computes attention for prompt tokens and builds KV state. It is typically compute-heavy and can form large matrix operations. Decode is usually memory-bandwidth-heavy and latency-sensitive. When both share an iteration, every decode token in that iteration waits for the prefill work to finish, so co-scheduling an unbounded prefill can disrupt inter-token latency.

Chunked prefill limits the prompt tokens admitted in one iteration. The scheduler fills a per-iteration token budget, usually with decode work first (or in whatever order the policy dictates), then uses the remaining capacity for one or more prefill chunks.
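The budget-filling step can be sketched in a few lines. This is a minimal illustration, not any engine's actual scheduler: `PrefillRequest` and `plan_iteration` are hypothetical names, and the sketch assumes a decode-first policy in which each active decode sequence costs one token of budget.

```python
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    """Hypothetical record for a request whose prompt is still being prefilled."""
    request_id: str
    remaining_tokens: int


def plan_iteration(token_budget, num_active_decodes, waiting_prefills):
    """Return (decode_tokens, [(request_id, chunk_len), ...]) for one iteration.

    Assumption: decode work is admitted first and costs one token per sequence;
    whatever budget is left is handed to prefill chunks in queue order.
    """
    decode_tokens = min(num_active_decodes, token_budget)
    room = token_budget - decode_tokens
    chunks = []
    for req in waiting_prefills:
        if room == 0:
            break
        chunk_len = min(room, req.remaining_tokens)
        if chunk_len > 0:
            chunks.append((req.request_id, chunk_len))
            room -= chunk_len
    return decode_tokens, chunks


decode_tokens, chunks = plan_iteration(
    token_budget=4096,
    num_active_decodes=256,
    waiting_prefills=[PrefillRequest("a", 3000), PrefillRequest("b", 20000)],
)
print(decode_tokens, chunks)
```

Here the short prompt fits whole and the long prompt receives only the leftover budget. A different policy could cap the chunk per request or reorder the queue by SLO class.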

Partial prompt state must remain correct across chunks. Causal positions, attention metadata, KV block allocation, and prefix-cache matches must advance consistently. Cancellation must free the partially built cache.

A worked example

Assume a scheduler budget of 4,096 tokens per iteration, of which active decodes use 256 token positions. That leaves 3,840 tokens per iteration for prefill. If decode load stays constant, a 20,000-token prompt enters as six chunks (five of 3,840 tokens and a final one of 800) instead of monopolizing one giant iteration. Existing streams continue to receive decode steps between chunks.
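The arithmetic of this example can be checked with a short sketch. `chunk_schedule` is a hypothetical helper, and it assumes the decode load stays fixed at 256 tokens for every iteration, which a live system would not guarantee.

```python
def chunk_schedule(prompt_tokens, iteration_budget, decode_tokens):
    """List the prefill chunk sizes when decode load is constant (an assumption)."""
    room = iteration_budget - decode_tokens
    if room <= 0:
        raise ValueError("decode work leaves no room for prefill")
    sizes = []
    remaining = prompt_tokens
    while remaining > 0:
        size = min(room, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


sizes = chunk_schedule(prompt_tokens=20_000, iteration_budget=4_096, decode_tokens=256)
print(len(sizes), sizes)
```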

The performance model

Smaller chunks protect time per output token (TPOT) but add more scheduling rounds and may reduce prefill GEMM efficiency. Larger chunks improve prefill throughput but create longer stalls for active streams. Tune against both time to first token (TTFT) for new prompts and TPOT for active streams.

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how Chunked prefill changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

A fixed chunk size is not always optimal. The scheduler can adapt to active decode load, prompt length, prefix hits, and SLO class. When decode load is low, larger chunks are efficient; under pressure, shrink them.
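One way to picture an adaptive limit is a simple interpolation between a large idle-period chunk and a small under-pressure chunk. The function below is a sketch under that assumption; `adaptive_chunk_limit`, its linear shape, and the default bounds are illustrative choices, not values taken from any particular engine.

```python
def adaptive_chunk_limit(active_decodes, max_decodes, min_chunk=512, max_chunk=8192):
    """Shrink the prefill chunk limit linearly as decode load rises.

    Assumption: decode pressure is summarized by active_decodes / max_decodes.
    A real policy might also use prompt length, prefix hits, and SLO class.
    """
    if max_decodes <= 0:
        raise ValueError("max_decodes must be positive")
    load = min(1.0, max(0.0, active_decodes / max_decodes))
    return int(max_chunk - load * (max_chunk - min_chunk))


for active in (0, 128, 256):
    print(active, adaptive_chunk_limit(active, max_decodes=256))
```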

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, Monolithic prefill, characterized by one large prompt iteration and an efficient large GEMM. The right panel applies Chunked prefill, changing the cost profile to bounded prompt slices and more scheduler rounds. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Mixed long-prompt and streaming traffic
Continuous-batching engines
Services with strict TPOT SLOs

Where it disappoints

Choosing chunk size from one benchmark
Forgetting partial-KV cleanup on cancellation
Letting chunks violate priority policy
Reporting prefill throughput without decode interference

Production checklist

Set an explicit per-iteration token budget
Benchmark multiple prompt and decode mixes
Validate position and KV state across chunks
Prioritize decode within the SLO policy
Export chunk size and wait telemetry

What to measure

Chunk-size distribution
TTFT by prompt length
TPOT during concurrent prefill
Prefill iterations per request
Cancelled partial-KV bytes

From one GPU to a production service

A local benchmark has one token budget. A multi-class service may reserve decode tokens for interactive users while allowing batch prefills to consume leftovers. The policy should be explicit enough that operators can predict who slows down under pressure.

Prefix caching changes chunk work. A 50,000-token prompt with a 45,000-token hit needs to prefill only its 5,000 uncached tokens; chunking based on raw length wastes scheduling rounds. Compute chunks from the uncached suffix.
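A small sketch makes the difference concrete. `plan_chunks` is a hypothetical helper that starts chunking at the end of the cached prefix; the 4,096-token chunk limit is an assumed value.

```python
def plan_chunks(prompt_len, cached_prefix_len, chunk_limit):
    """Return (start, end) token ranges to prefill, skipping the cached prefix."""
    if not 0 <= cached_prefix_len <= prompt_len:
        raise ValueError("cached prefix must lie within the prompt")
    if chunk_limit <= 0:
        raise ValueError("chunk_limit must be positive")
    ranges = []
    start = cached_prefix_len
    while start < prompt_len:
        end = min(start + chunk_limit, prompt_len)
        ranges.append((start, end))
        start = end
    return ranges


with_hit = plan_chunks(prompt_len=50_000, cached_prefix_len=45_000, chunk_limit=4_096)
without_hit = plan_chunks(prompt_len=50_000, cached_prefix_len=0, chunk_limit=4_096)
print(len(with_hit), len(without_hit))
```

With the hit taken into account the request needs two chunks; planned from raw length it would be scheduled for thirteen.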

Distributed replicas may choose different chunk sizes based on TP degree and GPU generation. Advertise the supported scheduling profile so the router does not compare queue lengths without understanding service rate.
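The routing point can be illustrated with a toy comparison. The `Replica` fields and the numbers below are hypothetical; the sketch assumes each replica advertises a prefill service rate that the router can divide into its queued work.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical advertised profile for one serving replica."""
    name: str
    queued_prefill_tokens: int
    prefill_tokens_per_s: float


def estimated_wait_s(replica):
    """Queued work divided by advertised service rate."""
    return replica.queued_prefill_tokens / replica.prefill_tokens_per_s


replicas = [
    Replica("fast", queued_prefill_tokens=40_000, prefill_tokens_per_s=20_000.0),
    Replica("slow", queued_prefill_tokens=30_000, prefill_tokens_per_s=10_000.0),
]

by_queue = min(replicas, key=lambda r: r.queued_prefill_tokens)
by_wait = min(replicas, key=estimated_wait_s)
print(by_queue.name, by_wait.name)
```

Comparing raw queue length picks the slower replica; dividing by service rate picks the one that will actually start the new prompt sooner.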

Design-review questions

Is chunk size based on uncached or total prompt tokens?
Which class owns decode-first priority?
How does chunk size change with active decode load?
Are partial KV blocks safe across cancellation?
Do replica profiles expose different prefill service rates?

How it connects to the rest of the series

Continuous batching provides the iteration loop. PagedAttention holds partial KV state. Prefill-decode disaggregation can remove prefill interference entirely when transfer cost is acceptable.

From equation to implementation

The scheduler’s token budget combines unlike work: one decode token for each active sequence and a chunk of many prefill tokens. Their compute cost is not identical, but a token budget is a tractable approximation. Advanced policies weight prefill and decode tokens differently.
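A weighted budget can be sketched as follows. The weights are placeholders chosen only to make the arithmetic visible; they are assumptions, not measured costs, and a real deployment would calibrate them from profiling. `max_prefill_tokens` is a hypothetical helper.

```python
def max_prefill_tokens(budget_units, decode_tokens, decode_weight=1.0, prefill_weight=0.25):
    """Prefill tokens that fit after decode work is charged at its own weight.

    Assumption: cost is linear in token count, with one weight per phase.
    """
    if prefill_weight <= 0:
        raise ValueError("prefill_weight must be positive")
    room = budget_units - decode_tokens * decode_weight
    if room <= 0:
        return 0
    return int(room // prefill_weight)


print(max_prefill_tokens(budget_units=1024, decode_tokens=256))
```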

Chunk boundaries should align with KV block allocation where practical. If a chunk ends inside a cache block, partial-block handling must be correct and may affect prefix reuse. Position encodings and multimodal embeddings must advance exactly as in monolithic prefill.
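Block alignment can be handled by rounding a chunk's end down to the nearest block boundary. The sketch below assumes a 16-token KV block, which is an illustrative value, and `aligned_chunk_end` is a hypothetical helper.

```python
def aligned_chunk_end(start, desired_len, prompt_len, block_size=16):
    """Pick a chunk end that falls on a KV block boundary where practical.

    The final chunk may end mid-block because the prompt itself does. If
    rounding down would make no progress, the unaligned end is kept and the
    caller must handle the partial block.
    """
    end = min(start + desired_len, prompt_len)
    if end == prompt_len:
        return end
    aligned = (end // block_size) * block_size
    return aligned if aligned > start else end


print(aligned_chunk_end(start=0, desired_len=1000, prompt_len=5000))
print(aligned_chunk_end(start=992, desired_len=1000, prompt_len=5000))
print(aligned_chunk_end(start=4992, desired_len=1000, prompt_len=5000))
```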

Implementation sketch

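offset = 0  # prompt tokens already prefilled
while offset < len(prompt_tokens):
    # Decode work is charged against the budget first.
    decode_cost = tokens_for_active_decodes()
    room = max(0, iteration_budget - decode_cost)
    # The chunk may be empty when decode fills the whole budget.
    n = min(room, adaptive_chunk_limit, len(prompt_tokens) - offset)
    chunk = prompt_tokens[offset:offset + n]
    reserve_kv(chunk)
    run_iteration(active_decodes, chunk)
    commit_partial_prefill_state(chunk)
    offset += n
# Only a fully prefilled request becomes decodable.
admit_to_decode(request)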

Capacity planning

Choose a maximum chunk from the largest TPOT stall the service can tolerate, then verify that the resulting GEMM is still efficient. Keep a smaller emergency chunk for high decode pressure and a larger mode for idle periods.
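As a rough sketch of that sizing step, assume iteration time grows linearly with chunk length. That linear model, the microsecond figures, and the helper names are all assumptions for illustration; profiled numbers from the deployed kernels should replace them.

```python
def max_chunk_for_stall(max_stall_us, fixed_overhead_us, us_per_prefill_token):
    """Largest chunk whose estimated iteration time fits the tolerated stall.

    Assumed model: time = fixed_overhead_us + us_per_prefill_token * chunk.
    """
    if us_per_prefill_token <= 0:
        raise ValueError("us_per_prefill_token must be positive")
    usable_us = max_stall_us - fixed_overhead_us
    return max(0, usable_us // us_per_prefill_token)


def chunk_modes(max_chunk, min_efficient_chunk, emergency_divisor=4):
    """Derive an idle-period chunk and a smaller emergency chunk."""
    emergency = max(min_efficient_chunk, max_chunk // emergency_divisor)
    return {
        "idle": max_chunk,
        "emergency": min(emergency, max_chunk),
        # False means the stall limit forces chunks below the efficient size.
        "gemm_efficient": max_chunk >= min_efficient_chunk,
    }


limit = max_chunk_for_stall(max_stall_us=50_000, fixed_overhead_us=10_000, us_per_prefill_token=10)
modes = chunk_modes(limit, min_efficient_chunk=512)
print(limit, modes)
```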

Benchmarking without fooling yourself

Run long prefills concurrently with established streams.
Sweep chunk size and active decode count together.
Compare TTFT, TPOT, and total prompt throughput.
Test prefix-hit and multimodal prompt boundaries.

A production failure to design for

A request is cancelled after seven chunks, but only the final chunk’s blocks are released. Repeated cancellations leak most of the KV pool and eventually trigger preemption storms. Track all partial blocks under one request owner.
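The single-owner rule can be sketched with a toy block pool. `KVBlockPool` is a hypothetical class, not a real allocator; the point is only that every block reserved for any chunk is recorded under the request, so cancellation releases all of them at once.

```python
class KVBlockPool:
    """Toy KV block pool that tracks every block under its owning request."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.owned = {}  # request_id -> block ids reserved across all chunks

    def reserve(self, request_id, num_blocks):
        if num_blocks > len(self.free_blocks):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.owned.setdefault(request_id, []).extend(blocks)
        return blocks

    def release_all(self, request_id):
        """Free every block the request ever reserved, not just the last chunk's."""
        blocks = self.owned.pop(request_id, [])
        self.free_blocks.extend(blocks)
        return len(blocks)


pool = KVBlockPool(num_blocks=100)
for _ in range(7):  # seven chunks, ten blocks each
    pool.reserve("req-1", 10)
freed = pool.release_all("req-1")  # cancellation path
print(freed, len(pool.free_blocks))
```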

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Budget to Execute, then Adapt and Clean. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Chunked prefill divides a long prompt into bounded token segments and schedules those segments across iterations. KV state is appended in order, but active decode requests can run between chunks. The mechanism turns one monolithic latency spike into controlled, preemptible work.

Chunking changes scheduling granularity without changing the model-visible sequence.

Description: Follow the state from Plan through Prefill A and Interleave to Prefill B. Each box is an ownership or computation boundary. In particular, position ids and attention visibility must remain identical to monolithic prefill. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Chunk size balances efficiency and responsiveness. Large chunks use efficient kernels but create longer non-preemptible intervals. Small chunks improve fairness but add launches, scheduler work, and boundary overhead. Adaptive sizing can use active decode count, queue deadlines, and GPU utilization.

Choose chunks from inter-token SLO and deployed kernel behavior.

Description: The bars compare Monolithic prefill with Bounded chunks on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain ("Decode and urgent requests progress between chunks") remains larger than the risk ("More launches and scheduler transitions reduce efficiency") under production traffic.

Correctness requires global positions, causal visibility, and KV append order to match one-pass prefill. Multimodal embeddings, prefix-cache hits, packed prompts, and sliding windows create boundary cases. Compare logits after every chunk boundary against the monolithic reference.
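The shape of such a comparison can be shown on a toy model. The sketch below is not a real engine test: it uses a single attention head where query, key, and value are all the raw token embedding (an assumption made to keep it short), and it checks that chunked causal prefill matches the one-pass result for several chunk sizes around a 16-token boundary.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query over the visible keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def prefill_chunk(chunk, start_pos, k_cache, v_cache):
    """Append the chunk's K/V, then attend using global causal positions."""
    k_cache.extend(chunk)
    v_cache.extend(chunk)
    outputs = []
    for i, q in enumerate(chunk):
        visible = start_pos + i + 1  # global position, not chunk-local
        outputs.append(attend(q, k_cache[:visible], v_cache[:visible]))
    return outputs


def prefill(tokens, chunk_size):
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, len(tokens), chunk_size):
        outputs += prefill_chunk(tokens[start:start + chunk_size], start, k_cache, v_cache)
    return outputs


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(33)]
reference = prefill(tokens, chunk_size=len(tokens))  # monolithic pass
worst = max(max_abs_diff(reference, prefill(tokens, c)) for c in (1, 4, 15, 16, 17, 32))
print(worst)
```

Replacing `start_pos + i + 1` with the chunk-local `i + 1` makes the comparison fail, which is the class of bug this kind of test is meant to catch. A production test would compare real logits within a numerical tolerance suited to the kernels in use.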

Explicit partial state prevents decode from attending to an incomplete prompt.

Description: State advances from Planned to Partial, Yielded, and finally Complete. The labels below each state identify what becomes true at that boundary. The governing invariant is: a partially prefilled request is never visible as a decodable sequence. Retries and cancellation must preserve the same transition rules.
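The invariant can be encoded as an explicit state machine. The transition table below is one plausible reading of the four states, assuming a request alternates between Partial (a chunk is running) and Yielded (waiting between chunks) until its last chunk completes; the names `PrefillState`, `advance`, and `is_decodable` are hypothetical.

```python
from enum import Enum


class PrefillState(Enum):
    PLANNED = "planned"
    PARTIAL = "partial"
    YIELDED = "yielded"
    COMPLETE = "complete"


# Assumed transitions: chunks alternate PARTIAL <-> YIELDED until the last one.
ALLOWED = {
    PrefillState.PLANNED: {PrefillState.PARTIAL},
    PrefillState.PARTIAL: {PrefillState.YIELDED, PrefillState.COMPLETE},
    PrefillState.YIELDED: {PrefillState.PARTIAL},
    PrefillState.COMPLETE: set(),
}


def advance(state, new_state):
    """Move to new_state, rejecting any transition the table does not allow."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state.name} -> {new_state.name}")
    return new_state


def is_decodable(state):
    """Only a fully prefilled request may join the decode set."""
    return state is PrefillState.COMPLETE


state = PrefillState.PLANNED
for nxt in (PrefillState.PARTIAL, PrefillState.YIELDED,
            PrefillState.PARTIAL, PrefillState.COMPLETE):
    assert not is_decodable(state)
    state = advance(state, nxt)
print(state.name, is_decodable(state))
```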

Chunked prefill is correct only when partitioning is invisible to the model.

Description: The four panels are independent review axes: Positions, Masks, Cache, and Scheduling. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Golden tests should sweep every boundary around block and page sizes.

Partition-transparent numerical tests are mandatory before enabling chunking.

Description: This is a causal chain, not four unrelated symptoms. The Chunk resumes step triggers KV appends, which in turn leads to Decode degrades. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Chunked prefill is traffic shaping inside the model server. It trades a little prefill efficiency for a much more predictable shared service.
