Chunked Prefill: How to Stop One Long Prompt from Freezing Everyone Else

#chunked-prefill #prefill #scheduler #llm-serving #latency

A 100,000-token prompt can occupy a GPU long enough to make every active stream stutter. Chunked prefill breaks that prompt into bounded pieces so decode iterations can keep making progress between chunks.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A delivery truck carrying one enormous order should not block the only loading dock all afternoon. Split the order into pallets and let urgent small shipments pass between them.

What actually happens

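Prefill runs the full forward pass over the prompt tokens and builds their KV state. It is typically compute-bound because the prompt is processed as large matrix operations. Decode produces one token per sequence per step, so it is usually memory-bandwidth-bound and latency-sensitive. An unbounded prefill scheduled alongside active decodes stretches the iteration they share and therefore disrupts inter-token latency.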

Chunked prefill caps how many prompt tokens are admitted in one iteration. The scheduler fills a per-iteration token budget, usually with decode work first (or in whatever order its policy dictates), then spends the remaining capacity on one or more prefill chunks.
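
A minimal sketch of that budget-filling step, assuming a decode-first policy and one budget unit per token. The names `PrefillRequest` and `plan_iteration` are made up for illustration and do not come from any particular engine.

```python
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    request_id: str
    remaining: int  # prompt tokens not yet prefilled


def plan_iteration(budget, num_decodes, waiting):
    """Plan one iteration: decodes first, then prefill chunks in queue order.

    Returns (decode_tokens, [(request_id, chunk_size), ...]).
    """
    decode_tokens = min(num_decodes, budget)  # one token per active sequence
    room = budget - decode_tokens
    chunks = []
    for req in waiting:
        if room == 0:
            break
        size = min(room, req.remaining)
        if size > 0:
            chunks.append((req.request_id, size))
            room -= size
    return decode_tokens, chunks
```

With a 4,096-token budget and 256 decodes, a short prompt and a long prompt can share the leftover 3,840 tokens: the short one finishes and the long one takes whatever remains.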

Partial prompt state must remain correct across chunks. Causal positions, attention metadata, KV block allocation, and prefix-cache matches must advance consistently. Cancellation must free the partially built cache.

A worked example

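Assume a scheduler budget of 4,096 tokens per iteration, with active decodes using 256 of those token positions. That leaves 4,096 - 256 = 3,840 tokens for prefill, so a 20,000-token prompt enters as chunks of up to 3,840 tokens (six iterations if decode load stays constant) instead of monopolizing one giant iteration. Existing streams continue to receive a decode step in every one of those iterations.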
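
The arithmetic can be checked with a few lines of Python. This is a sketch that assumes decode load stays fixed at 256 tokens for the whole prefill, which a real server would not guarantee; `chunk_schedule` is an illustrative name.

```python
def chunk_schedule(prompt_tokens, budget, decode_tokens):
    """List the prefill chunk sizes for one prompt under a fixed decode load."""
    room = budget - decode_tokens
    if room <= 0:
        raise ValueError("decode work leaves no room for prefill")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        size = min(room, remaining)
        chunks.append(size)
        remaining -= size
    return chunks


schedule = chunk_schedule(20_000, 4_096, 256)
print(len(schedule), schedule[0], schedule[-1])  # 6 3840 800
```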

The performance model

Smaller chunks protect time per output token (TPOT) but add scheduling rounds and may reduce prefill GEMM efficiency. Larger chunks improve prefill throughput but stall active streams for longer. Tune against both time to first token (TTFT) for new prompts and TPOT for active streams.
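
To make the trade-off concrete, here is a toy model. It assumes each iteration costs a fixed overhead plus a constant time per prefill token; both numbers in the usage are placeholders, not measurements, and the model ignores GEMM efficiency and the growing attention cost of later chunks.

```python
import math


def chunk_tradeoff(prompt_tokens, chunk_size, base_us, us_per_token):
    """Return (rounds, worst_stall_us, total_prefill_us) under a linear cost model."""
    rounds = math.ceil(prompt_tokens / chunk_size)
    # Longest single iteration an active stream has to wait through.
    worst_stall_us = base_us + us_per_token * min(chunk_size, prompt_tokens)
    # Every round pays the fixed overhead once; token cost is paid once overall.
    total_prefill_us = rounds * base_us + us_per_token * prompt_tokens
    return rounds, worst_stall_us, total_prefill_us


for size in (512, 2048, 4096):
    print(size, chunk_tradeoff(20_000, size, base_us=10_000, us_per_token=10))
```

Under these assumed costs, going from 4,096-token to 512-token chunks cuts the worst stall from about 51 ms to about 15 ms while total prefill time grows from 250 ms to 600 ms.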

Expert lens

A fixed chunk size is not always optimal. The scheduler can adapt to active decode load, prompt length, prefix hits, and SLO class. When decode load is low, larger chunks are efficient; under pressure, shrink them.
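
One simple way to express that adaptation is a linear interpolation between a large idle-time chunk and a small under-pressure chunk. This is only a sketch: the function name, the bounds, and the choice of decode count as the sole signal are assumptions, and a real policy would also consider prompt length, prefix hits, and SLO class.

```python
def adaptive_chunk_limit(active_decodes, max_decodes, min_chunk=512, max_chunk=8192):
    """Shrink the prefill chunk limit linearly as decode load rises."""
    if max_decodes <= 0:
        raise ValueError("max_decodes must be positive")
    load = min(max(active_decodes / max_decodes, 0.0), 1.0)  # clamp to [0, 1]
    return int(max_chunk - load * (max_chunk - min_chunk))


print(adaptive_chunk_limit(0, 64))   # 8192: idle, use the large chunk
print(adaptive_chunk_limit(32, 64))  # 4352: half loaded
print(adaptive_chunk_limit(64, 64))  # 512: saturated, use the small chunk
```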

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

Mixed long-prompt and streaming traffic
Continuous-batching engines
Services with strict TPOT SLOs

Where it disappoints

Choosing chunk size from one benchmark
Forgetting partial-KV cleanup on cancellation
Letting chunks violate priority policy
Reporting prefill throughput without decode interference

Production checklist

Set an explicit per-iteration token budget
Benchmark multiple prompt and decode mixes
Validate position and KV state across chunks
Prioritize decode within the SLO policy
Export chunk size and wait telemetry

What to measure

Chunk-size distribution
TTFT by prompt length
TPOT during concurrent prefill
Prefill iterations per request
Cancelled partial-KV bytes

From one GPU to a production service

A local benchmark has one token budget. A multi-class service may reserve decode tokens for interactive users while allowing batch prefills to consume leftovers. The policy should be explicit enough that operators can predict who slows down under pressure.
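
A reservation policy of this kind can be stated in a few lines. The sketch below assumes a single reserved pool for interactive decodes and treats everything else as leftover; `batch_prefill_room` and its parameters are hypothetical names.

```python
def batch_prefill_room(budget, interactive_decodes, reserved_interactive, other_decodes=0):
    """Tokens left for batch prefill after holding back the interactive reserve.

    The reserve is held even when fewer interactive decodes are active, so a
    burst of interactive traffic never has to wait for a batch chunk to finish.
    """
    held = max(interactive_decodes, reserved_interactive)
    return max(0, budget - held - other_decodes)


print(batch_prefill_room(4096, 100, 512))  # 3584: reserve dominates
print(batch_prefill_room(4096, 800, 512))  # 3296: actual load dominates
```

Because the rule is a single subtraction, an operator can read off who slows down first: batch prefill shrinks to zero before any interactive decode is delayed.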

Prefix caching changes chunk work. A 50,000-token prompt with a 45,000-token hit needs only 5,000 uncached tokens; chunking based on raw length wastes scheduling rounds. Compute chunks from the uncached suffix.
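
The difference in scheduling rounds is easy to quantify. The helper below is illustrative and assumes a fixed 3,840-token chunk.

```python
import math


def prefill_rounds(prompt_tokens, cached_prefix_tokens, chunk_size):
    """Return (uncached_tokens, rounds) when chunking only the uncached suffix."""
    uncached = max(0, prompt_tokens - cached_prefix_tokens)
    return uncached, math.ceil(uncached / chunk_size)


print(prefill_rounds(50_000, 45_000, 3_840))  # (5000, 2)
print(prefill_rounds(50_000, 0, 3_840))       # (50000, 14)
```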

Distributed replicas may choose different chunk sizes based on tensor-parallel (TP) degree and GPU generation. Have each replica advertise its scheduling profile so the router does not compare queue lengths without knowing each replica's service rate.

Design-review questions

Is chunk size based on uncached or total prompt tokens?
Which class owns decode-first priority?
How does chunk size change with active decode load?
Are partial KV blocks safe across cancellation?
Do replica profiles expose different prefill service rates?

How it connects to the rest of the series

Continuous batching provides the iteration loop. PagedAttention holds partial KV state. Prefill-decode disaggregation can remove prefill interference entirely when transfer cost is acceptable.

From equation to implementation

The scheduler’s token budget combines unlike work: one decode token for each active sequence and a chunk of many prefill tokens. Their compute cost is not identical, but a token budget is a tractable approximation. Advanced policies weight prefill and decode tokens differently.
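
A weighted budget might look like the sketch below. The weights are placeholders that would have to come from profiling a specific model and GPU; nothing here claims which kind of token is actually cheaper.

```python
def weighted_cost(decode_tokens, prefill_tokens, decode_weight, prefill_weight):
    """Budget units consumed by a mixed iteration."""
    return decode_weight * decode_tokens + prefill_weight * prefill_tokens


def max_prefill_tokens(budget, decode_tokens, decode_weight, prefill_weight):
    """Largest prefill chunk that fits after decode work is charged."""
    room = budget - decode_weight * decode_tokens
    if room <= 0:
        return 0
    return int(room // prefill_weight)


# Equal weights reproduce the plain token budget.
print(max_prefill_tokens(4096, 256, decode_weight=1.0, prefill_weight=1.0))  # 3840
# A placeholder weight of 0.5 per prefill token doubles the admissible chunk.
print(max_prefill_tokens(4096, 256, decode_weight=1.0, prefill_weight=0.5))  # 7680
```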

Chunk boundaries should align with KV block allocation where practical. If a chunk ends inside a cache block, partial-block handling must be correct and may affect prefix reuse. Position encodings and multimodal embeddings must advance exactly as in monolithic prefill.
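
Aligning a chunk boundary usually means rounding its end position down to a block multiple. The sketch below assumes a 16-token block and absolute token positions; both the block size and the function name are illustrative.

```python
def align_chunk(start, desired_size, remaining, block_size=16):
    """Return a chunk size whose end position falls on a KV block boundary.

    start: absolute position of the first token in this chunk.
    remaining: prompt tokens still to prefill, counted from start.
    """
    size = min(desired_size, remaining)
    if size == remaining:
        return size  # the final chunk ends wherever the prompt ends
    aligned_end = ((start + size) // block_size) * block_size
    if aligned_end <= start:
        return size  # too small to align without producing an empty chunk
    return aligned_end - start


print(align_chunk(0, 3000, 20_000))     # 2992: rounded down to 187 blocks
print(align_chunk(19_200, 3840, 800))   # 800: last chunk, no alignment needed
```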

Implementation sketch

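# Per-request loop. Every helper called here is a placeholder for engine-specific code.
prefilled = 0                        # prompt tokens already committed to the KV cache
total = len(prompt_tokens)
while prefilled < total:
    decode_cost = tokens_for_active_decodes()
    room = max(0, iteration_budget - decode_cost)
    size = min(room, adaptive_chunk_limit, total - prefilled)
    chunk = prompt_tokens[prefilled:prefilled + size]  # empty if decodes fill the budget
    if chunk:
        reserve_kv(chunk)            # allocate blocks before running the chunk
    run_iteration(active_decodes, chunk)
    if chunk:
        commit_partial_prefill_state(chunk)
    prefilled += size
admit_to_decode(request)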

Capacity planning

Choose a maximum chunk from the largest TPOT stall the service can tolerate, then verify that the resulting GEMM is still efficient. Keep a smaller emergency chunk for high decode pressure and a larger mode for idle periods.
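
Turning a stall tolerance into a chunk limit requires a cost model. The sketch below assumes the crudest one, a fixed per-iteration overhead plus a constant cost per prefill token, with integer microseconds to avoid rounding surprises. The numbers are placeholders; real per-token cost grows with context length, so measured values should replace them.

```python
def max_chunk_for_stall(stall_budget_us, base_iteration_us, us_per_prefill_token):
    """Largest chunk whose iteration stays within the tolerated stall."""
    headroom = stall_budget_us - base_iteration_us
    if headroom <= 0:
        return 0  # even a decode-only iteration exceeds the stall budget
    return headroom // us_per_prefill_token


# Placeholder costs: 50 ms tolerated stall, 10 ms overhead, 10 us per token.
print(max_chunk_for_stall(50_000, 10_000, 10))  # 4000
```

The result is only an upper bound; the GEMM-efficiency check described above still has to be done by benchmarking at that chunk size.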

Benchmarking without fooling yourself

Run long prefills concurrently with established streams.
Sweep chunk size and active decode count together.
Compare TTFT, TPOT, and total prompt throughput.
Test prefix-hit and multimodal prompt boundaries.

A production failure to design for

A request is cancelled after seven chunks, but only the final chunk’s blocks are released. Repeated cancellations leak most of the KV pool and eventually trigger preemption storms. Track all partial blocks under one request owner.
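
A minimal sketch of single-owner tracking, assuming a simple free-list block pool. `KVBlockPool` is an illustrative class, not an API from any serving engine.

```python
from collections import defaultdict


class KVBlockPool:
    """Free-list block pool that records every block under its request owner."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.owned = defaultdict(list)  # request_id -> block ids from all chunks

    def allocate(self, request_id, count):
        if count > len(self.free):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.pop() for _ in range(count)]
        self.owned[request_id].extend(blocks)
        return blocks

    def release_all(self, request_id):
        """Free every block the request holds, regardless of which chunk took it."""
        blocks = self.owned.pop(request_id, [])
        self.free.extend(blocks)
        return len(blocks)


pool = KVBlockPool(100)
for _ in range(7):              # seven chunks, ten blocks each
    pool.allocate("req-1", 10)
print(pool.release_all("req-1"), len(pool.free))  # 70 100
```

Cancellation then calls `release_all` once, and a second call for the same request is a harmless no-op.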

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Chunked prefill is traffic shaping inside the model server. It trades a little prefill efficiency for a much more predictable shared service.
