18/20 - Chunked Prefill: How to Stop One Long Prompt from Freezing Everyone Else

A 100,000-token prompt can occupy a GPU long enough to make every active stream stutter. Chunked prefill breaks that prompt into bounded pieces so decode iterations can keep making progress between chunks.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A delivery truck carrying one enormous order should not block the only loading dock all afternoon. Split the order into pallets and let urgent small shipments pass between them.

[Figure: Chunked prefill request path. 01 Long prompt: split into token chunks, track partial KV. 02 Mixed scheduler: fit decode plus chunk, respect token budget. 03 Completed prefill: join decode set, generate normally. Flow: input → transform → outcome.]
Follow the state and work from left to right.

How to read this diagram: Start with Long prompt, where the prompt is split into token chunks and partial KV state is tracked. In the middle stage, Mixed scheduler, each iteration fits decode work plus a chunk within the token budget. The final stage, Completed prefill, shows the observable result: the request joins the decode set and generates normally. The arrows describe dependency order, not necessarily separate services.

What actually happens

Prefill computes attention for prompt tokens and builds KV state. It is typically compute-heavy and can form large matrix operations. Decode is usually memory-bandwidth-heavy and latency-sensitive. Co-scheduling an unbounded prefill can therefore disrupt inter-token latency.

Chunked prefill limits the number of prompt tokens admitted in one iteration. The scheduler fills a per-iteration token budget with decode work first (or in whatever order the policy dictates), then uses the remaining capacity for one or more prefill chunks.

Partial prompt state must remain correct across chunks. Causal positions, attention metadata, KV block allocation, and prefix-cache matches must advance consistently. Cancellation must free the partially built cache.

A worked example

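Assume a scheduler budget of 4,096 tokens per iteration, of which active decode uses 256 token positions. That leaves 4,096 - 256 = 3,840 tokens for prefill, so a 20,000-token prompt enters as five chunks of 3,840 tokens plus a final chunk of 800 (assuming decode load stays constant) instead of monopolizing one giant iteration. Existing streams continue to receive decode steps while the prompt is being processed.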
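The arithmetic in this example can be reproduced with a short sketch. The function name `plan_chunks` is made up for illustration, and the sketch assumes the decode load stays at 256 tokens for every iteration, which a real scheduler would not guarantee.

```python
def plan_chunks(prompt_tokens: int, iteration_budget: int, decode_tokens: int) -> list[int]:
    """Return the prefill chunk sizes for one prompt under a fixed per-iteration budget."""
    room = iteration_budget - decode_tokens  # capacity left after decode work
    if room <= 0:
        raise ValueError("decode work already fills the iteration budget")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        size = min(room, remaining)
        chunks.append(size)
        remaining -= size
    return chunks


chunks = plan_chunks(prompt_tokens=20_000, iteration_budget=4_096, decode_tokens=256)
print(len(chunks), chunks[0], chunks[-1])  # 6 3840 800
```

Six iterations carry the prompt, and every one of them still has room for the 256 decode positions.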

The performance model

Smaller chunks protect TPOT but add more scheduling rounds and may reduce prefill GEMM efficiency. Larger chunks improve prefill throughput but create longer stalls. Tune against both TTFT for new prompts and TPOT for active streams.

[Figure: Where chunked prefill changes inference. Prefill: many prompt tokens in parallel, high arithmetic intensity; chunking splits long compute into bounded work. Decode: one new token per iteration, weight and KV bandwidth pressure; chunking protects active streams between chunks. Prove it with: chunk TTFT and decode inter-token gap. Deployment decision: size chunks from the decode SLO.]
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how chunked prefill changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

A fixed chunk size is not always optimal. The scheduler can adapt to active decode load, prompt length, prefix hits, and SLO class. When decode load is low, larger chunks are efficient; under pressure, shrink them.
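One way to picture adaptive sizing is a ceiling that shrinks linearly as decode load rises. This is a minimal sketch, not a recommended policy: the function name, the thresholds, and the linear shape are all assumptions, and it reacts only to active decode count, ignoring prompt length, prefix hits, and SLO class.

```python
def adaptive_chunk_limit(
    active_decodes: int,
    iteration_budget: int,
    min_chunk: int = 256,            # illustrative floor used under pressure
    max_chunk: int = 4096,           # illustrative ceiling used when idle
    pressure_threshold: int = 128,   # illustrative decode count treated as "pressure"
) -> int:
    """Pick a prefill chunk ceiling that shrinks as decode load grows."""
    room = max(0, iteration_budget - active_decodes)  # one token per active decode
    if active_decodes >= pressure_threshold:
        limit = min_chunk
    else:
        # Interpolate from max_chunk (no decodes) down to min_chunk (at the threshold).
        fraction = active_decodes / pressure_threshold
        limit = int(max_chunk - fraction * (max_chunk - min_chunk))
    return min(limit, room)


print(adaptive_chunk_limit(0, 4096))    # 4096
print(adaptive_chunk_limit(64, 4096))   # 2176
print(adaptive_chunk_limit(128, 4096))  # 256
```

A production policy would likely replace the linear ramp with something fitted to measured TPOT, but the shape of the decision is the same: large when idle, small under pressure.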

[Figure: Chunked prefill trade-off map. Baseline, monolithic prefill: one large prompt iteration, efficient large GEMM, can stall active decode, simple state transition. Optimized, chunked prefill: bounded prompt slices, more scheduler rounds, protects inter-token latency, partial state to manage. Measure both sides under the same workload.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, monolithic prefill, characterized by one large prompt iteration and an efficient large GEMM. The right panel applies chunked prefill, changing the cost profile to bounded prompt slices and more scheduler rounds. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Mixed long-prompt and streaming traffic
  • Continuous-batching engines
  • Services with strict TPOT SLOs

Where it disappoints

  • Choosing chunk size from one benchmark
  • Forgetting partial-KV cleanup on cancellation
  • Letting chunks violate priority policy
  • Reporting prefill throughput without decode interference

Production checklist

  • Set an explicit per-iteration token budget
  • Benchmark multiple prompt and decode mixes
  • Validate position and KV state across chunks
  • Prioritize decode within the SLO policy
  • Export chunk size and wait telemetry

What to measure

  • Chunk-size distribution
  • TTFT by prompt length
  • TPOT during concurrent prefill
  • Prefill iterations per request
  • Cancelled partial-KV bytes

From one GPU to a production service

A local benchmark has one token budget. A multi-class service may reserve decode tokens for interactive users while allowing batch prefills to consume leftovers. The policy should be explicit enough that operators can predict who slows down under pressure.
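A strict priority order is one simple way to make that policy predictable. The sketch below is an assumption about how such a policy could look, not a description of any particular engine; the class names and demand figures are invented for illustration.

```python
def allocate_budget(iteration_budget: int, demands: list[tuple[str, int]]) -> dict[str, int]:
    """Grant tokens to work classes in strict priority order (first entry = highest)."""
    remaining = iteration_budget
    grants = {}
    for name, demand in demands:
        grant = min(demand, remaining)  # lower classes only see leftovers
        grants[name] = grant
        remaining -= grant
    return grants


grants = allocate_budget(
    4096,
    [
        ("interactive_decode", 300),    # hypothetical class names and demands
        ("interactive_prefill", 2000),
        ("batch_prefill", 8000),
    ],
)
print(grants)
# {'interactive_decode': 300, 'interactive_prefill': 2000, 'batch_prefill': 1796}
```

With this ordering an operator can read off the answer directly: batch prefill is the class that slows down first when interactive demand grows.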

Prefix caching changes chunk work. A 50,000-token prompt with a 45,000-token hit needs only 5,000 uncached tokens; chunking based on raw length wastes scheduling rounds. Compute chunks from the uncached suffix.
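The difference in scheduling rounds is easy to quantify. This sketch borrows the 3,840-token chunk from the worked example as an assumed chunk limit; the helper name `prefill_rounds` is hypothetical.

```python
import math


def prefill_rounds(prompt_tokens: int, cached_prefix_tokens: int, chunk_limit: int) -> int:
    """Number of prefill iterations needed for the uncached suffix of a prompt."""
    uncached = max(0, prompt_tokens - cached_prefix_tokens)
    return math.ceil(uncached / chunk_limit)


print(prefill_rounds(50_000, 45_000, 3_840))  # 2  (chunking the 5,000 uncached tokens)
print(prefill_rounds(50_000, 0, 3_840))       # 14 (chunking the raw prompt length)
```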

Distributed replicas may choose different chunk sizes based on TP degree and GPU generation. Advertise the supported scheduling profile so the router does not compare queue lengths without understanding service rate.
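A router that knows each replica's service rate can compare estimated drain time instead of raw queue length. The profile fields and the numbers below are invented for illustration; a real profile would likely carry more than one rate.

```python
from dataclasses import dataclass


@dataclass
class ReplicaProfile:
    name: str
    queued_prefill_tokens: int    # uncached prompt tokens waiting on this replica
    prefill_tokens_per_s: float   # advertised prefill service rate (assumed field)


def estimated_wait_s(replica: ReplicaProfile) -> float:
    """Time to drain the queued prefill work at the advertised rate."""
    return replica.queued_prefill_tokens / replica.prefill_tokens_per_s


def pick_replica(replicas: list[ReplicaProfile]) -> ReplicaProfile:
    return min(replicas, key=estimated_wait_s)


replicas = [
    ReplicaProfile("a", queued_prefill_tokens=40_000, prefill_tokens_per_s=20_000.0),
    ReplicaProfile("b", queued_prefill_tokens=30_000, prefill_tokens_per_s=10_000.0),
]
print(pick_replica(replicas).name)  # a
```

Replica "b" has the shorter queue, but "a" drains its queue in 2 seconds against 3 for "b", so a queue-length comparison would pick the slower option.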

Design-review questions

  • Is chunk size based on uncached or total prompt tokens?
  • Which class owns decode-first priority?
  • How does chunk size change with active decode load?
  • Are partial KV blocks safe across cancellation?
  • Do replica profiles expose different prefill service rates?

How it connects to the rest of the series

Continuous batching provides the iteration loop. PagedAttention holds partial KV state. Prefill-decode disaggregation can remove prefill interference entirely when transfer cost is acceptable.

From equation to implementation

The scheduler’s token budget combines unlike work: one decode token for each active sequence and a chunk of many prefill tokens. Their compute cost is not identical, but a token budget is a tractable approximation. Advanced policies weight prefill and decode tokens differently.
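A weighted budget can be sketched in a few lines. The weights here are placeholders, not measurements: in practice they would be fitted from profiling on the deployed hardware, and the function names are hypothetical.

```python
def weighted_cost(decode_tokens: int, prefill_tokens: int,
                  decode_weight: float = 1.0, prefill_weight: float = 0.5) -> float:
    """Cost of one iteration in abstract budget units (weights are placeholders)."""
    return decode_tokens * decode_weight + prefill_tokens * prefill_weight


def max_prefill_tokens(budget_units: float, decode_tokens: int,
                       decode_weight: float = 1.0, prefill_weight: float = 0.5) -> int:
    """Largest prefill chunk that keeps the weighted cost within the budget."""
    room = budget_units - decode_tokens * decode_weight
    if room <= 0:
        return 0
    return int(room // prefill_weight)


print(max_prefill_tokens(4096, 256))                      # 7680 with the placeholder weights
print(max_prefill_tokens(4096, 256, prefill_weight=1.0))  # 3840, the unweighted case
```

Setting both weights to 1.0 recovers the plain token budget, which makes the weighted form a strict generalization of the simple policy.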

Chunk boundaries should align with KV block allocation where practical. If a chunk ends inside a cache block, partial-block handling must be correct and may affect prefix reuse. Position encodings and multimodal embeddings must advance exactly as in monolithic prefill.
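Block alignment amounts to rounding each chunk end down to a block multiple. The sketch assumes a block size of 16 tokens purely for illustration; the helper name and the fallback rule are assumptions as well.

```python
def aligned_chunk_end(start: int, max_chunk: int, prompt_len: int, block_size: int = 16) -> int:
    """Exclusive end offset of the next chunk, on a KV block boundary when possible."""
    end = min(start + max_chunk, prompt_len)
    if end == prompt_len:
        return end  # the final chunk may end inside a block
    aligned = (end // block_size) * block_size
    # Fall back to the unaligned end if rounding down would make no progress.
    return aligned if aligned > start else end


print(aligned_chunk_end(0, 1000, 5000))     # 992
print(aligned_chunk_end(992, 1000, 5000))   # 1984
print(aligned_chunk_end(4992, 1000, 5000))  # 5000
```

Only the last chunk of a prompt is allowed to leave a partially filled block, which keeps partial-block handling confined to one well-defined place.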

Implementation sketch

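# Sketch only: `engine` is a hypothetical object supplying the helpers named below.
def chunked_prefill(engine, request, prompt_tokens, iteration_budget):
    offset = 0  # global position of the next unprocessed prompt token
    while offset < len(prompt_tokens):
        decode_cost = engine.tokens_for_active_decodes()
        room = max(0, iteration_budget - decode_cost)
        size = min(room, engine.adaptive_chunk_limit(), len(prompt_tokens) - offset)
        chunk = prompt_tokens[offset:offset + size]  # empty when decode fills the budget
        if chunk:
            engine.reserve_kv(request, chunk)  # blocks are owned by the request
        # Positions continue from `offset`; an empty chunk yields a decode-only iteration.
        engine.run_iteration(engine.active_decodes, chunk, start_pos=offset)
        if chunk:
            engine.commit_partial_prefill_state(request, chunk)
        offset += size
    engine.admit_to_decode(request)  # only after the whole prompt has KV state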

Capacity planning

Choose a maximum chunk from the largest TPOT stall the service can tolerate, then verify that the resulting GEMM is still efficient. Keep a smaller emergency chunk for high decode pressure and a larger mode for idle periods.
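As a first approximation, the maximum chunk can be derived from a linear latency model: iteration time equals a fixed overhead plus a per-token cost. Both the model and the numbers below are assumptions for illustration. Attention cost also grows with the context already in the cache, so the per-token figure should be measured at the longest context the service expects.

```python
def max_chunk_for_stall(stall_budget_us: int, fixed_overhead_us: int, per_token_us: int) -> int:
    """Largest chunk whose modeled iteration time fits within the tolerated decode stall.

    Assumes iteration_time = fixed_overhead + per_token * chunk_tokens (integer microseconds).
    """
    usable = stall_budget_us - fixed_overhead_us
    return max(0, usable // per_token_us)


# Hypothetical numbers: 5 ms fixed overhead, 10 µs per prefill token.
idle_chunk = max_chunk_for_stall(stall_budget_us=50_000, fixed_overhead_us=5_000, per_token_us=10)
emergency_chunk = max_chunk_for_stall(stall_budget_us=20_000, fixed_overhead_us=5_000, per_token_us=10)
print(idle_chunk, emergency_chunk)  # 4500 1500
```

The model only produces a starting point; the resulting chunk sizes still need the GEMM-efficiency check described above.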

Benchmarking without fooling yourself

  • Run long prefills concurrently with established streams.
  • Sweep chunk size and active decode count together.
  • Compare TTFT, TPOT, and total prompt throughput.
  • Test prefix-hit and multimodal prompt boundaries.

A production failure to design for

A request is cancelled after seven chunks, but only the final chunk’s blocks are released. Repeated cancellations leak most of the KV pool and eventually trigger preemption storms. Track all partial blocks under one request owner.
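A single owner record per request makes that cleanup hard to get wrong. The `KVPool` class below is a toy allocator written for this illustration, not the allocator of any real engine.

```python
class KVPool:
    """Toy KV block pool that records every block under its owning request."""

    def __init__(self, total_blocks: int):
        self.free = list(range(total_blocks))
        self.owned: dict[str, list[int]] = {}

    def reserve(self, request_id: str, n_blocks: int) -> list[int]:
        if n_blocks > len(self.free):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.pop() for _ in range(n_blocks)]
        self.owned.setdefault(request_id, []).extend(blocks)  # accumulate across chunks
        return blocks

    def release_all(self, request_id: str) -> int:
        """Free every block the request ever reserved, not just the latest chunk's."""
        blocks = self.owned.pop(request_id, [])
        self.free.extend(blocks)
        return len(blocks)


pool = KVPool(total_blocks=100)
for _ in range(7):                  # seven chunks, ten blocks each
    pool.reserve("req-1", 10)
released = pool.release_all("req-1")  # cancellation path
print(released, len(pool.free))       # 70 100
```

Because the cancellation path never sees individual chunks, there is no way for it to release only the last one.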

[Figure: Operating loop. 1 Budget: decode first or policy, chunk ceiling. 2 Execute: partial positions, commit KV safely. 3 Adapt: shrink under pressure, grow when idle. 4 Clean: cancel all blocks, track TTFT and TPOT. Measure → learn → repeat.]
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Budget to Execute, then Adapt and Clean. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Chunked prefill divides a long prompt into bounded token segments and schedules those segments across iterations. KV state is appended in order, but active decode requests can run between chunks. The mechanism turns one monolithic latency spike into controlled, preemptible work.

[Figure: A long prompt enters in bounded chunks. Plan: tokenize full prompt, choose chunk boundary, reserve KV blocks. Prefill A: process first tokens, write ordered KV, yield to scheduler. Interleave: serve decode rows, admit urgent work, respect token budget. Prefill B: resume at next offset, complete prompt state, begin decode. Note: position IDs and attention visibility must remain identical to monolithic prefill.]
Chunking changes scheduling granularity without changing the model-visible sequence.

How to read this diagram: Follow the state from Plan through Prefill A and Interleave to Prefill B. Each box is an ownership or computation boundary. In particular, position IDs and attention visibility must remain identical to monolithic prefill. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Chunk size balances efficiency and responsiveness. Large chunks use efficient kernels but create longer non-preemptible intervals. Small chunks improve fairness but add launches, scheduler work, and boundary overhead. Adaptive sizing can use active decode count, queue deadlines, and GPU utilization.

[Figure: Chunk size trades kernel efficiency for preemption. Monolithic prefill: long blocking span. Bounded chunks: interleavable. Small-chunk cost: more launches and scheduler transitions reduce efficiency. Latency benefit: decode and urgent requests progress between chunks.]
Choose chunks from inter-token SLO and deployed kernel behavior.

How to read this diagram: The bars compare Monolithic prefill with Bounded chunks on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “Decode and urgent requests progress between chunks.”, remains larger than the risk, “More launches and scheduler transitions reduce efficiency.”, under production traffic.

Correctness requires global positions, causal visibility, and KV append order to match one-pass prefill. Multimodal embeddings, prefix-cache hits, packed prompts, and sliding windows create boundary cases. Compare logits after every chunk boundary against the monolithic reference.

[Figure: Chunked prompt lifecycle. Planned: offsets are fixed. Partial: some KV exists. Yielded: state remains owned. Complete: decode may begin. Invariant: a partially prefilled request is never visible as a decodable sequence.]
Explicit partial state prevents decode from attending to an incomplete prompt.

How to read this diagram: State advances from Planned to Partial, Yielded, and finally Complete. The labels below each state identify what becomes true at that boundary. The governing invariant is: a partially prefilled request is never visible as a decodable sequence. Retries and cancellation must preserve the same transition rules.
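The lifecycle can be enforced with an explicit transition table. This is one possible reading of the diagram, with two assumptions added: a Yielded request returns to Partial when it resumes its next chunk, and a Cancelled state exists as a terminal exit. All class and attribute names are illustrative.

```python
from enum import Enum, auto


class PrefillState(Enum):
    PLANNED = auto()
    PARTIAL = auto()
    YIELDED = auto()
    COMPLETE = auto()
    CANCELLED = auto()   # assumed terminal state, not shown in the diagram


ALLOWED = {
    PrefillState.PLANNED: {PrefillState.PARTIAL, PrefillState.CANCELLED},
    PrefillState.PARTIAL: {PrefillState.YIELDED, PrefillState.COMPLETE, PrefillState.CANCELLED},
    PrefillState.YIELDED: {PrefillState.PARTIAL, PrefillState.CANCELLED},  # resume or cancel
    PrefillState.COMPLETE: set(),
    PrefillState.CANCELLED: set(),
}


class PrefillRequest:
    def __init__(self):
        self.state = PrefillState.PLANNED

    def advance(self, new_state: PrefillState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    @property
    def decodable(self) -> bool:
        # The invariant: only a fully prefilled request may join the decode set.
        return self.state is PrefillState.COMPLETE


request = PrefillRequest()
request.advance(PrefillState.PARTIAL)
request.advance(PrefillState.YIELDED)
print(request.decodable)  # False
request.advance(PrefillState.PARTIAL)
request.advance(PrefillState.COMPLETE)
print(request.decodable)  # True
```

Keeping the decode-set check tied to a single property means retries and cancellation cannot accidentally expose a half-built prompt.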

[Figure: Four chunk-boundary concerns. Positions: global rotary offsets, no reset per chunk. Masks: causal visibility, sliding-window rules. Cache: prefix hit alignment, ordered KV append. Scheduling: yield and resume, cancellation cleanup. Note: golden tests should sweep every boundary around block and page sizes.]
Chunked prefill is correct only when partitioning is invisible to the model.

How to read this diagram: The four panels are independent review axes: Positions, Masks, Cache, and Scheduling. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Golden tests should sweep every boundary around block and page sizes.

[Figure: A position reset silently corrupts a long prompt. Chunk resumes: offset restarts at zero, tensor shapes still fit. KV appends: wrong rotary phase, no runtime error. Decode degrades: context retrieval fails, only long prompts suffer. Control: carry global offsets, compare boundary logits. Note: correctness tests need long prompts; short smoke tests never cross the faulty boundary.]
Partition-transparent numerical tests are mandatory before enabling chunking.

How to read this diagram: This is a causal chain, not four unrelated symptoms. A resumed chunk restarts its offset at zero, so KV is appended with the wrong rotary phase, and decode quality then degrades without any runtime error. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
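The position-reset failure can be reproduced with a toy causal attention written in plain Python. Nothing here resembles a real model: the 2-D rotation, the embedding, and every constant are stand-ins chosen so the check runs without dependencies. What carries over is the shape of the test, which compares chunked output against a one-pass reference for every chunk size.

```python
import math

THETA = 0.1  # illustrative rotation step per position


def rotate(vec, pos):
    """Toy 2-D rotary encoding applied at absolute position `pos`."""
    x, y = vec
    c, s = math.cos(pos * THETA), math.sin(pos * THETA)
    return (x * c - y * s, x * s + y * c)


def embed(token):
    return (1.0, token / 10.0)


def attend(query, keys, values):
    """Softmax attention of one query over all cached keys."""
    scores = [query[0] * k[0] + query[1] * k[1] for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def prefill(tokens, chunk_size, reset_positions=False):
    """Causal prefill in chunks; `reset_positions=True` injects the bug."""
    keys, values, outputs = [], [], []
    for start in range(0, len(tokens), chunk_size):
        for i, tok in enumerate(tokens[start:start + chunk_size]):
            pos = i if reset_positions else start + i  # global offset vs. per-chunk reset
            vec = rotate(embed(tok), pos)
            keys.append(vec)               # ordered KV append
            values.append(float(tok))
            outputs.append(attend(vec, keys, values))  # sees only earlier tokens and itself
    return outputs


tokens = list(range(1, 13))
reference = prefill(tokens, chunk_size=len(tokens))  # monolithic prefill

# Correct chunking is invisible to the model for every chunk size.
assert all(prefill(tokens, c) == reference for c in range(1, len(tokens) + 1))

# Resetting positions leaves the first chunk intact and corrupts everything after it.
broken = prefill(tokens, chunk_size=4, reset_positions=True)
assert broken[:4] == reference[:4]
assert broken != reference
```

The first chunk matching the reference is the point of the example: a test prompt shorter than one chunk would pass even with the bug present.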

The takeaway

Chunked prefill is traffic shaping inside the model server. It trades a little prefill efficiency for a much more predictable shared service.