17/20 - Continuous Batching: The GPU Schedule That Never Stands Still

A static batch waits for its longest sequence. A continuous batch changes membership after every decoding iteration: finished requests leave, waiting requests enter, and the GPU keeps working on the largest useful set of active sequences.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A shuttle that waits for every passenger’s entire vacation is inefficient. Continuous batching behaves like a metro: riders leave at each stop and new riders board, while the train keeps moving.

[Diagram: Continuous batching request path. (1) Waiting requests: admission and token budget; join next iteration. (2) Decode scheduler: run one token step; remove finished work. (3) Active batch: changes every step; maintain GPU occupancy. Flow: input → transform → outcome.]
Follow the state and work from left to right.

How to read this diagram: Start with Waiting requests, where admission and the token budget decide which requests join the next iteration. The middle stage, Decode scheduler, runs one token step and removes finished work. The final stage, Active batch, shows the observable result: membership changes every step while GPU occupancy is maintained. The arrows describe dependency order, not necessarily separate services.

What actually happens

Iteration-level scheduling runs one model iteration for the current active set, updates each sequence, retires completed or cancelled requests, and admits new work before the next iteration. Request lifetimes no longer define fixed batch boundaries.

The scheduler budgets tokens and KV blocks, not merely request count. A long prompt can consume far more work than one decode token, and a nearly full KV pool may prevent admission even when compute appears idle.
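A minimal sketch of that dual budget, assuming a hypothetical block size of 16 tokens and illustrative names (`IterationBudget`, `try_admit`) that do not come from any particular engine. A request is admitted only when both the token budget and the KV pool can absorb it, and the function reports why it was refused.

```python
from dataclasses import dataclass

BLOCK_TOKENS = 16  # hypothetical KV block size, in tokens


@dataclass
class IterationBudget:
    tokens_left: int     # tokens still schedulable in this iteration
    kv_blocks_free: int  # unreserved blocks in the KV pool


def try_admit(prompt_tokens: int, budget: IterationBudget) -> str:
    """Admit a request if both budgets allow it; return the admission reason."""
    blocks = -(-prompt_tokens // BLOCK_TOKENS)  # ceiling division
    if prompt_tokens > budget.tokens_left:
        return "rejected: token budget"
    if blocks > budget.kv_blocks_free:
        return "rejected: kv blocks"
    budget.tokens_left -= prompt_tokens
    budget.kv_blocks_free -= blocks
    return "admitted"


budget = IterationBudget(tokens_left=512, kv_blocks_free=4)
print(try_admit(100, budget))  # needs 7 blocks, only 4 are free
print(try_admit(60, budget))   # needs 4 blocks and 60 tokens
print(try_admit(1, budget))    # tokens remain, but the KV pool is empty
```

The last call is the case the paragraph warns about: 452 tokens of compute budget are still unused, yet admission fails because no KV block is free.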

Preemption becomes possible when high-priority work arrives or memory is exhausted. The engine may swap KV state, recompute evicted prefixes, or pause lower-priority sequences. Each choice changes latency and bandwidth.

A worked example

Four sequences start together with output lengths 8, 20, 60, and 100. A static batch carries empty slots after the short sequences finish. Continuous batching replaces each completed sequence with a waiting request at the next iteration, keeping the effective batch dense.
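The arithmetic behind this example can be checked with a toy simulation. The sketch below assumes four slots, one token per sequence per iteration, and no prefill cost; the backlog lengths 92, 80, and 40 are invented for illustration so that every refilled slot stays busy.

```python
from collections import deque


def simulate(lengths, slots, continuous):
    """Return slot utilization: busy slot-steps divided by total slot-steps."""
    waiting = deque(lengths)
    active = []  # remaining output tokens for each running sequence
    busy = total = 0
    while waiting or active:
        # A static batch admits only after the previous batch has fully drained.
        if continuous or not active:
            while waiting and len(active) < slots:
                active.append(waiting.popleft())
        busy += len(active)
        total += slots
        # Each running sequence emits one token; finished sequences leave.
        active = [r - 1 for r in active if r > 1]
    return busy / total


first_four = [8, 20, 60, 100]
with_backlog = first_four + [92, 80, 40]  # hypothetical waiting requests

print(simulate(first_four, 4, continuous=False))   # 188 useful of 400 slot-steps
print(simulate(with_backlog, 4, continuous=False))
print(simulate(with_backlog, 4, continuous=True))
```

With only the four original sequences, the static batch uses 188 of 400 slot-steps (47%). Continuous batching helps only when there is waiting work to refill the freed slots, which is why the backlog matters in the comparison.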

The performance model

Throughput improves by reducing idle batch slots. Tail latency depends on admission, fairness, maximum active tokens, and preemption. A throughput-maximizing scheduler can starve large or low-priority requests without explicit policy.

[Diagram: Where continuous batching changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; the scheduler fits prompt chunks into budgets. Decode: one new token per iteration; weight and KV bandwidth pressure; the scheduler refills active rows every iteration. Prove it with: TTFT, TPOT, and useful tokens/sec. Deployment decision: reserve service for both phases.]
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how continuous batching changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metrics that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Token budgets are usually more stable than sequence-count limits. Decode cost scales with active sequences and context lengths, while prefill cost scales with prompt tokens. Modern schedulers often maintain separate budgets or chunk prefills to fit decode work around them.
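One way to fit prefill around decode is to serve decode rows first and hand the leftover budget to prompt chunks. The sketch below is an assumption about how such a planner could look, not a description of a specific engine; `plan_iteration`, `max_chunk`, and the example numbers are illustrative.

```python
def plan_iteration(num_decoding, prefill_remaining, token_budget, max_chunk):
    """Plan one iteration.

    num_decoding: active decode rows, one token each.
    prefill_remaining: unprocessed prompt tokens per waiting request, in queue order.
    Returns (decode_tokens, chunks), where chunks[i] is the prefill chunk
    granted to the i-th waiting request.
    """
    decode_tokens = min(num_decoding, token_budget)
    left = token_budget - decode_tokens
    chunks = []
    for remaining in prefill_remaining:
        if left == 0:
            break
        take = min(remaining, max_chunk, left)  # cap by prompt, chunk size, budget
        chunks.append(take)
        left -= take
    return decode_tokens, chunks


# 32 decode rows, two prompts of 1000 and 200 tokens, 256-token budget.
print(plan_iteration(32, [1000, 200], token_budget=256, max_chunk=128))
```

Serving decode first protects inter-token latency at the expense of TTFT for long prompts; a real policy would also need a rule for what happens when decode alone exceeds the budget.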

[Diagram: Continuous batching trade-off map. Baseline, static request batch: membership fixed to completion; short jobs leave holes; simple accounting; poor mixed-length utilization. Optimized, continuous batch: membership changes each step; finished slots refill; token and KV budgeting; higher scheduler complexity. Measure both sides under the same workload.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, a static request batch, whose membership is fixed until completion so that short jobs leave holes. The right panel applies continuous batching, where membership changes each step and finished slots refill, at the cost of token and KV budgeting and a more complex scheduler. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Online generation with varied output lengths
  • Steady concurrent traffic
  • Paged KV systems that support flexible admission

Where it disappoints

  • Limiting only by request count
  • Ignoring fairness under short-job traffic
  • Preempting without accounting for recompute cost
  • Mixing long prefills into decode iterations blindly

Production checklist

  • Set token and KV admission budgets
  • Define fairness and priority policy
  • Bound preemption and recomputation
  • Propagate cancellation before next iteration
  • Stress mixed prompt and output lengths

What to measure

  • Active sequences and tokens per iteration
  • Admission wait and starvation age
  • Preemptions, swaps, and recomputed tokens
  • Batch occupancy over time
  • TPOT by priority class

From one GPU to a production service

A single engine's scheduler can rely on local queue age. A fleet must also route new requests among replicas. The global router should consider queue depth, KV locality, model capability, and admission signals without attempting to micromanage token iterations.

Autoscaling has delayed effects. A new replica must load weights and warm its kernels before it can drain the queue. Meanwhile, aggressive routing to the least-loaded cold replica may worsen latency. Publish readiness and capacity as separate signals.

Fairness spans retries. If a preempted request returns to the head of the queue indefinitely, it can dominate capacity; if it returns to the tail, it may starve. Preserve service attained and age across preemption.
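A hedged sketch of one such policy: order the queue by service already attained minus a credit for time spent in the system, and keep both values when a request is preempted. The score formula, the `age_credit` weight, and the field names are assumptions chosen for illustration.

```python
def queue_score(arrival_time, attained_tokens, now, age_credit=10.0):
    """Lower score runs first. age_credit is in tokens per second of age."""
    return attained_tokens - age_credit * (now - arrival_time)


def pick_next(requests, now):
    """Pick the request id with the lowest score at time `now`."""
    best = min(requests, key=lambda r: queue_score(r["arrival"], r["attained"], now))
    return best["id"]


# A preempted request keeps its original arrival time and attained service.
preempted = {"id": "preempted", "arrival": 0.0, "attained": 500}

early = [preempted, {"id": "fresh", "arrival": 9.0, "attained": 0}]
print(pick_next(early, now=10.0))

late = [preempted, {"id": "fresh-2", "arrival": 99.0, "attained": 0}]
print(pick_next(late, now=100.0))
```

Shortly after preemption the fresh request wins because the preempted one has already received 500 tokens of service. Once enough time passes, the preserved age outweighs that service and the preempted request is resumed, so it neither monopolizes the head nor waits at the tail forever.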

Design-review questions

  • Which decisions belong to global routing versus local scheduling?
  • How is fairness preserved across preemption?
  • What capacity signal excludes cold replicas?
  • Can adapters and sampling modes share iterations?
  • Which overload limit rejects before KV exhaustion?

How it connects to the rest of the series

Dynamic batching groups arrivals before an execution. Continuous batching changes membership during generation. PagedAttention makes the KV allocation flexible enough for this scheduler.

From equation to implementation

At each iteration the scheduler solves a constrained packing problem: maximize useful token work subject to KV blocks, token budget, sequence count, adapter compatibility, and deadlines. Exact optimization is too expensive, so engines use greedy priority and admission heuristics.
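One way to write the packing problem down, using symbols introduced here for illustration: let x_i = 1 when candidate i (a decode row or a prefill chunk) is scheduled, w_i its useful token work (optionally weighted by priority or deadline urgency), t_i the tokens it consumes, and b_i the KV blocks it needs. T, B, and S are the per-iteration token budget, free KV blocks, and sequence limit.

```latex
\begin{aligned}
\max_{x \in \{0,1\}^{n}} \quad & \sum_{i=1}^{n} w_i \, x_i \\
\text{subject to} \quad & \sum_{i=1}^{n} t_i \, x_i \le T, \qquad
                          \sum_{i=1}^{n} b_i \, x_i \le B, \qquad
                          \sum_{i=1}^{n} x_i \le S .
\end{aligned}
```

Adapter compatibility and deadlines would add further constraints that are omitted here. Even in this reduced form the problem has the shape of a multidimensional 0-1 knapsack, which is why a greedy pass in priority order is the practical choice inside a loop that runs every iteration.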

Fairness can be expressed through virtual time, age, deficit tokens, or priority deadlines. Shortest-job-like policies improve mean latency but can starve long generations. Production schedulers need a measurable fairness contract.
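Deficit accounting can be sketched with a deficit-round-robin pass over per-tenant queues. This is a simplified illustration under stated assumptions: each queue entry is a request's token cost, `quantum` is the credit granted per round, and the tenant names are invented.

```python
from collections import deque


def drr_round(queues, deficits, quantum, token_budget):
    """Run one deficit-round-robin pass and return the scheduled (tenant, cost) pairs.

    queues: dict of tenant -> deque of request token costs (head of line first).
    deficits: dict of tenant -> unused credit carried between rounds (mutated).
    """
    scheduled = []
    for tenant, queue in queues.items():
        if not queue:
            deficits[tenant] = 0  # idle tenants do not bank credit
            continue
        deficits[tenant] += quantum
        while queue and queue[0] <= deficits[tenant] and queue[0] <= token_budget:
            cost = queue.popleft()
            deficits[tenant] -= cost
            token_budget -= cost
            scheduled.append((tenant, cost))
    return scheduled


queues = {"chat": deque([10, 10, 10]), "docs": deque([150])}
deficits = {"chat": 0, "docs": 0}

print(drr_round(queues, deficits, quantum=100, token_budget=200))
print(drr_round(queues, deficits, quantum=100, token_budget=200))
```

In the first round the 150-token request exceeds its 100-token credit and waits, but the credit carries over, so the second round admits it. That carry-over is the measurable contract: a large request is delayed by a bounded number of rounds rather than indefinitely.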

Implementation sketch

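while running:
    # Reap first so freed slots and KV blocks are available to this iteration.
    retire_finished_and_cancelled(active)
    reclaim_kv_blocks()
    waiting.update_age_and_deadlines()
    # Decode rows draw on the budget first; prefill chunks take what is left.
    budget = iteration_token_budget
    admit_decode_tokens(active, budget)
    admit_prefill_chunks(waiting, remaining(budget))
    # Reserve KV for everything scheduled before launching the step.
    reserve_all_required_kv()
    outputs = engine.step(active_batch)
    update_sequence_state(outputs)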

Capacity planning

Maximum active sequences is less useful than maximum scheduled tokens and maximum KV blocks. Reserve capacity for one iteration of growth, speculative branches if enabled, and emergency admission for high-priority traffic.
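A back-of-the-envelope calculation shows how these limits relate. The KV footprint per token is 2 (keys and values) times layers times KV heads times head dimension times bytes per element; everything else below, including the model shape, pool size, block size, and reserve figures, is a hypothetical example rather than a recommendation.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def plan_kv_pool(pool_bytes, block_tokens, bytes_per_token, max_active, emergency_blocks):
    """Return (total_blocks, blocks available for normal admission)."""
    total = pool_bytes // (block_tokens * bytes_per_token)
    # Assumed worst case: every active sequence crosses a block boundary
    # in the same iteration and needs one new block.
    growth_reserve = max_active
    return total, max(0, total - growth_reserve - emergency_blocks)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
total, admittable = plan_kv_pool(
    pool_bytes=40 * 2**30,   # 40 GiB left for KV after weights and activations
    block_tokens=16,
    bytes_per_token=per_token,
    max_active=256,
    emergency_blocks=512,    # held back for high-priority admission
)
print(per_token, total, admittable)
```

Speculative branches, if enabled, would enlarge the growth reserve because each branch can extend KV state before it is accepted or discarded.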

Benchmarking without fooling yourself

  • Use a trace with mixed arrivals and output lengths.
  • Plot latency by request size and priority, not only aggregate p99.
  • Force block pressure to exercise preemption.
  • Compare useful tokens per iteration with scheduled token slots.
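The second point can be made concrete with a small helper that reports p99 latency per request-size bucket. The bucket edges and sample values are made up, and the nearest-rank percentile is one of several common definitions.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def p99_by_bucket(samples, edges=(128, 1024)):
    """samples: iterable of (output_tokens, latency_seconds) pairs."""
    buckets = defaultdict(list)
    for output_tokens, latency_s in samples:
        label = next((f"<={e}" for e in edges if output_tokens <= e), f">{edges[-1]}")
        buckets[label].append(latency_s)
    return {label: percentile(vals, 99) for label, vals in buckets.items()}


samples = [(50, 0.5), (100, 0.7), (500, 3.0), (4000, 40.0)]
print(p99_by_bucket(samples))
```

An aggregate p99 over this trace would be dominated by the single long request and would say nothing about whether short requests regressed, which is why the split matters.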

A production failure to design for

A policy always admits short prompts first. Under steady chat traffic, a long document request remains queued for minutes despite available partial capacity. Add aging or deficit-based fairness and alert on oldest-request age.

[Diagram: Operational loop. (1) Budget: tokens, KV, and slots; reserve growth. (2) Schedule: deadlines and fairness; chunk prefills. (3) Execute: one bounded iteration; retire promptly. (4) Audit: starvation and waste; preemption cost. Cycle: measure → learn → repeat.]
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Budget to Schedule, then Execute and Audit. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Continuous batching schedules at decoder-iteration granularity. Finished sequences leave immediately, newly admitted sequences join open slots, and prefill work competes with decode work under a token budget. The batch is therefore a changing set of sequence states rather than a fixed request group.

[Diagram: One continuous-batching iteration. Reap: remove finished rows; release KV blocks; finalize responses. Admit: check token budget; reserve KV growth; honor priority. Schedule: mix prefill and decode; build token batch; bound chunk size. Execute: run one model step; update every sequence; repeat immediately. Note: the scheduler accounts for tokens and memory, not a fixed request count.]
Iteration-level admission keeps GPU slots productive as sequences finish at different times.

How to read this diagram: Follow the state from Reap through Admit and Schedule to Execute. Each box is an ownership or computation boundary. In particular, the scheduler accounts for tokens and memory, not a fixed request count. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Prefill and decode have different shapes. Large prefill chunks improve compute efficiency but can delay latency-sensitive decode. A token-budget scheduler caps total work per iteration and may chunk prompts so decode rows continue making progress. Fairness needs attained-service or deficit accounting because long prompts otherwise consume repeated large allocations.

[Diagram: Static batches leave holes; continuous batches refill them. Bars: static batch (idle after finishes) versus continuous batch (slots stay active). Scheduler risk: greedy admission can starve decode or long prompts. Utilization gain: completed rows are replaced without draining the batch.]
Higher utilization is valuable only while per-request latency remains bounded.

How to read this diagram: The bars compare Static batch with Continuous batch on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “Completed rows are replaced without draining the batch.”, remains larger than the risk, “Greedy admission can starve decode or long prompts.”, under production traffic.

Memory admission precedes compute admission. A request that fits the next token budget may not fit future KV growth. Reserve prompt blocks plus a bounded decode allowance, then update reservations as output progresses. Reject or queue before allocator exhaustion.
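A sketch of that reservation scheme, with the class name `KVPool`, the 16-token block size, and the allowance values all chosen for illustration. Admission reserves blocks for the prompt plus a bounded decode allowance, and the reservation is topped up as output grows.

```python
class KVPool:
    """Tracks reserved KV blocks per request (illustrative, not a real allocator)."""

    def __init__(self, total_blocks, block_tokens=16):
        self.free = total_blocks
        self.block_tokens = block_tokens
        self.reserved = {}  # request_id -> reserved block count

    def _blocks(self, tokens):
        return -(-tokens // self.block_tokens)  # ceiling division

    def admit(self, request_id, prompt_tokens, decode_allowance):
        need = self._blocks(prompt_tokens + decode_allowance)
        if need > self.free:
            return False  # queue or reject before the allocator is exhausted
        self.free -= need
        self.reserved[request_id] = need
        return True

    def extend(self, request_id, tokens_so_far, decode_allowance):
        """Top up the reservation as output progresses."""
        need = self._blocks(tokens_so_far + decode_allowance)
        extra = need - self.reserved[request_id]
        if extra <= 0:
            return True
        if extra > self.free:
            return False  # caller must pause, preempt, or swap
        self.free -= extra
        self.reserved[request_id] = need
        return True

    def release(self, request_id):
        self.free += self.reserved.pop(request_id)


pool = KVPool(total_blocks=10)
print(pool.admit("a", prompt_tokens=100, decode_allowance=32))  # reserves 9 blocks
print(pool.admit("b", prompt_tokens=16, decode_allowance=16))   # needs 2, 1 free
```

Request "b" would fit the next token budget easily, yet it is refused because its future KV growth cannot be covered, which is the distinction the paragraph draws.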

[Diagram: Sequence state in a live batch. Waiting (deadline and quota) → Prefilling (prompt chunks run) → Decoding (one token per step) → Finished (slot and KV released). Invariant: a sequence may move from prefill to decode only after its prompt state is complete.]
The scheduler coordinates heterogeneous sequence phases inside one iteration loop.

How to read this diagram: State advances from Waiting to Prefilling, Decoding, and finally Finished. The labels below each state identify what becomes true at that boundary. The governing invariant is: A sequence may move from prefill to decode only after its prompt state is complete. Retries and cancellation must preserve the same transition rules.
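The transition rules can be encoded so that the invariant is enforced in one place. This is a minimal sketch: the `SeqState` names mirror the four states above, while the `done` flag (covering both completion and cancellation) and the function signature are assumptions.

```python
from enum import Enum


class SeqState(Enum):
    WAITING = "waiting"
    PREFILLING = "prefilling"
    DECODING = "decoding"
    FINISHED = "finished"


def next_state(state, prefilled_tokens, prompt_tokens, done=False):
    """Return the state after one scheduler iteration."""
    if done:
        return SeqState.FINISHED  # completion or cancellation, from any state
    if state is SeqState.WAITING:
        return SeqState.PREFILLING
    if state is SeqState.PREFILLING:
        # Invariant: decode starts only once the whole prompt has been processed.
        if prefilled_tokens >= prompt_tokens:
            return SeqState.DECODING
        return SeqState.PREFILLING
    return state  # DECODING continues; FINISHED is terminal


print(next_state(SeqState.PREFILLING, prefilled_tokens=128, prompt_tokens=300))
print(next_state(SeqState.PREFILLING, prefilled_tokens=300, prompt_tokens=300))
```

A preempted sequence that is later recomputed would re-enter through the same function, so a retry cannot skip the prefill check.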

[Diagram: Four scheduler objectives. Goodput: useful tokens per second; avoid idle slots. Latency: TTFT and inter-token; deadline urgency. Fairness: tenant and request aging; attained service. Memory: KV block envelope; fragmentation reserve. Note: expose per-iteration prefill tokens, decode tokens, and admission reason.]
Continuous batching is a multi-objective online scheduler.

How to read this diagram: The four panels are independent review axes: Goodput, Latency, Fairness, and Memory. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Expose per-iteration prefill tokens, decode tokens, and admission reason.

[Diagram: Prefill greed can freeze active decoders. Long prompts arrive (scheduler fills budget; prefill kernels run long) → Decode waits (active users see gaps; inter-token SLO fails) → Retries begin (more prompts enter; queue pressure rises). Control: chunk prefill and reserve decode budget; protect a minimum decode service share during every overloaded iteration.]
Maximum throughput is not goodput when active streams stop making progress.

How to read this diagram: This is a causal chain, not a set of unrelated symptoms. The arrival of long prompts makes decode wait, and the resulting gaps in active streams trigger retries that add more prompts and more queue pressure. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition, a burst of long prompts.

The takeaway

Continuous batching turns generation into a living schedule. Its value comes from dense GPU work; its quality comes from fair admission and honest memory accounting.