17/20 - Continuous Batching: The GPU Schedule That Never Stands Still

#continuous-batching #iteration-level-scheduling #llm-serving #scheduler #vllm

A static batch waits for its longest sequence. A continuous batch changes membership after every decoding iteration: finished requests leave, waiting requests enter, and the GPU keeps working on the largest useful set of active sequences.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A shuttle that waits for every passenger’s entire vacation is inefficient. Continuous batching behaves like a metro: riders leave at each stop and new riders board, while the train keeps moving.

Follow the state and work from left to right.

Description: Start with Waiting requests, where admission and the token budget are decided. The middle stage, Decode scheduler, runs one token step. The final stage, Active batch, shows the observable result: membership changes every step. The arrows describe dependency order, not necessarily separate services.

What actually happens

Iteration-level scheduling runs one model iteration for the current active set, updates each sequence, retires completed or cancelled requests, and admits new work before the next iteration. Request lifetimes no longer define fixed batch boundaries.

The scheduler budgets tokens and KV blocks, not merely request count. A long prompt can consume far more work than one decode token, and a nearly full KV pool may prevent admission even when compute appears idle.

Preemption becomes possible when high-priority work arrives or memory is exhausted. The engine may swap KV state, recompute evicted prefixes, or pause lower-priority sequences. Each choice changes latency and bandwidth.

A worked example

Four sequences start together with output lengths 8, 20, 60, and 100. A static batch carries empty slots after the short sequences finish. Continuous batching replaces each completed sequence with a waiting request at the next iteration, keeping the effective batch dense.
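The arithmetic behind this example can be checked with a short sketch. It assumes four slots, one token per active sequence per iteration, and, for the continuous case, a hypothetical backlog of waiting requests that each need 30 tokens. The function names are illustrative and not taken from any engine.

```python
from collections import deque

def static_occupancy(lengths):
    """Useful fraction of slot-steps when the batch runs until its longest sequence ends."""
    steps = max(lengths)
    return sum(lengths) / (steps * len(lengths))

def continuous_occupancy(lengths, backlog, steps):
    """Useful fraction of slot-steps when a finished slot refills at the next iteration."""
    slots = list(lengths)      # remaining tokens per slot
    waiting = deque(backlog)   # output lengths of queued requests
    useful = 0
    for _ in range(steps):
        # Refill empty slots before the iteration runs.
        for i, remaining in enumerate(slots):
            if remaining == 0 and waiting:
                slots[i] = waiting.popleft()
        # Every non-empty slot produces one token.
        for i, remaining in enumerate(slots):
            if remaining > 0:
                slots[i] = remaining - 1
                useful += 1
    return useful / (steps * len(slots))

lengths = [8, 20, 60, 100]
print(static_occupancy(lengths))                      # 188 useful of 400 slot-steps: 0.47
print(continuous_occupancy(lengths, [30] * 20, 100))  # 1.0 while the backlog lasts
```

Under these assumptions the static batch does useful work in fewer than half of its slot-steps. The continuous figure depends entirely on having a backlog to draw from; with an empty queue it falls back to the static number.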

The performance model

Throughput improves by reducing idle batch slots. Tail latency depends on admission, fairness, maximum active tokens, and preemption. A throughput-maximizing scheduler can starve large or low-priority requests without explicit policy.

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how continuous batching changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Token budgets are usually more stable than sequence-count limits. Decode cost scales with active sequences and context lengths, while prefill cost scales with prompt tokens. Modern schedulers often maintain separate budgets or chunk prefills to fit decode work around them.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, Static request batch: membership is fixed until completion, and short jobs leave holes. The right panel applies Continuous batch: membership changes each step, and finished slots refill. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Online generation with varied output lengths
Steady concurrent traffic
Paged KV systems that support flexible admission

Where it disappoints

Limiting only by request count
Ignoring fairness under short-job traffic
Preempting without accounting for recompute cost
Mixing long prefills into decode iterations blindly

Production checklist

Set token and KV admission budgets
Define fairness and priority policy
Bound preemption and recomputation
Propagate cancellation before next iteration
Stress mixed prompt and output lengths

What to measure

Active sequences and tokens per iteration
Admission wait and starvation age
Preemptions, swaps, and recomputed tokens
Batch occupancy over time
TPOT by priority class

From one GPU to a production service

A single engine's scheduler can rely on local queue age. A fleet must also route new requests among replicas. The global router should consider queue depth, KV locality, model capability, and admission signals without attempting to micromanage token iterations.

Autoscaling has delayed effects. A new replica must load weights and warm kernels before it can drain the queue. Meanwhile, aggressive routing to the least-loaded cold replica may worsen latency. Publish readiness and capacity separately.

Fairness spans retries. If a preempted request returns to the head of the queue indefinitely, it can dominate capacity; if it returns to the tail, it may starve. Preserve service attained and age across preemption.
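One way to carry that state across preemption is to key the queue on attained service and original arrival time, so a preempted request re-enters at a position determined by what it has already received. The sketch below is a minimal illustration of that idea using least-attained-service ordering; the `FairQueue` class and its fields are hypothetical, and a real policy would likely combine this with priority and aging.

```python
import heapq

class FairQueue:
    """Orders requests by tokens already served, then by original arrival time."""

    def __init__(self):
        self._heap = []

    def push(self, req_id, arrival, attained_tokens=0):
        # A preempted request is pushed back with its original arrival time
        # and the service it has attained so far, not as a fresh entry.
        heapq.heappush(self._heap, (attained_tokens, arrival, req_id))

    def pop(self):
        attained, arrival, req_id = heapq.heappop(self._heap)
        return req_id, arrival, attained

queue = FairQueue()
queue.push("a", arrival=0, attained_tokens=500)  # preempted after 500 tokens
queue.push("b", arrival=5)                       # new, nothing served yet
queue.push("c", arrival=6, attained_tokens=800)  # preempted after 800 tokens
print([queue.pop()[0] for _ in range(3)])        # ['b', 'a', 'c']
```

Here the preempted request "a" neither jumps ahead of the unserved request "b" nor falls behind "c", which has already received more service.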

Design-review questions

Which decisions belong to global routing versus local scheduling?
How is fairness preserved across preemption?
What capacity signal excludes cold replicas?
Can adapters and sampling modes share iterations?
Which overload limit rejects before KV exhaustion?

How it connects to the rest of the series

Dynamic batching groups arrivals before an execution. Continuous batching changes membership during generation. PagedAttention makes the KV allocation flexible enough for this scheduler.

From equation to implementation

At each iteration the scheduler solves a constrained packing problem: maximize useful token work subject to KV blocks, token budget, sequence count, adapter compatibility, and deadlines. Exact optimization is too expensive, so engines use greedy priority and admission heuristics.
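A greedy heuristic of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not any engine's actual policy: `Candidate`, its fields, and the "higher priority number is more urgent" convention are all hypothetical, and adapter compatibility and deadlines are left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    priority: int   # higher is more urgent (assumed convention)
    tokens: int     # tokens this candidate would run this iteration
    kv_blocks: int  # new KV blocks it would need

def greedy_admit(candidates, token_budget, free_kv_blocks, max_seqs):
    """Admit candidates in priority order while every budget still holds."""
    admitted = []
    for c in sorted(candidates, key=lambda c: -c.priority):
        if len(admitted) >= max_seqs:
            break
        # Skip anything that would break the token or KV budget.
        if c.tokens <= token_budget and c.kv_blocks <= free_kv_blocks:
            admitted.append(c.name)
            token_budget -= c.tokens
            free_kv_blocks -= c.kv_blocks
    return admitted

candidates = [
    Candidate("decode-a", priority=3, tokens=1, kv_blocks=0),
    Candidate("long-prefill", priority=2, tokens=512, kv_blocks=32),
    Candidate("short-prefill", priority=1, tokens=64, kv_blocks=4),
]
print(greedy_admit(candidates, token_budget=256, free_kv_blocks=16, max_seqs=8))
# ['decode-a', 'short-prefill']
```

The long prefill is skipped even though it outranks the short one, because it fits neither budget. Repeated every iteration, that skip is exactly how a greedy packer starves large requests unless a fairness rule intervenes.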

Fairness can be expressed through virtual time, age, deficit tokens, or priority deadlines. Shortest-job-like policies improve mean latency but can starve long generations. Production schedulers need a measurable fairness contract.

Implementation sketch

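while running:
    # Reap: drop finished or cancelled sequences and free their KV blocks.
    retire_finished_and_cancelled(active)
    reclaim_kv_blocks()
    waiting.update_age_and_deadlines()
    # Schedule: decode rows first, then prefill chunks from what remains.
    budget = iteration_token_budget
    budget -= admit_decode_tokens(active, budget)
    admit_prefill_chunks(waiting, active, budget)
    # Reserve KV for everything scheduled before launching the step.
    reserve_all_required_kv(active)
    outputs = engine.step(active)
    update_sequence_state(active, outputs)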
Capacity planning

A cap on active sequences is a less useful planning limit than caps on scheduled tokens and KV blocks. Reserve capacity for one iteration of growth, speculative branches if enabled, and emergency admission for high-priority traffic.

Benchmarking without fooling yourself

Use a trace with mixed arrivals and output lengths.
Plot latency by request size and priority, not only aggregate p99.
Force block pressure to exercise preemption.
Compare useful tokens per iteration with scheduled token slots.

A production failure to design for

A policy always admits short prompts first. Under steady chat traffic, a long document request remains queued for minutes despite available partial capacity. Add aging or deficit-based fairness and alert on oldest-request age.
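The failure and the aging fix can be reproduced in a toy model. The sketch below assumes one admission per step, a fresh 50-token prompt arriving every step, and a score equal to prompt length minus a credit for time spent waiting; the scoring formula, `aging_weight`, and the request fields are illustrative assumptions rather than a recommended policy.

```python
def pick_next(waiting, now, aging_weight):
    """Pick the lowest score: prompt length minus credit for time spent waiting."""
    return min(
        waiting,
        key=lambda r: r["prompt_tokens"] - aging_weight * (now - r["arrival"]),
    )

def steps_until_long_runs(aging_weight, horizon=1000):
    """Admit one request per step while a fresh short prompt arrives every step."""
    waiting = [{"id": "long", "prompt_tokens": 4000, "arrival": 0}]
    for now in range(horizon):
        waiting.append({"id": f"short-{now}", "prompt_tokens": 50, "arrival": now})
        chosen = pick_next(waiting, now, aging_weight)
        waiting.remove(chosen)
        if chosen["id"] == "long":
            return now
    return None  # the long request never ran within the horizon

print(steps_until_long_runs(aging_weight=0))   # None: pure shortest-first starves it
print(steps_until_long_runs(aging_weight=10))  # a finite step count once aging applies
```

With `aging_weight=0` this is pure shortest-prompt-first and the long request never runs. Any positive weight bounds its wait, and the weight itself becomes the tunable part of the fairness contract.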

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Budget to Schedule, then Execute and Audit. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Continuous batching schedules at decoder-iteration granularity. Finished sequences leave immediately, newly admitted sequences join open slots, and prefill work competes with decode work under a token budget. The batch is therefore a changing set of sequence states rather than a fixed request group.

Iteration-level admission keeps GPU slots productive as sequences finish at different times.

Description: Follow the state from Reap through Admit and Schedule to Execute. Each box is an ownership or computation boundary. In particular, the scheduler accounts for tokens and memory, not a fixed request count. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Prefill and decode have different shapes. Large prefill chunks improve compute efficiency but can delay latency-sensitive decode. A token-budget scheduler caps total work per iteration and may chunk prompts so decode rows continue making progress. Fairness needs attained-service or deficit accounting because long prompts otherwise consume repeated large allocations.
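A token-budget split of this kind might look like the sketch below. It assumes decode rows are served first at one token each and that prompts are chunked in queue order; `plan_iteration`, `max_chunk`, and the dictionary layout are hypothetical names chosen for illustration.

```python
def plan_iteration(decode_seqs, prefill_remaining, token_budget, max_chunk):
    """Return (decode_tokens, {request: prefill_chunk}) for one iteration.

    decode_seqs: number of decoding sequences, one token each.
    prefill_remaining: request id -> prompt tokens still unprocessed, in queue order.
    """
    # Decode rows go first so active streams keep making progress.
    decode_tokens = min(decode_seqs, token_budget)
    left = token_budget - decode_tokens
    chunks = {}
    for req_id, remaining in prefill_remaining.items():
        if left == 0:
            break
        # A chunk is capped by the prompt, the chunk size, and the leftover budget.
        chunk = min(remaining, max_chunk, left)
        chunks[req_id] = chunk
        left -= chunk
    return decode_tokens, chunks

print(plan_iteration(24, {"doc": 3000, "chat": 40}, token_budget=256, max_chunk=128))
# (24, {'doc': 128, 'chat': 40})
```

The long document gets a bounded chunk instead of the whole iteration, and the short chat prompt still fits in the same step. The sketch does no deficit accounting, so the document would keep claiming a full chunk every iteration until its prompt is done.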

Higher utilization is valuable only while per-request latency remains bounded.

Description: The bars compare Static batch with Continuous batch on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain ("Completed rows are replaced without draining the batch") remains larger than the risk ("Greedy admission can starve decode or long prompts") under production traffic.

Memory admission precedes compute admission. A request that fits the next token budget may not fit future KV growth. Reserve prompt blocks plus a bounded decode allowance, then update reservations as output progresses. Reject or queue before allocator exhaustion.
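The reservation rule can be sketched as a small block pool. Everything here is an illustrative assumption: the 16-token block size, the `KVPool` class, and the choice to express the decode allowance in tokens. Real allocators also track sharing and fragmentation, which this sketch ignores.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV block (assumed for illustration)

def blocks_for(tokens):
    return math.ceil(tokens / BLOCK_TOKENS)

class KVPool:
    def __init__(self, total_blocks):
        self.free = total_blocks
        self.reserved = {}  # request id -> blocks currently reserved

    def try_admit(self, req_id, prompt_tokens, decode_allowance):
        """Reserve prompt blocks plus a bounded decode allowance, or refuse."""
        need = blocks_for(prompt_tokens + decode_allowance)
        if need > self.free:
            return False  # queue or reject before the allocator is exhausted
        self.free -= need
        self.reserved[req_id] = need
        return True

    def grow(self, req_id, total_tokens, decode_allowance):
        """Top up the reservation as output progresses; False means pause or preempt."""
        need = blocks_for(total_tokens + decode_allowance)
        extra = need - self.reserved[req_id]
        if extra > self.free:
            return False
        if extra > 0:
            self.free -= extra
            self.reserved[req_id] = need
        return True

    def release(self, req_id):
        self.free += self.reserved.pop(req_id)

pool = KVPool(total_blocks=10)
print(pool.try_admit("a", prompt_tokens=100, decode_allowance=28))  # True, 8 blocks
print(pool.try_admit("b", prompt_tokens=40, decode_allowance=8))    # False, needs 3 of 2 free
```

Request "b" would fit a per-iteration token budget easily, yet it is refused because the pool cannot cover its prompt plus allowance. That is the sense in which memory admission comes first.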

The scheduler coordinates heterogeneous sequence phases inside one iteration loop.

Description: State advances from Waiting to Prefilling, Decoding, and finally Finished. The labels below each state identify what becomes true at that boundary. The governing invariant is: A sequence may move from prefill to decode only after its prompt state is complete. Retries and cancellation must preserve the same transition rules.
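The transition rules can be made explicit in code so that retries and cancellation go through the same checks. The sketch below is one possible encoding under stated assumptions: the `Phase` and `Sequence` names are hypothetical, and cancellation is modeled as a move to Finished from any earlier phase.

```python
from enum import Enum

class Phase(Enum):
    WAITING = "waiting"
    PREFILLING = "prefilling"
    DECODING = "decoding"
    FINISHED = "finished"

class Sequence:
    def __init__(self, prompt_tokens):
        self.phase = Phase.WAITING
        self.prompt_tokens = prompt_tokens
        self.prefilled = 0

    def admit(self):
        if self.phase is not Phase.WAITING:
            raise ValueError("only waiting sequences can be admitted")
        self.phase = Phase.PREFILLING

    def prefill(self, chunk):
        if self.phase is not Phase.PREFILLING:
            raise ValueError("sequence is not prefilling")
        self.prefilled = min(self.prompt_tokens, self.prefilled + chunk)
        # Invariant: decode starts only once the whole prompt is processed.
        if self.prefilled == self.prompt_tokens:
            self.phase = Phase.DECODING

    def finish(self):
        # Completion or cancellation; allowed from any earlier phase.
        self.phase = Phase.FINISHED

seq = Sequence(prompt_tokens=100)
seq.admit()
seq.prefill(64)
print(seq.phase)  # Phase.PREFILLING: the prompt is only partly processed
seq.prefill(64)
print(seq.phase)  # Phase.DECODING: the prompt state is complete
```

Because the only path into Decoding runs through `prefill`, a chunked prompt cannot start decoding early, whichever code path drives it.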

Continuous batching is a multi-objective online scheduler.

Description: The four panels are independent review axes: Goodput, Latency, Fairness, and Memory. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Expose per-iteration prefill tokens, decode tokens, and admission reason.

Maximum throughput is not goodput when active streams stop making progress.

Description: This is a causal chain, not four unrelated symptoms. Long prompts arrive, which triggers Decode waits, which in turn leads to Retries begin. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Continuous batching turns generation into a living schedule. Its value comes from dense GPU work; its quality comes from fair admission and honest memory accounting.
