Continuous Batching: The GPU Schedule That Never Stands Still
A static batch waits for its longest sequence. A continuous batch changes membership after every decoding iteration: finished requests leave, waiting requests enter, and the GPU keeps working on the largest useful set of active sequences.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A static batch is like a charter shuttle that stays reserved until the last passenger’s vacation ends, even after everyone else has gone home. Continuous batching behaves like a metro: riders leave at each stop and new riders board, while the train keeps moving.
What actually happens
Iteration-level scheduling runs one model iteration for the current active set, updates each sequence, retires completed or cancelled requests, and admits new work before the next iteration. Request lifetimes no longer define fixed batch boundaries.
The scheduler budgets tokens and KV blocks, not merely request count. A long prompt can consume far more work than one decode token, and a nearly full KV pool may prevent admission even when compute appears idle.
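A minimal sketch of such an admission check, assuming a hypothetical `Request` record, a 16-token KV block, and a prompt that is prefilled whole in one iteration. None of these names or values come from a specific engine.

```python
from dataclasses import dataclass
from math import ceil

BLOCK_SIZE = 16  # tokens per KV block (assumed value)


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int


def kv_blocks_needed(tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Blocks required to hold `tokens` tokens of KV state."""
    return ceil(tokens / block_size)


def can_admit(req: Request, tokens_left: int, free_blocks: int) -> bool:
    """Admit only if both the token budget and the KV pool have room."""
    # The prompt plus the first generated token must fit in the pool.
    blocks = kv_blocks_needed(req.prompt_tokens + 1)
    return req.prompt_tokens <= tokens_left and blocks <= free_blocks
```

With these assumptions, a 100-token prompt needs 7 blocks, so it is refused when only 4 blocks are free even though most of a 512-token budget is unused.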
Preemption becomes possible when high-priority work arrives or memory is exhausted. The engine may swap KV state, recompute evicted prefixes, or pause lower-priority sequences. Each choice changes latency and bandwidth.
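The trade-off between swapping and recomputing can be estimated with a rough cost model. The sketch below is illustrative only: the function names and every bandwidth and throughput figure are assumptions, and it ignores any overlap between transfers and compute.

```python
def swap_seconds(kv_bytes: float, host_link_bytes_per_s: float) -> float:
    """Time to copy KV state out to host memory and back in again."""
    return 2 * kv_bytes / host_link_bytes_per_s


def recompute_seconds(context_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the evicted KV state by re-running prefill."""
    return context_tokens / prefill_tokens_per_s


def cheaper_preemption(context_tokens, kv_bytes_per_token,
                       host_link_bytes_per_s, prefill_tokens_per_s):
    """Return the cheaper option and its estimated cost in seconds."""
    swap = swap_seconds(context_tokens * kv_bytes_per_token, host_link_bytes_per_s)
    recompute = recompute_seconds(context_tokens, prefill_tokens_per_s)
    return ("swap", swap) if swap < recompute else ("recompute", recompute)
```

Both costs grow linearly with context length in this model, so the decision depends on the ratio of KV bytes per token to link bandwidth versus prefill throughput, not on the sequence length itself.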
A worked example
Four sequences start together with output lengths 8, 20, 60, and 100. A static batch carries empty slots after the short sequences finish. Continuous batching replaces each completed sequence with a waiting request at the next iteration, keeping the effective batch dense.
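The arithmetic behind this example can be checked with a short simulation. The four output lengths come from the example; the backlog of ten 50-token requests and the 100-iteration window are assumptions added for illustration.

```python
from collections import deque


def static_occupancy(lengths):
    """Useful slot-iterations divided by slots held until the longest finishes."""
    return sum(lengths) / (len(lengths) * max(lengths))


def continuous_occupancy(initial, waiting, iterations):
    """Refill freed slots from `waiting` before every iteration."""
    active = list(initial)        # remaining decode steps per active sequence
    queue = deque(waiting)
    capacity = len(initial)
    useful = 0
    for _ in range(iterations):
        while len(active) < capacity and queue:
            active.append(queue.popleft())
        useful += len(active)     # one token per active sequence
        active = [r - 1 for r in active if r > 1]
    return useful / (capacity * iterations)


print(static_occupancy([8, 20, 60, 100]))                        # 0.47
print(continuous_occupancy([8, 20, 60, 100], [50] * 10, 100))    # 1.0
```

The static batch does 188 useful token steps in 400 slot-iterations, or 47% occupancy. With a backlog to draw from, the continuous schedule keeps all four slots busy for the whole window.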
The performance model
Throughput improves by reducing idle batch slots. Tail latency depends on admission, fairness, maximum active tokens, and preemption. Without an explicit fairness policy, a throughput-maximizing scheduler can starve large or low-priority requests.
Expert lens
Token budgets are usually more stable than sequence-count limits. Decode cost scales with active sequences and context lengths, while prefill cost scales with prompt tokens. Modern schedulers often maintain separate budgets or chunk prefills to fit decode work around them.
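One way to fit prefill around decode is to serve decode tokens first and hand the leftover budget to prefill chunks. The sketch below assumes a hypothetical `plan_iteration` helper, one decode token per active sequence, and FIFO order among waiting prompts; real engines differ in all three.

```python
def plan_iteration(num_decoding, pending_prefills, token_budget):
    """Split one iteration's token budget between decode and prefill chunks.

    pending_prefills: remaining prompt tokens per waiting request, in FIFO order.
    Returns (decode_tokens, [(request_index, chunk_tokens), ...]).
    """
    decode_tokens = min(num_decoding, token_budget)
    left = token_budget - decode_tokens
    chunks = []
    for index, remaining in enumerate(pending_prefills):
        if left == 0:
            break
        take = min(remaining, left)   # a long prompt is split across iterations
        chunks.append((index, take))
        left -= take
    return decode_tokens, chunks
```

With 32 decoding sequences and a 512-token budget, a 1000-token prompt receives a 480-token chunk this iteration and continues in the next one, so decode latency is not held hostage by the long prefill.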
Where it wins
- Online generation with varied output lengths
- Steady concurrent traffic
- Paged KV systems that support flexible admission
Where it disappoints
- Limiting only by request count
- Ignoring fairness under short-job traffic
- Preempting without accounting for recompute cost
- Mixing long prefills into decode iterations blindly
Production checklist
- Set token and KV admission budgets
- Define fairness and priority policy
- Bound preemption and recomputation
- Propagate cancellation before the next iteration
- Stress mixed prompt and output lengths
What to measure
- Active sequences and tokens per iteration
- Admission wait and starvation age
- Preemptions, swaps, and recomputed tokens
- Batch occupancy over time
- TPOT by priority class
From one GPU to a production service
A single engine's scheduler can rely on local queue age. A fleet must also route new requests among replicas. The global router should consider queue depth, KV locality, model capability, and admission signals without attempting to micromanage token iterations.
Autoscaling has delayed effects. A new replica must load weights and warm kernels before it can drain the queue. Meanwhile, aggressive routing to the least-loaded cold replica may worsen latency. Publish readiness and capacity as separate signals so the router can tell an empty replica from a usable one.
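A small sketch of routing on separate readiness and capacity signals. The `Replica` fields and the tie-breaking rule are hypothetical; the point is only that a cold replica with an empty queue is filtered out before load is compared.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    ready: bool            # weights loaded and kernels warm
    free_kv_blocks: int    # capacity signal
    queued_requests: int   # load signal


def pick_replica(replicas, min_free_blocks=1):
    """Choose the least-loaded replica among those that are ready and have room."""
    candidates = [r for r in replicas
                  if r.ready and r.free_kv_blocks >= min_free_blocks]
    if not candidates:
        return None  # caller should queue or reject instead of routing cold
    return min(candidates, key=lambda r: (r.queued_requests, -r.free_kv_blocks))
```

A replica that reports zero queued requests only because it is still loading never wins the comparison, which avoids the least-loaded trap described above.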
Fairness must survive preemption and retries. If a preempted request returns to the head of the queue indefinitely, it can dominate capacity; if it returns to the tail, it may starve. Preserve attained service and age across preemption.
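One possible way to carry both quantities across preemption is to order the queue by attained service plus a weighted arrival time. The `FairQueue` class and the `age_weight` value below are assumptions for illustration, not a policy taken from any particular engine.

```python
import heapq


class FairQueue:
    """Order requests by attained service, discounted by how long they have waited."""

    def __init__(self, age_weight=10.0):
        # Tokens of attained service forgiven per second of age (assumed value).
        self.age_weight = age_weight
        self._heap = []
        self._seq = 0  # tie-breaker that keeps insertion order stable

    def push(self, request_id, arrival, attained_tokens=0):
        # Score at time `now` would be attained - age_weight * (now - arrival).
        # `now` is the same for every request, so a static key gives the same order.
        key = attained_tokens + self.age_weight * arrival
        heapq.heappush(self._heap, (key, self._seq, request_id))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

A preempted request is pushed back with its original arrival time and its updated token count. It lands neither at the head nor at the tail: newer requests with little service may pass it, but requests that arrive much later cannot.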
Design-review questions
- Which decisions belong to global routing versus local scheduling?
- How is fairness preserved across preemption?
- What capacity signal excludes cold replicas?
- Can adapters and sampling modes share iterations?
- Which overload limit rejects before KV exhaustion?
How it connects to the rest of the series
Dynamic batching groups arrivals before an execution. Continuous batching changes membership during generation. PagedAttention makes the KV allocation flexible enough for this scheduler.
From equation to implementation
At each iteration the scheduler solves a constrained packing problem: maximize useful token work subject to KV blocks, token budget, sequence count, adapter compatibility, and deadlines. Exact optimization is too expensive, so engines use greedy priority and admission heuristics.
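A greedy version of this packing step might look like the following. The `Candidate` fields and the sort key are hypothetical, and adapter compatibility and deadlines are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    request_id: str
    priority: int    # lower value is more urgent
    tokens: int      # tokens to schedule this iteration
    kv_blocks: int   # new KV blocks required


def greedy_pack(candidates, token_budget, free_blocks, max_seqs):
    """Admit candidates in priority order while every constraint still holds."""
    chosen = []
    for c in sorted(candidates, key=lambda c: (c.priority, c.tokens)):
        if len(chosen) == max_seqs:
            break
        if c.tokens <= token_budget and c.kv_blocks <= free_blocks:
            chosen.append(c.request_id)
            token_budget -= c.tokens
            free_blocks -= c.kv_blocks
    return chosen
```

A candidate that does not fit is skipped and smaller ones behind it may still be admitted. That choice raises utilization but is exactly the kind of behavior that needs a fairness rule to avoid starving the skipped request.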
Fairness can be expressed through virtual time, age, deficit tokens, or priority deadlines. Shortest-job-like policies improve mean latency but can starve long generations. Production schedulers need a measurable fairness contract.
Implementation sketch
while running:
    retire_finished_and_cancelled(active)
    reclaim_kv_blocks()
    waiting.update_age_and_deadlines()
    budget = iteration_token_budget
    admit_decode_tokens(active, budget)
    admit_prefill_chunks(waiting, remaining(budget))
    reserve_all_required_kv()
    outputs = engine.step(active_batch)
    update_sequence_state(outputs)
Capacity planning
Maximum active sequences is less useful than maximum scheduled tokens and maximum KV blocks. Reserve capacity for one iteration of growth, speculative branches if enabled, and emergency admission for high-priority traffic.
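A back-of-the-envelope calculation shows how these reserves reduce the admittable pool. The model shape, pool size, and reserve percentage below are assumed values chosen for round numbers, and speculative branches are not modeled.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    """Keys and values for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def admittable_blocks(pool_bytes, block_tokens, bytes_per_token,
                      max_active_seqs, emergency_percent):
    """Blocks left for normal admission after growth and emergency reserves."""
    total = pool_bytes // (block_tokens * bytes_per_token)
    growth = max_active_seqs          # worst case: every sequence opens a new block
    emergency = total * emergency_percent // 100
    return total - growth - emergency


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(admittable_blocks(40 * 1024**3, 16, per_token, 256, 5))  # 19200
```

Under these assumptions a 40 GiB pool holds 20480 blocks of 16 tokens, of which 256 are held back for one iteration of growth and 1024 for high-priority admission.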
Benchmarking without fooling yourself
- Use a trace with mixed arrivals and output lengths.
- Plot latency by request size and priority, not only aggregate p99.
- Force block pressure to exercise preemption.
- Compare useful tokens per iteration with scheduled token slots.
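As a sketch of the second point, latencies can be bucketed by request size before taking percentiles. The bucket edges, record layout, and nearest-rank percentile are assumptions made for this example.

```python
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q*n/100
    return ordered[rank - 1]


def p99_by_bucket(records, edges=(128, 1024)):
    """records: iterable of (output_tokens, latency_seconds)."""
    buckets = defaultdict(list)
    for tokens, latency in records:
        label = next((f"<={e}" for e in edges if tokens <= e), f">{edges[-1]}")
        buckets[label].append(latency)
    return {label: percentile(vals, 99) for label, vals in buckets.items()}
```

In a trace dominated by short requests, the aggregate p99 can sit well below the latency that long requests actually experience, which is why the per-bucket view is worth plotting.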
A production failure to design for
A policy always admits short prompts first. Under steady chat traffic, a long document request remains queued for minutes despite available partial capacity. Add aging or deficit-based fairness and alert on oldest-request age.
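The failure and one possible fix can be reproduced in a few lines. In this sketch the tuple layout, the `aging_tokens_per_s` knob, and the example numbers are all assumptions; setting the knob to zero gives the pure shortest-prompt-first policy.

```python
def pick_next(waiting, now, aging_tokens_per_s=0.0):
    """waiting: list of (request_id, prompt_tokens, arrival_time).

    Shortest prompt first, with each second of waiting discounting the
    effective prompt length by `aging_tokens_per_s` tokens.
    """
    def score(item):
        _, prompt_tokens, arrival = item
        return prompt_tokens - aging_tokens_per_s * (now - arrival)

    return min(waiting, key=score)[0]


def oldest_age(waiting, now):
    """Age of the longest-waiting request, suitable for an alert threshold."""
    return max(now - arrival for _, _, arrival in waiting)


waiting = [("doc", 8000, 0.0), ("chat", 200, 118.0)]
print(pick_next(waiting, now=120.0))                            # chat
print(pick_next(waiting, now=120.0, aging_tokens_per_s=100.0))  # doc
print(oldest_age(waiting, now=120.0))                           # 120.0
```

Without aging, the chat request that arrived two seconds ago beats the document that has waited two minutes. With aging, the document's accumulated wait outweighs its length.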
The takeaway
Continuous batching turns generation into a living schedule. Its value comes from dense GPU work; its quality comes from fair admission and honest memory accounting.
