Continuous Batching: The GPU Schedule That Never Stands Still

#continuous-batching #iteration-level-scheduling #llm-serving #scheduler #vllm

A static batch waits for its longest sequence. A continuous batch changes membership after every decoding iteration: finished requests leave, waiting requests enter, and the GPU keeps working on the largest useful set of active sequences.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A shuttle that will not take new passengers until everyone on board has finished their entire vacation is inefficient. Continuous batching behaves like a metro: riders leave at each stop and new riders board, while the train keeps moving.

What actually happens

Iteration-level scheduling runs one model iteration for the current active set, updates each sequence, retires completed or cancelled requests, and admits new work before the next iteration. Request lifetimes no longer define fixed batch boundaries.

The scheduler budgets tokens and KV blocks, not merely request count. A long prompt can consume far more work than one decode token, and a nearly full KV pool may prevent admission even when compute appears idle.
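A minimal sketch of an admission check that budgets both tokens and KV blocks. The block size of 16 tokens and the names `Budget`, `blocks_needed`, and `can_admit` are assumptions made for illustration, not any engine's actual API.

```python
import math
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


@dataclass
class Budget:
    tokens_left: int     # token slots still free in this iteration
    kv_blocks_free: int  # unallocated KV blocks in the pool


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """KV blocks required to hold num_tokens of context."""
    return math.ceil(num_tokens / block_size)


def can_admit(prompt_tokens: int, budget: Budget) -> bool:
    """Admit only if both the token budget and the KV pool can absorb the prompt."""
    return (prompt_tokens <= budget.tokens_left
            and blocks_needed(prompt_tokens) <= budget.kv_blocks_free)


# Plenty of token slots, but the KV pool is nearly full: 512 tokens need 32 blocks.
print(can_admit(512, Budget(tokens_left=2048, kv_blocks_free=8)))   # False
print(can_admit(512, Budget(tokens_left=2048, kv_blocks_free=64)))  # True
```

The first call is the case described above: compute has room, yet admission is refused because memory does not.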

Preemption becomes possible when high-priority work arrives or memory is exhausted. The engine may swap KV state, recompute evicted prefixes, or pause lower-priority sequences. Each choice changes latency and bandwidth.
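To make the swap-versus-recompute trade-off concrete, here is a rough back-of-the-envelope sketch. The model shape, the 25 GB/s host link, and the 10,000 tokens/s prefill rate are all hypothetical numbers chosen for illustration; measure your own before drawing conclusions.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Keys and values (the factor 2) for every layer and KV head.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def swap_round_trip_s(context_tokens, bytes_per_token, link_bytes_per_s):
    # Copy the KV state out to host memory now, and back in before resuming.
    return 2 * context_tokens * bytes_per_token / link_bytes_per_s


def recompute_s(context_tokens, prefill_tokens_per_s):
    # Drop the KV state and rebuild it with a fresh prefill on resume.
    return context_tokens / prefill_tokens_per_s


# Hypothetical model shape and hardware rates (assumptions, not measurements).
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
swap = swap_round_trip_s(4000, per_token, link_bytes_per_s=25e9)
redo = recompute_s(4000, prefill_tokens_per_s=10_000)

print(per_token)                                     # 131072
print(f"swap {swap:.3f} s, recompute {redo:.3f} s")  # swap 0.042 s, recompute 0.400 s
```

Under these assumed numbers swapping is cheaper in time, but it spends host-link bandwidth, while recomputation spends GPU compute that other sequences could have used. Different hardware or context lengths can flip the comparison.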

A worked example

Four sequences start together with output lengths 8, 20, 60, and 100. A static batch carries empty slots after the short sequences finish. Continuous batching replaces each completed sequence with a waiting request at the next iteration, keeping the effective batch dense.
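The arithmetic behind this example can be checked directly. The sketch below only counts slot-iterations for the static case; how many of the idle slots continuous batching actually refills depends on how much work is waiting.

```python
lengths = [8, 20, 60, 100]  # output lengths from the example

# Static batching: all four slots are held until the longest sequence finishes.
static_iterations = max(lengths)
slot_iterations = static_iterations * len(lengths)
useful = sum(lengths)
idle = slot_iterations - useful
static_occupancy = useful / slot_iterations

print(static_iterations)          # 100
print(slot_iterations)            # 400
print(useful)                     # 188
print(idle)                       # 212
print(f"{static_occupancy:.0%}")  # 47%
```

More than half of the static batch's slot-iterations are empty. Those 212 idle slot-iterations are the capacity continuous batching can hand to waiting requests, starting at iterations 8, 20, and 60.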

The performance model

Throughput improves by reducing idle batch slots. Tail latency depends on admission, fairness, maximum active tokens, and preemption. Without an explicit fairness policy, a throughput-maximizing scheduler can starve large or low-priority requests.

Expert lens

Token budgets are usually more stable than sequence-count limits. Decode cost scales with active sequences and context lengths, while prefill cost scales with prompt tokens. Modern schedulers often maintain separate budgets or chunk prefills to fit decode work around them.
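One way to picture chunked prefill under a shared token budget is the sketch below: decode tokens are scheduled first, and whatever budget is left goes to prefill chunks. The function name `plan_iteration`, the FIFO order, and the 1024-token budget are assumptions for illustration; real engines differ in ordering and in how they split budgets.

```python
from collections import deque


def plan_iteration(num_decoding, waiting_prefill, token_budget):
    """Return (decode_tokens, prefill_chunks) for one iteration.

    waiting_prefill is a deque of [request_id, prompt_tokens_remaining]
    entries and is mutated in place as chunks are scheduled.
    """
    # Each decoding sequence contributes one token this iteration.
    decode_tokens = min(num_decoding, token_budget)
    remaining = token_budget - decode_tokens
    chunks = []
    while remaining > 0 and waiting_prefill:
        req = waiting_prefill[0]
        chunk = min(req[1], remaining)  # take only what fits
        chunks.append((req[0], chunk))
        req[1] -= chunk
        remaining -= chunk
        if req[1] == 0:
            waiting_prefill.popleft()   # prompt fully prefilled
    return decode_tokens, chunks


queue = deque([["long-doc", 3000], ["chat", 40]])
for _ in range(4):
    print(plan_iteration(24, queue, 1024))
# (24, [('long-doc', 1000)])
# (24, [('long-doc', 1000)])
# (24, [('long-doc', 1000)])
# (24, [('chat', 40)])
```

The 24 decoding sequences make progress on every iteration while the 3000-token prompt is spread over three. Note that plain FIFO still makes the short chat prompt wait behind the long one, which is a fairness question rather than a budgeting one.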

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

Online generation with varied output lengths
Steady concurrent traffic
Paged KV systems that support flexible admission

Where it disappoints

Limiting only by request count
Ignoring fairness under short-job traffic
Preempting without accounting for recompute cost
Mixing long prefills into decode iterations blindly

Production checklist

Set token and KV admission budgets
Define fairness and priority policy
Bound preemption and recomputation
Propagate cancellation before the next iteration
Stress mixed prompt and output lengths

What to measure

Active sequences and tokens per iteration
Admission wait and starvation age
Preemptions, swaps, and recomputed tokens
Batch occupancy over time
Time per output token (TPOT) by priority class

From one GPU to a production service

A single engine’s scheduler can order work by local queue age. A fleet must also route new requests among replicas. The global router should consider queue depth, KV locality, model capability, and admission signals without attempting to micromanage token iterations.

Autoscaling has delayed effects. A new replica must load weights and warm kernels before it can drain the queue. Meanwhile, aggressively routing to the least-loaded replica, which is often the cold one, may worsen latency. Publish readiness and capacity separately.

Fairness spans retries. If a preempted request returns to the head of the queue indefinitely, it can dominate capacity; if it returns to the tail, it may starve. Preserve service attained and age across preemption.
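A small sketch of what preserving service attained and age can look like. The request keeps its original arrival time and a running count of tokens already served, and the queue key combines both. The linear penalty and its weight are assumptions chosen for illustration, not a recommended policy.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Request:
    name: str                # assumed unique; used as a tie-breaker
    arrival: float           # original arrival time in seconds, never reset
    service_tokens: int = 0  # tokens already processed for this request


PENALTY_S_PER_TOKEN = 0.01  # assumed weight: 100 served tokens cost 1 s of queue age


def queue_key(req):
    """Lower key is scheduled first: older requests win, served work counts against."""
    return req.arrival + PENALTY_S_PER_TOKEN * req.service_tokens


def enqueue(heap, req):
    heapq.heappush(heap, (queue_key(req), req.name, req))


heap = []
enqueue(heap, Request("fresh-a", arrival=10.0))
enqueue(heap, Request("fresh-b", arrival=14.0))
# Preempted after 700 tokens: it keeps arrival=5.0 and its service record.
enqueue(heap, Request("preempted", arrival=5.0, service_tokens=700))

order = [heapq.heappop(heap)[2].name for _ in range(3)]
print(order)  # ['fresh-a', 'preempted', 'fresh-b']
```

The preempted request lands neither at the head nor at the tail. Its age still counts, but so does the capacity it has already consumed.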

Design-review questions

Which decisions belong to global routing versus local scheduling?
How is fairness preserved across preemption?
What capacity signal excludes cold replicas?
Can adapters and sampling modes share iterations?
Which overload limit rejects before KV exhaustion?

How it connects to the rest of the series

Dynamic batching groups arrivals before an execution. Continuous batching changes membership during generation. PagedAttention makes the KV allocation flexible enough for this scheduler.

From equation to implementation

At each iteration the scheduler solves a constrained packing problem: maximize useful token work subject to KV blocks, token budget, sequence count, adapter compatibility, and deadlines. Exact optimization is too expensive, so engines use greedy priority and admission heuristics.
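A greedy heuristic of this kind might look like the sketch below. It covers only three of the constraints (token budget, KV blocks, sequence count) and leaves out adapter compatibility and deadlines. The `Candidate` fields and the priority ordering are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    priority: int   # higher runs first
    tokens: int     # tokens this request would run this iteration
    kv_blocks: int  # new KV blocks it would need


def greedy_pack(candidates, token_budget, kv_blocks_free, max_seqs):
    """Pick candidates in priority order, skipping any that no longer fit."""
    chosen = []
    # sorted() is stable, so equal priorities keep their incoming order.
    for c in sorted(candidates, key=lambda c: -c.priority):
        if len(chosen) == max_seqs:
            break
        if c.tokens <= token_budget and c.kv_blocks <= kv_blocks_free:
            chosen.append(c.name)
            token_budget -= c.tokens
            kv_blocks_free -= c.kv_blocks
    return chosen


cands = [
    Candidate("A", priority=3, tokens=1, kv_blocks=0),     # a decode step
    Candidate("B", priority=2, tokens=600, kv_blocks=38),  # a prefill
    Candidate("C", priority=2, tokens=500, kv_blocks=32),  # a prefill
    Candidate("D", priority=1, tokens=100, kv_blocks=7),   # a small prefill
]
print(greedy_pack(cands, token_budget=1024, kv_blocks_free=64, max_seqs=8))
# ['A', 'B', 'D']
```

C is skipped because only 423 tokens remain after A and B, and the lower-priority D slips in behind it. Skipping keeps the batch dense, but repeated skipping is exactly how a request like C ends up starved.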

Fairness can be expressed through virtual time, age, deficit tokens, or priority deadlines. Shortest-job-like policies improve mean latency but can starve long generations. Production schedulers need a measurable fairness contract.

Implementation sketch

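while running:
    # Drop sequences that finished or were cancelled, then return their KV blocks.
    retire_finished_and_cancelled(active)
    reclaim_kv_blocks()
    # Refresh fairness inputs before choosing what to admit.
    waiting.update_age_and_deadlines()
    budget = iteration_token_budget
    # Decode work first, then fill the leftover budget with prefill chunks.
    admit_decode_tokens(active, budget)
    admit_prefill_chunks(waiting, remaining(budget))
    # Reserve KV for everything scheduled before running the step.
    reserve_all_required_kv()
    outputs = engine.step(active_batch)
    update_sequence_state(outputs)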

Capacity planning

Maximum active sequences is less useful than maximum scheduled tokens and maximum KV blocks. Reserve capacity for one iteration of growth, speculative branches if enabled, and emergency admission for high-priority traffic.
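As a rough illustration of the reserve, the sketch below computes a ceiling on allocated KV blocks. It assumes the worst case where every active sequence, and every speculative branch, needs one new block on the next iteration. The function name and the numbers are hypothetical.

```python
def allocation_ceiling(total_blocks, active_seqs, branches_per_seq=1,
                       emergency_blocks=0):
    """KV blocks that may be allocated while keeping the reserves intact."""
    # Assumed worst case: each sequence (and each speculative branch) crosses
    # a block boundary on the next iteration and needs one new block.
    growth_reserve = active_seqs * branches_per_seq
    return max(0, total_blocks - growth_reserve - emergency_blocks)


print(allocation_ceiling(4096, active_seqs=64, emergency_blocks=128))  # 3904
print(allocation_ceiling(4096, active_seqs=64, branches_per_seq=2,
                         emergency_blocks=128))                        # 3840
```

Admission would stop once allocated blocks reach this ceiling, instead of waiting for the pool to run dry mid-iteration.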

Benchmarking without fooling yourself

Use a trace with mixed arrivals and output lengths.
Plot latency by request size and priority, not only aggregate p99.
Force block pressure to exercise preemption.
Compare useful tokens per iteration with scheduled token slots.
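The last comparison reduces to a single ratio over a run. A minimal sketch, assuming the benchmark harness can log a hypothetical `(useful_tokens, scheduled_slots)` pair per iteration:

```python
def slot_efficiency(iterations):
    """iterations: list of (useful_tokens, scheduled_slots), one per iteration."""
    useful = sum(u for u, _ in iterations)
    scheduled = sum(s for _, s in iterations)
    return useful / scheduled if scheduled else 0.0


log = [(900, 1024), (1024, 1024), (300, 1024)]
print(f"{slot_efficiency(log):.2f}")  # 0.72
```

A ratio well below 1.0 under load suggests padding, stalled admission, or KV pressure rather than a lack of traffic.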

A production failure to design for

A policy always admits short prompts first. Under steady chat traffic, a long document request remains queued for minutes despite available partial capacity. Add aging or deficit-based fairness and alert on oldest-request age.
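A sketch of the aging fix for this failure: shortest-prompt-first, but every second of waiting shrinks a request's effective length. The aging rate of 50 tokens per second and the tuple layout are assumptions for illustration.

```python
def effective_length(prompt_tokens, wait_s, aging_tokens_per_s=50.0):
    """Shortest-prompt-first key with aging: waiting shrinks the effective length."""
    return prompt_tokens - aging_tokens_per_s * wait_s


def pick_next(queue, now):
    """queue: list of (name, prompt_tokens, arrival_s). Returns the name to admit."""
    return min(queue, key=lambda r: effective_length(r[1], now - r[2]))[0]


def oldest_age(queue, now):
    """Age in seconds of the longest-waiting request; a natural alerting signal."""
    return max(now - arrival for _, _, arrival in queue)


# A chat arriving at t=100 s still beats the long document that arrived at t=0.
early = [("long-doc", 8000, 0.0), ("chat", 200, 100.0)]
print(pick_next(early, now=101.0))   # chat

# A chat arriving at t=160 s does not: the document has aged past it.
late = [("long-doc", 8000, 0.0), ("chat", 200, 160.0)]
print(pick_next(late, now=161.0))    # long-doc
print(oldest_age(late, now=161.0))   # 161.0
```

With these numbers the document can be overtaken only by chats that arrive within its first 156 seconds of waiting, so its delay is bounded instead of open-ended.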

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Continuous batching turns generation into a living schedule. Its value comes from dense GPU work; its quality comes from fair admission and honest memory accounting.
