Dynamic Batching: Waiting Microseconds to Save Milliseconds
Sending every request to a GPU immediately feels fast, but tiny matrix operations often waste most of the accelerator. Dynamic batching deliberately waits for a small window so compatible arrivals can share one larger execution.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
An elevator can close the instant one passenger enters or wait two seconds for three people walking toward it. The small delay may reduce total travel and energy, but waiting too long ruins the service.
What actually happens
A dynamic batcher maintains a queue per model or execution profile. When an instance becomes available, it forms a batch from compatible requests, bounded by maximum batch size, preferred sizes, queue delay, priority, and timeout policies.
Compatibility includes tensor shapes, dtypes, model version, adapters, and request options. Ragged batching or padding can combine variable lengths, but padded tokens are real work unless the backend supports packed representations.
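One way to make compatibility concrete is to reduce each request to a key and only batch requests whose keys match. The sketch below is illustrative, not a particular server's API: the `Request` fields, the `compatibility_key` function, and the 128-token bucket width are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Key = Tuple[str, str, str, int]


@dataclass(frozen=True)
class Request:
    # Hypothetical request record; real servers carry more fields.
    request_id: str
    model_version: str
    adapter: str
    dtype: str
    seq_len: int


def compatibility_key(req: Request, bucket_width: int = 128) -> Key:
    """Requests with equal keys may share a batch (assumed policy)."""
    # Round the length up to the next bucket boundary, so padding inside
    # one bucket is at most bucket_width - 1 tokens per request.
    length_bucket = -(-req.seq_len // bucket_width) * bucket_width
    return (req.model_version, req.adapter, req.dtype, length_bucket)


def group_compatible(requests: List[Request]) -> Dict[Key, List[Request]]:
    """Partition requests into groups that could be batched together."""
    groups: Dict[Key, List[Request]] = defaultdict(list)
    for req in requests:
        groups[compatibility_key(req)].append(req)
    return dict(groups)
```

Length bucketing is one design choice among several. A backend with packed representations could drop the length component from the key and bound the total token count per batch instead.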
Queue delay is a latency investment. At high arrival rates the batch fills immediately. At low rates, the configured maximum delay becomes visible. Different service classes may therefore need different queues or delay budgets.
A worked example
Eight requests arrive within 80 microseconds. Running eight batch-one executions costs eight launches and underfills the GPU. Waiting up to 100 microseconds may produce one batch of eight whose execution is far shorter than the sum of eight separate calls. That improves throughput and the average completion time, although the earliest arrival may finish slightly later than it would have alone.
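The arithmetic behind this example can be checked with a few lines of Python. The execution costs below are assumptions picked for illustration (200 microseconds for a batch-one call, 300 microseconds for a batch of eight), not measurements. The helper names are hypothetical, and a single instance that serves requests in arrival order is assumed.

```python
from typing import List, Tuple


def serial_completion_times(arrivals_us: List[float], exec_us: float) -> List[float]:
    """Completion time of each request when one instance runs them one by one."""
    finish = 0.0
    done = []
    for arrival in arrivals_us:
        # A request starts once it has arrived and the instance is free.
        finish = max(finish, arrival) + exec_us
        done.append(finish)
    return done


def batched_completion_time(
    arrivals_us: List[float], max_batch: int, max_delay_us: float, batch_exec_us: float
) -> Tuple[float, int]:
    """Completion time and size of the first batch under a size-or-delay policy."""
    deadline = arrivals_us[0] + max_delay_us
    in_batch = [t for t in arrivals_us[:max_batch] if t <= deadline]
    # Dispatch as soon as the batch is full, otherwise when the delay expires.
    dispatch_at = in_batch[-1] if len(in_batch) == max_batch else deadline
    return dispatch_at + batch_exec_us, len(in_batch)


arrivals = [10.0 * i for i in range(8)]  # 0, 10, ..., 70 microseconds
serial = serial_completion_times(arrivals, exec_us=200.0)  # assumed cost
batched, size = batched_completion_time(arrivals, 8, 100.0, batch_exec_us=300.0)
print(serial[-1], batched, size)  # 1600.0 370.0 8
```

Under these assumed costs the last request completes at 370 microseconds instead of 1600, while the first request completes at 370 instead of 200. The batch wins on throughput and on average, and the earliest arrival pays a small price.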
The performance model
Total latency equals queue wait plus batched execution plus response split. Throughput usually rises with batch size until memory or compute saturation. Tail latency can worsen if the scheduler chases a preferred batch that traffic cannot fill.
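Written out, the model looks like the following. The affine execution cost is an assumption added here to make the saturation behaviour visible. Real engines often show steps and cliffs instead of a straight line, so treat alpha (fixed per-launch overhead) and beta (marginal cost per request) as fitted values, not constants.

```latex
\begin{align}
T_{\text{total}} &= T_{\text{queue}} + T_{\text{exec}}(b) + T_{\text{split}} \\
T_{\text{exec}}(b) &\approx \alpha + \beta b \quad \text{(assumed affine cost)} \\
X(b) &= \frac{b}{\alpha + \beta b} \;\longrightarrow\; \frac{1}{\beta} \quad \text{as } b \to \infty
\end{align}
```

While alpha dominates, doubling the batch size nearly doubles throughput X(b). Once beta times b dominates, extra batching adds queue wait without a matching throughput gain.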
Expert lens
Preferred batch sizes should correspond to measured engine performance cliffs, not round numbers. If TensorRT profiles or kernels are equally efficient across sizes, forcing a preferred size can add needless delay.
Where it wins
- Stateless inference with bursty compatible traffic
- Embedding, reranking, and fixed-shape models
- Online systems with modest queue-delay budgets
Where it disappoints
- Setting delay without an SLO budget
- Padding long and short requests together
- Forcing preferred sizes that never arrive
- Mixing priority classes in one FIFO queue
Production checklist
- Benchmark latency across arrival rates
- Set maximum delay from the tail SLO
- Bucket or pack variable lengths
- Configure queue timeout and rejection behavior
- Test priority fairness and starvation
What to measure
- Batch-size histogram
- Queue delay percentiles
- Padding or ragged-token ratio
- GPU utilization and launch count
- Timeouts, drops, and priority wait
From one GPU to a production service
A single-model server can use one queue. A multi-tenant gateway needs queues by model, adapter, priority, and sometimes shape profile. Global admission should prevent one model’s backlog from consuming all host memory before per-model batching begins.
Deadlines should move with the request. The gateway can pass an absolute deadline; the batcher subtracts time already spent in authentication, routing, and network transit. A fresh local delay budget would otherwise violate the end-to-end SLO.
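A minimal sketch of that subtraction follows. It assumes the gateway and the batcher share a clock (or that the deadline has already been translated to the local clock), and that the batcher has an estimate of execution time. The function and parameter names are hypothetical.

```python
def remaining_queue_budget_s(
    absolute_deadline_s: float, now_s: float, expected_exec_s: float
) -> float:
    """Time the request can still spend queued before it must be dispatched."""
    # Time already spent upstream is accounted for implicitly, because the
    # deadline is absolute and now_s is read at the batcher.
    return max(0.0, absolute_deadline_s - now_s - expected_exec_s)


def effective_queue_delay_s(
    absolute_deadline_s: float,
    now_s: float,
    expected_exec_s: float,
    configured_max_delay_s: float,
) -> float:
    """Never wait longer than the configured delay or the request's own budget."""
    budget = remaining_queue_budget_s(absolute_deadline_s, now_s, expected_exec_s)
    return min(configured_max_delay_s, budget)
```

A result of zero means the request should be dispatched immediately or rejected, depending on whether it can still finish before its deadline.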
Autoscaling interacts with batch size. Adding replicas reduces queue depth and may shrink batches, so aggregate throughput can improve less than expected. Scale on SLO and goodput, not queue length alone.
Design-review questions
- Which fields make requests batch-compatible?
- Is queue delay budget end-to-end or local?
- Can one priority class starve another?
- Does autoscaling destroy efficient batch sizes?
- What overload response prevents unbounded waiting?
How it connects to the rest of the series
Continuous batching operates at token-iteration granularity for generative models. Batch inference is a durable offline workflow. Chunked prefill changes which token work can join a live batch.
From equation to implementation
At arrival rate lambda requests per second, a rough expected time to collect b requests, measured from the first arrival, is (b - 1)/lambda when arrivals are Poisson and no requests are already queued. This explains why preferred batch sizes fill effortlessly at high load and cause visible delay at low load.
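The estimate is easy to turn into a planning helper. The sketch below only uses the expected value, so it says nothing about variance. The function names are hypothetical, and an empty queue at the first arrival is assumed.

```python
def expected_collection_time_s(batch_size: int, arrival_rate_per_s: float) -> float:
    """Expected wait from the first arrival until batch_size requests are queued."""
    if arrival_rate_per_s <= 0:
        raise ValueError("arrival rate must be positive")
    # The first request is already present; b - 1 more must arrive,
    # each after a mean gap of 1 / lambda.
    return (batch_size - 1) / arrival_rate_per_s


def largest_fillable_batch(
    arrival_rate_per_s: float, max_delay_s: float, max_batch: int
) -> int:
    """Largest batch whose expected collection time fits in the delay window."""
    fillable = int(arrival_rate_per_s * max_delay_s) + 1
    return max(1, min(max_batch, fillable))
```

For example, at 100 requests per second a batch of 32 takes 0.31 seconds to collect on average, far longer than any millisecond-scale delay window, so a scheduler that insists on 32 would simply wait out the full window every time.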
A scheduler should dispatch when any of three conditions is met: an efficient batch is ready, the oldest request approaches its delay budget, or an instance would otherwise idle beyond the expected gain. Static delay alone cannot react to changing traffic.
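The three conditions can be expressed as one small predicate. This is a sketch under simplifying assumptions: "efficient" is reduced to a single size threshold, and the expected gain from waiting is reduced to a fixed idle limit. The `QueueSnapshot` type and all parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueSnapshot:
    queued: int             # compatible requests currently waiting
    oldest_wait_s: float    # how long the oldest request has waited
    delay_budget_s: float   # queue-delay budget of the oldest request
    instance_idle_s: float  # how long the instance has been idle


def dispatch_reason(
    snap: QueueSnapshot,
    efficient_batch_size: int,
    safety_margin_s: float,
    max_idle_s: float,
) -> Optional[str]:
    """Return why a batch should be dispatched now, or None to keep waiting."""
    if snap.queued == 0:
        return None  # nothing to send
    if snap.queued >= efficient_batch_size:
        return "efficient"  # an efficient batch is ready
    if snap.oldest_wait_s + safety_margin_s >= snap.delay_budget_s:
        return "deadline"  # the oldest request is close to its budget
    if snap.instance_idle_s >= max_idle_s:
        return "idle"  # idling now costs more than waiting is assumed to gain
    return None
```

Returning the reason instead of a bare boolean makes it easy to count how often each condition fires, which shows whether the preferred size or the deadline is actually driving dispatch.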
Implementation sketch
on_instance_ready():
    candidates = queue.compatible_with(instance.profile)
    batch = pack_by_shape_and_token_budget(candidates)
    if batch.is_efficient() or oldest(batch).near_deadline():
        dispatch(batch)
    elif queue.empty():
        idle_briefly()
    else:
        arm_timer(min(queue_deadline, collection_window))
Capacity planning
Queue capacity must be derived from acceptable waiting time and service rate. An unbounded batch queue converts overload into latency and memory growth. Use per-priority limits, deadlines, and explicit rejection before the system enters a retry storm.
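A rough bound follows from the observation that a request admitted behind N others waits about N divided by the service rate. The sketch below derives a capacity from that relation and rejects explicitly once it is reached. The class and function names are hypothetical, and a steady service rate is assumed.

```python
import math
from collections import deque
from typing import Any, Deque


def queue_capacity(service_rate_per_s: float, max_wait_s: float) -> int:
    """Largest queue depth whose drain time stays within the acceptable wait."""
    return max(1, math.floor(service_rate_per_s * max_wait_s))


class BoundedQueue:
    """FIFO queue that refuses new work instead of growing without limit."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: Deque[Any] = deque()

    def offer(self, item: Any) -> bool:
        """Enqueue and return True, or return False so the caller can reject."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(item)
        return True

    def __len__(self) -> int:
        return len(self._items)
```

With one such queue per priority class, overload in a low-priority class produces fast rejections there without lengthening the wait for higher-priority traffic.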
Benchmarking without fooling yourself
- Replay sparse, steady, bursty, and overload arrival patterns.
- Report latency versus offered load, not one concurrency point.
- Measure batch efficiency and queue wait separately.
- Test incompatible shapes and priority classes.
A production failure to design for
A preferred batch of 32 works in load tests. Overnight traffic falls, so requests wait the full 20 ms collection window even though a batch of 4 would meet throughput needs. Make preferred sizes opportunistic and dispatch on deadlines.
The takeaway
Dynamic batching makes waiting productive. The correct delay is the smallest one that buys a materially better execution shape.
