Dynamic Batching: Waiting Microseconds to Save Milliseconds

#dynamic-batching #scheduler #triton #latency #throughput

Sending every request to a GPU immediately feels fast, but small batch-one operations often leave most of the accelerator idle. Dynamic batching deliberately waits for a small window so compatible arrivals can share one larger execution.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

An elevator can close the instant one passenger enters or wait two seconds for three people walking toward it. The small delay may reduce total travel and energy, but waiting too long ruins the service.


What actually happens

A dynamic batcher maintains a queue per model or execution profile. When an instance becomes available, it forms a batch from compatible requests, bounded by maximum batch size, preferred sizes, queue delay, priority, and timeout policies.

Compatibility includes tensor shapes, dtypes, model version, adapters, and request options. Ragged batching or padding can combine variable lengths, but padded tokens are real work unless the backend supports packed representations.
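
The compatibility check and the cost of padding can be sketched in a few lines of Python. This is a minimal illustration, not any server's API: the `Request` fields, `compatibility_key`, and `padding_ratio` are hypothetical names, and the key assumes that model version, dtype, and adapter are the only fields that must match.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical request fields chosen for illustration only.
    model_version: str
    dtype: str
    adapter: str
    seq_len: int


def compatibility_key(req: Request) -> tuple:
    # Requests may share a batch only when every field in the key matches.
    return (req.model_version, req.dtype, req.adapter)


def group_compatible(requests):
    # Partition pending requests into candidate batches by compatibility key.
    groups = defaultdict(list)
    for req in requests:
        groups[compatibility_key(req)].append(req)
    return dict(groups)


def padding_ratio(batch) -> float:
    # Fraction of token slots that are padding when every sequence is
    # padded to the longest sequence in the batch.
    longest = max(r.seq_len for r in batch)
    total_slots = longest * len(batch)
    real_tokens = sum(r.seq_len for r in batch)
    return (total_slots - real_tokens) / total_slots
```

Padding a 16-token request next to a 512-token request wastes almost half of the slots in that two-request batch, which is why length bucketing or packing usually belongs in front of the batcher.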

Queue delay is a latency investment. At high arrival rates the batch fills immediately and the delay is rarely paid. At low rates the batch seldom fills, so requests wait the full configured maximum delay and it shows up directly in latency. Different service classes may therefore need different queues or delay budgets.

A worked example

Eight requests arrive within 80 microseconds. Running eight batch-one executions costs eight launches and underfills the GPU each time. Waiting up to 100 microseconds may produce one batch of eight whose execution is far shorter than the sum of eight separate calls. The first request starts later than it would alone, but the later ones no longer queue behind earlier launches, so throughput and average completion time both improve.
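
To put numbers on the example, the sketch below compares both strategies on a single instance. The execution times (200 microseconds for batch one, 300 microseconds for batch eight) and the evenly spaced arrivals are assumptions made up for illustration, not measurements, and the function names are hypothetical.

```python
BATCH1_EXEC_US = 200.0  # assumed batch-one execution time
BATCH8_EXEC_US = 300.0  # assumed batch-eight execution time

# Eight arrivals, 10 microseconds apart, all inside an 80 microsecond span.
arrivals_us = [10.0 * i for i in range(8)]


def serial_latencies(arrivals, exec_us):
    # One instance runs each request alone, in arrival order.
    latencies, free_at = [], 0.0
    for t in arrivals:
        start = max(t, free_at)
        free_at = start + exec_us
        latencies.append(free_at - t)
    return latencies


def batched_latencies(arrivals, exec_us, max_delay_us, max_batch):
    # Form one batch: dispatch when it is full, or when the oldest
    # request has waited max_delay_us, whichever comes first.
    deadline = arrivals[0] + max_delay_us
    members = [t for t in arrivals[:max_batch] if t <= deadline]
    full = len(members) == max_batch
    dispatch_at = members[-1] if full else deadline
    done = dispatch_at + exec_us
    return [done - t for t in members]


serial = serial_latencies(arrivals_us, BATCH1_EXEC_US)
batched = batched_latencies(arrivals_us, BATCH8_EXEC_US, 100.0, 8)
mean_serial = sum(serial) / len(serial)     # 865.0 microseconds
mean_batched = sum(batched) / len(batched)  # 335.0 microseconds
```

Under these assumptions the first request is slower when batched (370 versus 200 microseconds), while the last one improves from 1530 to 300 microseconds. Whether the trade pays off on real hardware depends on the measured batch-one and batch-eight execution times.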

The performance model

Total latency equals queue wait plus batched execution plus response split. Throughput usually rises with batch size until memory or compute saturation. Tail latency can worsen if the scheduler chases a preferred batch that traffic cannot fill.
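
The latency sum and the throughput trend can be written down directly. The sketch below assumes an affine execution cost (a fixed launch overhead plus a per-item cost); the constants and function names are placeholders to be replaced with measured values, and the model deliberately ignores the memory or compute saturation mentioned above.

```python
def exec_time_us(batch_size, fixed_us=150.0, per_item_us=20.0):
    # Assumed affine cost model: fixed overhead plus per-item work.
    return fixed_us + per_item_us * batch_size


def total_latency_us(queue_wait_us, batch_size, split_us=5.0):
    # Queue wait + batched execution + response split.
    return queue_wait_us + exec_time_us(batch_size) + split_us


def throughput_rps(batch_size):
    # Requests completed per second by one instance running back to back.
    return batch_size / exec_time_us(batch_size) * 1_000_000
```

With these placeholder constants, `throughput_rps(8)` is more than four times `throughput_rps(1)`, while `total_latency_us(100.0, 8)` shows what a request pays for that gain when it waits the full 100 microseconds.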

Expert lens

Preferred batch sizes should correspond to measured engine performance cliffs, not round numbers. If TensorRT profiles or kernels are equally efficient across sizes, forcing a preferred size can add needless delay.

Dynamic batching does not remove cost; it moves it. Launch overhead and idle accelerator capacity shrink, while queue wait and, with padding, wasted compute and memory grow.

Where it wins

Stateless inference with bursty compatible traffic
Embedding, reranking, and fixed-shape models
Online systems with modest queue-delay budgets

Where it disappoints

Setting delay without an SLO budget
Padding long and short requests together
Forcing preferred sizes that never arrive
Mixing priority classes in one FIFO queue

Production checklist

Benchmark latency across arrival rates
Set maximum delay from the tail SLO
Bucket or pack variable lengths
Configure queue timeout and rejection behavior
Test priority fairness and starvation

What to measure

Batch-size histogram
Queue delay percentiles
Padding or ragged-token ratio
GPU utilization and launch count
Timeouts, drops, and priority wait

From one GPU to a production service

A single-model server can use one queue. A multi-tenant gateway needs queues by model, adapter, priority, and sometimes shape profile. Global admission should prevent one model’s backlog from consuming all host memory before per-model batching begins.

Deadlines should move with the request. The gateway can pass an absolute deadline; the batcher subtracts time already spent in authentication, routing, and network transit. A fresh local delay budget would otherwise violate the end-to-end SLO.
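
A minimal sketch of that subtraction follows. It assumes the gateway and the batcher share a synchronized clock and that the request carries its deadline as an absolute timestamp in seconds; `local_delay_budget_s` and its parameters are hypothetical names.

```python
import time


def local_delay_budget_s(deadline_s, expected_exec_s, max_local_delay_s, now_s=None):
    """How long the batcher may hold a request that carries an absolute deadline."""
    now_s = time.time() if now_s is None else now_s
    # Time left after reserving room for the batched execution itself.
    remaining = deadline_s - now_s - expected_exec_s
    # Never exceed the configured local delay, never go negative.
    return max(0.0, min(max_local_delay_s, remaining))
```

A request that arrives with plenty of slack gets the full configured delay. One that already spent most of its budget upstream gets a shorter window, or zero, in which case it should be dispatched immediately or rejected.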

Autoscaling interacts with batch size. Adding replicas reduces queue depth and may shrink batches, so aggregate throughput can improve less than expected. Scale on SLO and goodput, not queue length alone.

Design-review questions

Which fields make requests batch-compatible?
Is queue delay budget end-to-end or local?
Can one priority class starve another?
Does autoscaling destroy efficient batch sizes?
What overload response prevents unbounded waiting?

How it connects to the rest of the series

Continuous batching operates at token-iteration granularity for generative models. Batch inference is a durable offline workflow. Chunked prefill changes which token work can join a live batch.

From equation to implementation

At arrival rate lambda requests per second, a rough expected time to collect b requests, measured from the first arrival, is (b - 1)/lambda when arrivals are Poisson and before considering an already queued population. This explains why preferred batch sizes fill effortlessly at high load and cause visible delay at low load.
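
The formula is easy to evaluate for a candidate configuration. The sketch below assumes Poisson arrivals and an empty queue, as above; the function names and the example rates are illustrative.

```python
def expected_fill_time_s(batch_size, arrival_rate_rps):
    # Expected time from the first arrival until the batch holds
    # batch_size requests: (b - 1) / lambda.
    return (batch_size - 1) / arrival_rate_rps


def expected_batch_in_window(arrival_rate_rps, window_us):
    # The first request plus the expected number of Poisson arrivals
    # during the collection window that follows it.
    return 1 + arrival_rate_rps * window_us / 1_000_000


fill_busy_s = expected_fill_time_s(32, 50_000)  # about 0.62 ms at 50,000 req/s
fill_quiet_s = expected_fill_time_s(32, 200)    # about 155 ms at 200 req/s
quiet_batch = expected_batch_in_window(200, 20_000)  # 5.0 requests in 20 ms
```

At 200 requests per second a 20 ms window collects about five requests on average, so a preferred size of 32 would never be reached before the window expires.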

A scheduler should dispatch when any of three conditions is met: an efficient batch is ready, the oldest request approaches its delay budget, or an instance would otherwise idle beyond the expected gain. Static delay alone cannot react to changing traffic.

Implementation sketch

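on_instance_ready():
    candidates = queue.compatible_with(instance.profile)
    if candidates.empty():
        # Nothing this instance can run; checking first also avoids
        # calling oldest() on an empty batch.
        idle_briefly()
        return
    batch = pack_by_shape_and_token_budget(candidates)
    if batch.is_efficient() or oldest(batch).near_deadline():
        dispatch(batch)
    else:
        # Wake at whichever comes first: the oldest request's deadline
        # or the end of the collection window.
        arm_timer(min(queue_deadline, collection_window))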
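
A runnable version of the same decision, reduced to its core, might look like the following. It is a sketch under stated assumptions: the queue is FIFO and already holds only compatible requests, "efficient" means reaching a preferred size, and `Pending`, `decide`, and `guard_us` are hypothetical names rather than part of any real scheduler.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Pending:
    arrival_us: float
    deadline_us: float  # latest time at which dispatch may start


def decide(queue, now_us, preferred_size, max_size, guard_us=10.0):
    """Return ('dispatch', batch), ('idle', None), or ('wait', wake_time_us)."""
    if not queue:
        return ("idle", None)
    oldest = queue[0]  # FIFO assumption: the head has the earliest deadline
    efficient = len(queue) >= preferred_size
    near_deadline = now_us >= oldest.deadline_us - guard_us
    if efficient or near_deadline:
        take = min(max_size, len(queue))
        batch = [queue.popleft() for _ in range(take)]
        return ("dispatch", batch)
    # Otherwise sleep until just before the oldest request's deadline;
    # a new arrival should also trigger a fresh call to decide().
    return ("wait", oldest.deadline_us - guard_us)
```

Returning the wake time instead of arming a timer keeps the function free of side effects other than dequeuing, which makes the policy straightforward to replay against recorded arrival traces.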

Capacity planning

Queue capacity must be derived from acceptable waiting time and service rate. An unbounded batch queue converts overload into latency and memory growth. Use per-priority limits, deadlines, and explicit rejection before the system enters a retry storm.
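
As a rough sizing rule, the deepest queue that can still drain within the acceptable wait is the service rate multiplied by that wait. The sketch below applies the rule and rejects arrivals beyond it; `max_queue_depth` and `BoundedQueue` are hypothetical names, and a production system would keep one such limit per priority class.

```python
from collections import deque


def max_queue_depth(service_rate_rps, max_wait_ms):
    # Requests that can be served within max_wait_ms at the given rate.
    return service_rate_rps * max_wait_ms // 1000


class BoundedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.rejected = 0

    def admit(self, request):
        # Reject explicitly instead of letting the backlog grow without bound.
        if len(self.items) >= self.capacity:
            self.rejected += 1
            return False
        self.items.append(request)
        return True
```

For example, an instance pool that sustains 25,000 requests per second with a 4 ms waiting budget supports a depth of 100. Anything deeper guarantees that the newest request misses its budget even if nothing else goes wrong.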

Benchmarking without fooling yourself

Replay sparse, steady, bursty, and overload arrival patterns.
Report latency versus offered load, not one concurrency point.
Measure batch efficiency and queue wait separately.
Test incompatible shapes and priority classes.

A production failure to design for

A preferred batch of 32 works in load tests. Overnight traffic falls, so requests wait the full 20 ms collection window even though batch 4 would meet throughput needs. Make preferred sizes opportunistic and dispatch on deadlines.

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Dynamic batching makes waiting productive. The correct delay is the smallest one that buys a materially better execution shape.