14/20 - Dynamic Batching: Waiting Microseconds to Save Milliseconds

#dynamic-batching #scheduler #triton #latency #throughput

Sending every request to a GPU immediately feels fast, but tiny matrix operations often waste most of the accelerator. Dynamic batching deliberately waits for a small window so compatible arrivals can share one larger execution.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

An elevator can close the instant one passenger enters or wait two seconds for three people walking toward it. The small delay may reduce total travel and energy, but waiting too long ruins the service.

Follow the state and work from left to right.

Description: Start with Live requests, which enter the model queue. The middle stage, Dynamic batcher, groups compatible shapes. The final stage, GPU execution, shows the observable result: one larger call is run. The arrows describe dependency order, not necessarily separate services.

What actually happens

A dynamic batcher maintains a queue per model or execution profile. When an instance becomes available, it forms a batch from compatible requests, bounded by maximum batch size, preferred sizes, queue delay, priority, and timeout policies.

Compatibility includes tensor shapes, dtypes, model version, adapters, and request options. Ragged batching or padding can combine variable lengths, but padded tokens are real work unless the backend supports packed representations.
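
A minimal sketch of the grouping step, assuming a hypothetical Request record and a 128-token length bucket (neither comes from any particular serving framework). It keys requests on the compatibility fields listed above and reports how much of a padded group is padding rather than real tokens.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical request record; field names are illustrative only.
    request_id: str
    model_version: str
    adapter: str
    dtype: str
    seq_len: int


def bucketed_len(seq_len: int, bucket: int = 128) -> int:
    # Round the length up to the next bucket boundary.
    return -(-seq_len // bucket) * bucket


def compatibility_key(req: Request, bucket: int = 128) -> tuple:
    # Requests may share a batch only if every field in this key matches.
    return (req.model_version, req.adapter, req.dtype, bucketed_len(req.seq_len, bucket))


def group_compatible(requests, bucket: int = 128) -> dict:
    groups = defaultdict(list)
    for req in requests:
        groups[compatibility_key(req, bucket)].append(req)
    return dict(groups)


def padding_ratio(group, bucket: int = 128) -> float:
    # Fraction of computed tokens that are padding when the whole group
    # is padded to the bucket length of its longest member.
    padded_len = bucketed_len(max(r.seq_len for r in group), bucket)
    real_tokens = sum(r.seq_len for r in group)
    return 1.0 - real_tokens / (padded_len * len(group))
```

A backend with packed (ragged) representations would skip the padding cost, so the ratio is only meaningful for padded execution.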

Queue delay is a latency investment. At high arrival rates the batch fills immediately. At low rates, the configured maximum delay becomes visible. Different service classes may therefore need different queues or delay budgets.

A worked example

Eight requests arrive within 80 microseconds. Running eight batch-one executions costs eight launches and underfills the GPU. Waiting up to 100 microseconds may produce one batch of eight whose execution is far shorter than the sum of eight separate calls, improving both throughput and completion time.
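
The arithmetic can be sketched with assumed numbers. The article gives no execution times, so the 200 microsecond batch-one cost, the 400 microsecond batch-of-eight cost, and the evenly spaced arrivals below are illustrative assumptions, not measurements. The model is a single instance that either runs requests one at a time or waits for one batch.

```python
ARRIVALS_US = [10 * i for i in range(8)]  # assumed: arrivals at 0, 10, ..., 70 us
EXEC_BATCH1_US = 200                      # assumed batch-one execution time
EXEC_BATCH8_US = 400                      # assumed batch-of-eight execution time
MAX_DELAY_US = 100                        # collection window from the example


def sequential_latencies(arrivals, exec_us):
    # One instance runs requests back to back in arrival order.
    latencies, free_at = [], 0
    for t in arrivals:
        start = max(t, free_at)
        free_at = start + exec_us
        latencies.append(free_at - t)
    return latencies


def batched_latencies(arrivals, exec_us, max_batch, max_delay_us):
    # Form a single batch: dispatch when it is full, or when the first
    # arrival has waited max_delay_us, whichever happens first.
    # exec_us is used as given, even if the batch ends up partially filled.
    deadline = arrivals[0] + max_delay_us
    members = [t for t in arrivals if t <= deadline][:max_batch]
    dispatch_at = members[-1] if len(members) == max_batch else deadline
    done = dispatch_at + exec_us
    return [done - t for t in members]


seq = sequential_latencies(ARRIVALS_US, EXEC_BATCH1_US)
bat = batched_latencies(ARRIVALS_US, EXEC_BATCH8_US, 8, MAX_DELAY_US)
print(sum(seq) / len(seq), max(seq))  # mean and worst latency, immediate dispatch
print(sum(bat) / len(bat), max(bat))  # mean and worst latency, one batch of eight
```

Under these assumptions the mean latency falls from 865 to 435 microseconds and the worst case from 1530 to 470, but the first request goes from 200 to 470 microseconds. Batching improves the aggregate while the earliest arrival pays for the wait.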

The performance model

Total latency equals queue wait plus batched execution plus the time to split the batched output back into per-request responses. Throughput usually rises with batch size until memory or compute saturation. Tail latency can worsen if the scheduler chases a preferred batch that traffic cannot fill.

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how dynamic batching changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Preferred batch sizes should correspond to measured engine performance cliffs, not round numbers. If TensorRT profiles or kernels are equally efficient across sizes, forcing a preferred size can add needless delay.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, Immediate dispatch, characterized by near-zero scheduler wait and small underfilled kernels. The right panel applies Dynamic batching, changing the cost profile to short bounded queue wait and larger efficient kernels. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Stateless inference with bursty compatible traffic
Embedding, reranking, and fixed-shape models
Online systems with modest queue-delay budgets

Where it disappoints

Setting delay without an SLO budget
Padding long and short requests together
Forcing preferred sizes that never arrive
Mixing priority classes in one FIFO queue

Production checklist

Benchmark latency across arrival rates
Set maximum delay from the tail SLO
Bucket or pack variable lengths
Configure queue timeout and rejection behavior
Test priority fairness and starvation

What to measure

Batch-size histogram
Queue delay percentiles
Padding or ragged-token ratio
GPU utilization and launch count
Timeouts, drops, and priority wait
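
A sketch of how the measurements listed above might be computed from per-batch records. The record layout (queue_delays_us, real_tokens, padded_tokens, reason) is a hypothetical schema chosen for illustration, and the percentile uses the simple nearest-rank definition.

```python
from collections import Counter


def percentile(values, pct: int):
    # Nearest-rank percentile with integer arithmetic; assumes values is non-empty.
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(batches):
    # Each batch is a dict: one queue delay per member request, token counts
    # before and after padding, and the reason the batch was dispatched.
    delays = [d for b in batches for d in b["queue_delays_us"]]
    real = sum(b["real_tokens"] for b in batches)
    padded = sum(b["padded_tokens"] for b in batches)
    return {
        "batch_size_histogram": dict(Counter(len(b["queue_delays_us"]) for b in batches)),
        "queue_delay_p50_us": percentile(delays, 50),
        "queue_delay_p99_us": percentile(delays, 99),
        "padding_ratio": 1 - real / padded,
        "dispatch_reasons": dict(Counter(b["reason"] for b in batches)),
    }
```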

From one GPU to a production service

A single-model server can use one queue. A multi-tenant gateway needs queues by model, adapter, priority, and sometimes shape profile. Global admission should prevent one model’s backlog from consuming all host memory before per-model batching begins.

Deadlines should move with the request. The gateway can pass an absolute deadline; the batcher subtracts time already spent in authentication, routing, and network transit. A fresh local delay budget would otherwise violate the end-to-end SLO.
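
A minimal sketch of deadline propagation, assuming the gateway attaches an absolute wall-clock deadline in seconds and that gateway and batcher clocks are synchronized closely enough for the comparison to be meaningful. The function and parameter names are illustrative.

```python
import time


def remaining_budget_s(absolute_deadline_s, now_s=None, reserved_exec_s=0.0):
    # Time the batcher may still spend queueing. Everything already spent
    # upstream (authentication, routing, network transit) is implicitly
    # subtracted because the deadline is absolute, not relative.
    now_s = time.time() if now_s is None else now_s
    return absolute_deadline_s - now_s - reserved_exec_s


def local_max_delay_s(absolute_deadline_s, configured_delay_s,
                      now_s=None, reserved_exec_s=0.0):
    # Never wait longer than the configured delay, and never wait past
    # what the end-to-end deadline still allows. Zero means dispatch now.
    budget = remaining_budget_s(absolute_deadline_s, now_s, reserved_exec_s)
    return max(0.0, min(configured_delay_s, budget))
```

For example, with a deadline 50 ms away and 20 ms reserved for execution, a configured 5 ms delay is used in full. If only 22 ms remain, the usable delay shrinks to about 2 ms.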

Autoscaling interacts with batch size. Adding replicas reduces queue depth and may shrink batches, so aggregate throughput can improve less than expected. Scale on SLO and goodput, not queue length alone.

Design-review questions

Which fields make requests batch-compatible?
Is queue delay budget end-to-end or local?
Can one priority class starve another?
Does autoscaling destroy efficient batch sizes?
What overload response prevents unbounded waiting?

How it connects to the rest of the series

Continuous batching operates at token-iteration granularity for generative models. Batch inference is a durable offline workflow. Chunked prefill changes which token work can join a live batch.

From equation to implementation

At arrival rate lambda requests per second, Poisson arrivals are spaced 1/lambda apart on average, so once the first request is queued the expected time to collect b requests is roughly (b - 1)/lambda, ignoring any requests already waiting. This explains why preferred batch sizes fill effortlessly at high load and cause visible delay at low load.
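
The estimate is easy to check numerically. The sketch below computes (b - 1)/lambda and compares it with a small simulation that sums b - 1 exponential inter-arrival gaps. It assumes an empty queue when the first request arrives; the trial count and seed are arbitrary.

```python
import random


def expected_collection_time_s(batch_size: int, arrival_rate_per_s: float) -> float:
    # Mean time from the first arrival until batch_size requests are queued.
    if batch_size < 1 or arrival_rate_per_s <= 0:
        raise ValueError("batch_size must be >= 1 and the arrival rate must be positive")
    return (batch_size - 1) / arrival_rate_per_s


def simulated_collection_time_s(batch_size, arrival_rate_per_s, trials=20000, seed=0):
    # Poisson arrivals have exponential gaps; a batch of b needs b - 1 more gaps.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(rng.expovariate(arrival_rate_per_s) for _ in range(batch_size - 1))
    return total / trials


print(expected_collection_time_s(8, 1000.0))   # batch of 8 at 1000 requests/s
print(simulated_collection_time_s(8, 1000.0))  # should land close to the line above
```

At 100 requests per second, a preferred batch of 32 needs about 0.31 s on average to fill, which is why a fixed preferred size turns into pure waiting once traffic thins out.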

A scheduler should dispatch when any of three conditions is met: an efficient batch is ready, the oldest request approaches its delay budget, or an instance would otherwise idle beyond the expected gain. Static delay alone cannot react to changing traffic.

Implementation sketch

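on_instance_ready():
    candidates = queue.compatible_with(instance.profile)
    if candidates.empty():
        # Check emptiness first so oldest() is never called on an empty batch.
        idle_briefly()
        return
    batch = pack_by_shape_and_token_budget(candidates)
    if batch.is_efficient() or oldest(batch).near_deadline():
        dispatch(batch)
    else:
        # Wake at whichever expires first: the oldest request's delay budget
        # or the collection window.
        arm_timer(min(queue_deadline, collection_window))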
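
A runnable variant of the decision logic, written as a pure function so it can be unit-tested without a GPU. The Pending record, the slack parameter, and the returned reason labels are assumptions made for this sketch; it treats the batch as efficient once it reaches a fixed size, which a real scheduler would replace with measured engine behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pending:
    enqueued_at: float  # seconds, when the request joined the queue
    deadline: float     # seconds, absolute time by which it must be dispatched


def decide(batch: List[Pending], now: float, efficient_size: int,
           slack: float, collection_window: float):
    """Return ("dispatch", reason), ("idle", None) or ("timer", wake_time)."""
    if not batch:
        return ("idle", None)
    oldest = min(batch, key=lambda p: p.enqueued_at)
    if len(batch) >= efficient_size:
        return ("dispatch", "full")
    if oldest.deadline - now <= slack:
        return ("dispatch", "deadline")
    if now - oldest.enqueued_at >= collection_window:
        return ("dispatch", "window")
    # Not ready yet: wake at the earlier of the deadline (minus slack)
    # and the end of the collection window.
    wake = min(oldest.deadline - slack, oldest.enqueued_at + collection_window)
    return ("timer", wake)
```

Returning the reason alongside the decision makes it straightforward to count why batches were dispatched.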

Capacity planning

Queue capacity must be derived from acceptable waiting time and service rate. An unbounded batch queue converts overload into latency and memory growth. Use per-priority limits, deadlines, and explicit rejection before the system enters a retry storm.
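
One way to derive the bound, as a rough sketch: a request that joins behind n others waits about n divided by the service rate, so the deepest acceptable queue is the acceptable wait multiplied by the service rate. The class below is a hypothetical bounded queue that rejects explicitly instead of growing; integer milliseconds avoid floating-point rounding at the boundary.

```python
from collections import deque


def max_queue_depth(acceptable_wait_ms: int, service_rate_per_s: int) -> int:
    # Deepest queue whose last entry still starts within the acceptable wait.
    return (acceptable_wait_ms * service_rate_per_s) // 1000


class BoundedQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.rejected = 0
        self._items = deque()

    def try_enqueue(self, item) -> bool:
        # Reject at admission rather than accept work that cannot meet its deadline.
        if len(self._items) >= self.capacity:
            self.rejected += 1
            return False
        self._items.append(item)
        return True

    def __len__(self):
        return len(self._items)
```

With a 50 ms acceptable wait and a service rate of 2000 requests per second, the bound is 100 queued requests. A per-priority deployment would hold one such queue per class.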

Benchmarking without fooling yourself

Replay sparse, steady, bursty, and overload arrival patterns.
Report latency versus offered load, not one concurrency point.
Measure batch efficiency and queue wait separately.
Test incompatible shapes and priority classes.

A production failure to design for

A preferred batch of 32 works in load tests. Overnight traffic falls, so requests wait the full 20 ms collection window even though batch 4 would meet throughput needs. Make preferred sizes opportunistic and dispatch on deadlines.

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Characterize to Configure, then Load test and Adapt. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

A dynamic batcher is a deadline-aware packing scheduler. It groups only requests compatible in model revision, adapter, dtype, shape profile, and execution options. Dispatch occurs when an efficient batch is ready, the oldest deadline approaches, or waiting would leave the device idle longer than the expected batching gain.

Dynamic batching deliberately spends a bounded amount of latency to buy efficient work.

Description: Follow the state from Enqueue through Collect and Pack to Dispatch. Each box is an ownership or computation boundary. In particular, the queue uses remaining end-to-end budget, not a fresh local timeout. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Batch size is an incomplete metric for variable-length models. Count padded or packed tokens, tensor shape, and expected execution time. A batch of eight long prompts can be far larger than a batch of 64 short embeddings. Token-budget packing reduces OOM risk and makes capacity limits portable across traffic mixes.
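
A minimal sketch of token-budget packing, assuming padded execution where the cost of a batch is its longest sequence multiplied by the number of members. It packs greedily in FIFO order and stops at the first request that would break the budget, so no request is overtaken; the function name and return shape are illustrative.

```python
def pack_by_token_budget(lengths, token_budget, max_batch_size):
    """Return (indices of packed requests, padded token count of the batch)."""
    batch, longest = [], 0
    for i, n in enumerate(lengths):
        new_longest = max(longest, n)
        over_size = len(batch) == max_batch_size
        over_budget = new_longest * (len(batch) + 1) > token_budget
        if over_size or over_budget:
            # Stop instead of skipping ahead, to keep FIFO order.
            break
        batch.append(i)
        longest = new_longest
    return batch, longest * len(batch)
```

A request longer than the whole budget yields an empty batch here; a real scheduler would need to reject it or route it to a profile that can hold it, otherwise it blocks the queue head.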

The collection window must come from the request latency budget.

Description: The bars compare Immediate tiny calls with Bounded collection on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain ("Larger kernels improve utilization and throughput") remains larger than the risk ("Sparse traffic may pay the full collection window") under production traffic.

Priority requires aging or reserved service. Strict priority can starve background work; plain FIFO lets large low-value requests block urgent ones. Maintain separate class budgets, promote old requests gradually, and reject before deadlines become impossible rather than dispatching doomed work.
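
A sketch of linear aging, assuming lower numbers mean more urgent and that each pending entry is a tuple of (request_id, base_priority, enqueued_at_s, deadline_s). The tuple layout and the aging rate are illustrative; the rate decides how quickly waiting work catches up with urgent arrivals.

```python
def effective_priority(base_priority: float, waited_s: float, aging_per_s: float) -> float:
    # Waiting lowers the number, so old background work is gradually promoted.
    return base_priority - aging_per_s * waited_s


def pick_next(pending, now_s: float, aging_per_s: float):
    # Choose the entry with the best effective priority; older wins ties.
    def rank(item):
        _, base, enqueued_at, _ = item
        return (effective_priority(base, now_s - enqueued_at, aging_per_s), enqueued_at)
    return min(pending, key=rank)


def feasible(item, now_s: float, est_exec_s: float) -> bool:
    # Reject instead of dispatching work that can no longer meet its deadline.
    return now_s + est_exec_s <= item[3]
```

With a slow aging rate the urgent request still goes first; with a fast one, background work that has waited ten seconds overtakes it.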

A clear state boundary prevents races between timers, cancellation, and GPU launch.

Description: State advances from Queued to Compatible, Packed, and finally Dispatched. The labels below each state identify what becomes true at that boundary. The governing invariant is: Cancellation before dispatch releases capacity; after dispatch it suppresses delivery only. Retries and cancellation must preserve the same transition rules.
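
The invariant can be sketched as a small state machine. The class below is a hypothetical per-request tracker; the lock stands in for whatever serialization a real scheduler uses between its timer, cancellation, and launch paths.

```python
import threading
from enum import Enum


class State(Enum):
    QUEUED = 1
    COMPATIBLE = 2
    PACKED = 3
    DISPATCHED = 4
    CANCELLED = 5


_NEXT = {
    State.QUEUED: State.COMPATIBLE,
    State.COMPATIBLE: State.PACKED,
    State.PACKED: State.DISPATCHED,
}


class TrackedRequest:
    def __init__(self):
        self._lock = threading.Lock()
        self.state = State.QUEUED
        self.deliver = True             # whether the response should be sent
        self.capacity_released = False  # whether the queue slot was given back

    def advance(self) -> State:
        with self._lock:
            if self.state not in _NEXT:
                raise ValueError(f"cannot advance from {self.state.name}")
            self.state = _NEXT[self.state]
            return self.state

    def cancel(self) -> None:
        with self._lock:
            if self.state is State.CANCELLED:
                return
            self.deliver = False
            if self.state is not State.DISPATCHED:
                # Before dispatch: drop the request and free its capacity.
                self.state = State.CANCELLED
                self.capacity_released = True
            # After dispatch: the GPU work already runs; only delivery is suppressed.
```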

Observable dispatch reasons make latency and utilization tradeoffs tunable.

Description: The four panels are independent review axes: Deadline, Compatibility, Efficiency, and Fairness. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: export why each batch dispatched (full, deadline, idle, or policy).

The batcher must become more eager as traffic becomes sparse.

Description: This is a causal chain, not four unrelated symptoms. The "Traffic drops" box triggers "Oldest ages", which in turn produces "Tail rises". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Dynamic batching makes waiting productive. The correct delay is the smallest one that buys a materially better execution shape.
