14/20 - Dynamic Batching: Waiting Microseconds to Save Milliseconds

Sending every request to a GPU immediately feels fast, but tiny matrix operations often waste most of the accelerator. Dynamic batching deliberately waits for a small window so compatible arrivals can share one larger execution.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

An elevator can close the instant one passenger enters or wait two seconds for three people walking toward it. The small delay may reduce total travel and energy, but waiting too long ruins the service.

Figure: Mechanism flow, the dynamic batching request path. (1) Live requests: enter model queue, track delay budget. (2) Dynamic batcher: group compatible shapes, choose batch size. (3) GPU execution: run one larger call, split responses. Input → Transform → Outcome.
Follow the state and work from left to right.

How to read this diagram: Start with Live requests, where requests enter the model queue. The middle stage, Dynamic batcher, groups compatible shapes. The final stage, GPU execution, shows the observable result: one larger call is run and its responses are split. The arrows describe dependency order, not necessarily separate services.

What actually happens

A dynamic batcher maintains a queue per model or execution profile. When an instance becomes available, it forms a batch from compatible requests, bounded by maximum batch size, preferred sizes, queue delay, priority, and timeout policies.

Compatibility includes tensor shapes, dtypes, model version, adapters, and request options. Ragged batching or padding can combine variable lengths, but padded tokens are real work unless the backend supports packed representations.
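One way to picture the compatibility check is as a bucketing key: two requests can share a batch only when every field in the key matches. The sketch below is illustrative only. The `Request` record, its field names, and `bucket_by_compatibility` are hypothetical, and a real server may include more fields (request options, for example) or fewer.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical request record; the field names are assumptions.
    request_id: str
    model_version: str
    adapter: str
    dtype: str
    shape: tuple


def compatibility_key(req: Request) -> tuple:
    # Requests may share a batch only if every field in this key matches.
    return (req.model_version, req.adapter, req.dtype, req.shape)


def bucket_by_compatibility(requests):
    # Group requests by key, preserving arrival order inside each bucket.
    buckets = defaultdict(list)
    for req in requests:
        buckets[compatibility_key(req)].append(req)
    return dict(buckets)


requests = [
    Request("a", "v1", "none", "fp16", (128,)),
    Request("b", "v1", "none", "fp16", (128,)),
    Request("c", "v1", "lora-x", "fp16", (128,)),
]
buckets = bucket_by_compatibility(requests)
```

Here requests "a" and "b" land in one bucket, while "c" is kept apart because its adapter differs.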

Queue delay is a latency investment. At high arrival rates the batch fills immediately. At low rates, the configured maximum delay becomes visible. Different service classes may therefore need different queues or delay budgets.

A worked example

Eight requests arrive within 80 microseconds. Running eight batch-one executions costs eight launches and underfills the GPU. Waiting up to 100 microseconds may produce one batch of eight whose execution is far shorter than the sum of eight separate calls, improving throughput and the completion time of the requests that would otherwise have queued behind earlier calls.
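The arithmetic behind this example can be sketched with made-up numbers. Everything numeric below is an assumption chosen for illustration, not a measurement: arrivals every 10 microseconds, a batch-one execution of 200 microseconds, a batch-of-eight execution of 400 microseconds, and a single GPU instance that runs calls back to back.

```python
# All timings are illustrative assumptions, in microseconds.
arrivals_us = [10 * i for i in range(8)]  # 0, 10, ..., 70
BATCH1_EXEC_US = 200
BATCH8_EXEC_US = 400


def sequential_latencies(arrivals, exec_us):
    # One instance runs batch-one calls back to back in arrival order.
    latencies, free_at = [], 0
    for t in arrivals:
        start = max(t, free_at)
        free_at = start + exec_us
        latencies.append(free_at - t)
    return latencies


def batched_latencies(arrivals, exec_us):
    # The batch dispatches as soon as the last request arrives.
    finish = max(arrivals) + exec_us
    return [finish - t for t in arrivals]


seq = sequential_latencies(arrivals_us, BATCH1_EXEC_US)
bat = batched_latencies(arrivals_us, BATCH8_EXEC_US)
print("sequential worst case:", max(seq), "us")
print("batched worst case:", max(bat), "us")
```

Under these assumptions the first request is slower when batched (470 versus 200 microseconds), but the worst-case latency falls from 1530 to 470 microseconds because later requests no longer wait behind seven separate calls.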

The performance model

Total latency equals queue wait plus batched execution plus response split. Throughput usually rises with batch size until memory or compute saturation. Tail latency can worsen if the scheduler chases a preferred batch that traffic cannot fill.

Figure: Phase fit, where dynamic batching changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; combines compatible prompt work. Decode: one new token per iteration; weight and KV bandwidth pressure; may group compatible decode steps. Prove it with: queue delay and useful tokens/sec. Deployment decision: spend only bounded SLO slack.
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how Dynamic batching changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Preferred batch sizes should correspond to measured engine performance cliffs, not round numbers. If TensorRT profiles or kernels are equally efficient across sizes, forcing a preferred size can add needless delay.

Figure: Trade-off map. Baseline (immediate dispatch): near-zero scheduler wait, small underfilled kernels, more launches, good at sparse traffic. Optimized (dynamic batching): short bounded queue wait, larger efficient kernels, fewer launches, best with steady arrivals. Measure both sides under the same workload.
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, Immediate dispatch, characterized by near-zero scheduler wait and small underfilled kernels. The right panel applies Dynamic batching, changing the cost profile to short bounded queue wait and larger efficient kernels. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Stateless inference with bursty compatible traffic
  • Embedding, reranking, and fixed-shape models
  • Online systems with modest queue-delay budgets

Where it disappoints

  • Setting delay without an SLO budget
  • Padding long and short requests together
  • Forcing preferred sizes that never arrive
  • Mixing priority classes in one FIFO queue

Production checklist

  • Benchmark latency across arrival rates
  • Set maximum delay from the tail SLO
  • Bucket or pack variable lengths
  • Configure queue timeout and rejection behavior
  • Test priority fairness and starvation

What to measure

  • Batch-size histogram
  • Queue delay percentiles
  • Padding or ragged-token ratio
  • GPU utilization and launch count
  • Timeouts, drops, and priority wait

From one GPU to a production service

A single-model server can use one queue. A multi-tenant gateway needs queues by model, adapter, priority, and sometimes shape profile. Global admission should prevent one model’s backlog from consuming all host memory before per-model batching begins.

Deadlines should move with the request. The gateway can pass an absolute deadline; the batcher subtracts time already spent in authentication, routing, and network transit. A fresh local delay budget would otherwise violate the end-to-end SLO.
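A minimal sketch of that subtraction follows. It assumes times are integer microseconds read from one shared clock, and that the batcher has an estimate of execution time. The function names and parameters are hypothetical.

```python
def remaining_budget_us(deadline_us: int, now_us: int) -> int:
    # Budget left after upstream stages (auth, routing, network) ran.
    return max(0, deadline_us - now_us)


def collection_wait_us(deadline_us: int, now_us: int,
                       expected_exec_us: int, max_collection_us: int) -> int:
    # The batcher may wait only as long as the remaining budget allows
    # once the expected execution time has been reserved.
    slack = remaining_budget_us(deadline_us, now_us) - expected_exec_us
    return max(0, min(max_collection_us, slack))


# 50 ms end-to-end deadline, 30 ms already spent upstream,
# 15 ms expected execution, 10 ms configured collection window.
wait = collection_wait_us(50_000, 30_000, 15_000, 10_000)
print(wait)  # 5000
```

With a fresh local budget the batcher would have waited the full 10 ms and missed the deadline; using the remaining budget it waits at most 5 ms.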

Autoscaling interacts with batch size. Adding replicas reduces queue depth and may shrink batches, so aggregate throughput can improve less than expected. Scale on SLO and goodput, not queue length alone.

Design-review questions

  • Which fields make requests batch-compatible?
  • Is queue delay budget end-to-end or local?
  • Can one priority class starve another?
  • Does autoscaling destroy efficient batch sizes?
  • What overload response prevents unbounded waiting?

How it connects to the rest of the series

Continuous batching operates at token-iteration granularity for generative models. Batch inference is a durable offline workflow. Chunked prefill changes which token work can join a live batch.

From equation to implementation

At arrival rate lambda requests per second, the expected time to collect b requests, measured from the first arrival, is roughly (b - 1)/lambda when arrivals are Poisson and the queue starts empty. This explains why preferred batch sizes fill effortlessly at high load and cause visible delay at low load.
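To see the scale of the effect, the estimate can be evaluated at two arrival rates. This is a sketch of the formula only; the function name and the example rates are assumptions.

```python
import math


def expected_fill_time_s(batch_size: int, arrival_rate_per_s: float) -> float:
    # Expected wait for (batch_size - 1) further Poisson arrivals
    # after the first request is already in the queue.
    if batch_size < 1 or arrival_rate_per_s <= 0:
        raise ValueError("batch_size must be >= 1 and rate must be > 0")
    return (batch_size - 1) / arrival_rate_per_s


busy = expected_fill_time_s(8, 100_000)  # about 70 microseconds
quiet = expected_fill_time_s(8, 100)     # about 70 milliseconds
print(f"busy: {busy * 1e6:.0f} us, quiet: {quiet * 1e3:.0f} ms")
```

The same preferred size of eight costs microseconds at 100,000 requests per second and tens of milliseconds at 100 requests per second, which is why a fixed preferred size needs a deadline guard.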

A scheduler should dispatch when any of three conditions is met: an efficient batch is ready, the oldest request approaches its delay budget, or an instance would otherwise idle beyond the expected gain. Static delay alone cannot react to changing traffic.

Implementation sketch

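on_instance_ready():
    # Nothing queued: there is no batch to form, so idle until the next arrival.
    if queue.empty():
        idle_briefly()
        return
    candidates = queue.compatible_with(instance.profile)
    batch = pack_by_shape_and_token_budget(candidates)
    # Dispatch when the batch is efficient or its oldest request nears its deadline.
    if batch.is_efficient() or oldest(batch).near_deadline():
        dispatch(batch)
    else:
        # Keep collecting, but never past the earliest deadline in the queue.
        arm_timer(min(queue_deadline, collection_window))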
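The dispatch decision in the pseudocode above can be isolated as a small pure function, which makes it easy to test. This is a sketch under stated assumptions, not a real server's API: `QueueView`, `Reason`, the thresholds, and the idea that the scheduler has an `expected_gain_us` estimate are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reason(Enum):
    FULL = "full"
    DEADLINE = "deadline"
    IDLE = "idle"


@dataclass
class QueueView:
    queued: int               # compatible requests currently waiting
    oldest_remaining_us: int  # delay budget left for the oldest request
    instance_idle_us: int     # how long the instance has been idle
    expected_gain_us: int     # estimated time saved by collecting more


def dispatch_reason(view: QueueView, efficient_size: int = 8,
                    deadline_margin_us: int = 50) -> Optional[Reason]:
    # Returns why to dispatch now, or None to keep collecting.
    if view.queued == 0:
        return None
    if view.queued >= efficient_size:
        return Reason.FULL
    if view.oldest_remaining_us <= deadline_margin_us:
        return Reason.DEADLINE
    if view.instance_idle_us > view.expected_gain_us:
        return Reason.IDLE
    return None


sparse = QueueView(queued=3, oldest_remaining_us=40,
                   instance_idle_us=0, expected_gain_us=100)
print(dispatch_reason(sparse))  # Reason.DEADLINE
```

Returning the reason rather than a boolean lets the caller record why each batch left the queue.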

Capacity planning

Queue capacity must be derived from acceptable waiting time and service rate. An unbounded batch queue converts overload into latency and memory growth. Use per-priority limits, deadlines, and explicit rejection before the system enters a retry storm.
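As a rough sketch of that derivation, capacity can be taken as the number of requests the service drains within the acceptable wait, with explicit rejection once the queue is full. The names `queue_capacity` and `BoundedQueue` and the example numbers are assumptions for illustration.

```python
from collections import deque


def queue_capacity(max_wait_ms: int, service_rate_per_s: int) -> int:
    # Requests the service can drain within the acceptable wait.
    return (max_wait_ms * service_rate_per_s) // 1000


class BoundedQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()

    def offer(self, item) -> bool:
        # Reject explicitly instead of letting latency and memory grow.
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True


capacity = queue_capacity(max_wait_ms=20, service_rate_per_s=4000)
print(capacity)  # 80
```

A request that `offer` rejects can be answered immediately with an overload response, which is cheaper than accepting work that cannot finish in time.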

Benchmarking without fooling yourself

  • Replay sparse, steady, bursty, and overload arrival patterns.
  • Report latency versus offered load, not one concurrency point.
  • Measure batch efficiency and queue wait separately.
  • Test incompatible shapes and priority classes.

A production failure to design for

A preferred batch of 32 works in load tests. Overnight traffic falls, so requests wait the full 20 ms collection window even though batch 4 would meet throughput needs. Make preferred sizes opportunistic and dispatch on deadlines.

Figure: Operational loop. (1) Characterize: arrival and shapes, latency budget. (2) Configure: token and delay limits, priority queues. (3) Load test: sparse through overload, tail latency. (4) Adapt: dispatch thresholds, reject before collapse. Measure → Learn → Repeat.
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Characterize to Configure, then Load test and Adapt. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

A dynamic batcher is a deadline-aware packing scheduler. It groups only requests compatible in model revision, adapter, dtype, shape profile, and execution options. Dispatch occurs when an efficient batch is ready, the oldest deadline approaches, or waiting would leave the device idle longer than the expected batching gain.

Figure: A deadline-aware batching cycle. Enqueue: attach absolute deadline, classify compatibility, track priority. Collect: build candidate bucket, estimate padding, watch oldest wait. Pack: fit token budget, choose engine profile, protect fairness. Dispatch: run one GPU call, split outputs, charge each request. The queue uses remaining end-to-end budget, not a fresh local timeout.
Dynamic batching deliberately spends a bounded amount of latency to buy efficient work.

How to read this diagram: Follow the state from Enqueue through Collect and Pack to Dispatch. Each box is an ownership or computation boundary. In particular, the queue uses remaining end-to-end budget, not a fresh local timeout. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Batch size is an incomplete metric for variable-length models. Count padded or packed tokens, tensor shape, and expected execution time. A batch of eight long prompts can be far larger than a batch of 64 short embeddings. Token-budget packing reduces OOM risk and makes capacity limits portable across traffic mixes.
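Token-budget packing can be sketched as follows, assuming the backend pads every sequence in a batch to the longest one and that requests are taken in FIFO order. The function name, budget, and lengths are hypothetical.

```python
def pack_by_token_budget(lengths, token_budget, max_batch):
    # Take requests in arrival order while the padded token count
    # (longest sequence times batch size) stays within the budget.
    picked, longest = [], 0
    for index, length in enumerate(lengths):
        new_longest = max(longest, length)
        padded_tokens = new_longest * (len(picked) + 1)
        if len(picked) >= max_batch or padded_tokens > token_budget:
            break  # stop here to preserve FIFO order
        picked.append(index)
        longest = new_longest
    return picked, longest * len(picked)


lengths = [100, 120, 900, 80]
picked, padded = pack_by_token_budget(lengths, token_budget=1024, max_batch=8)
useful = sum(lengths[i] for i in picked)
print(picked, padded, useful)  # [0, 1] 240 220
```

The 900-token request is left for a later batch because admitting it would pad all three sequences to 900 tokens. A single request larger than the budget yields an empty batch in this sketch and would need separate handling, such as rejection.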

Figure: Small wait can reduce total completion time. Bars compare immediate tiny calls (many launches) with bounded collection (larger kernel). Queue risk: sparse traffic may pay the full collection window. Compute gain: larger kernels improve utilization and throughput.
The collection window must come from the request latency budget.

How to read this diagram: The bars compare Immediate tiny calls with Bounded collection on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “Larger kernels improve utilization and throughput.”, remains larger than the risk, “Sparse traffic may pay the full collection window.”, under production traffic.

Priority requires aging or reserved service. Strict priority can starve background work; plain FIFO lets large low-value requests block urgent ones. Maintain separate class budgets, promote old requests gradually, and reject before deadlines become impossible rather than dispatching doomed work.
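Gradual promotion can be sketched as an effective priority that grows with waiting time. The linear aging rule, the rate of one point per 5 ms, and the base priorities below are assumptions chosen for illustration; other aging policies are equally plausible.

```python
def effective_priority(base: int, waited_ms: int, ms_per_point: int = 5) -> int:
    # Each ms_per_point of waiting promotes the request by one point.
    return base + waited_ms // ms_per_point


def pick_next(queue):
    # queue holds (name, base_priority, waited_ms) tuples.
    return max(queue, key=lambda r: effective_priority(r[1], r[2]))


fresh = [("urgent", 10, 2), ("background", 0, 10)]
aged = [("urgent", 10, 2), ("background", 0, 60)]
print(pick_next(fresh)[0])  # urgent
print(pick_next(aged)[0])   # background
```

Under this rule background work that has waited 60 ms outranks a fresh urgent request, so strict priority cannot starve it indefinitely.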

Figure: Request state inside a dynamic batcher. Queued (deadline is ticking) → Compatible (bucket selected) → Packed (capacity reserved) → Dispatched (batch is immutable). Cancellation before dispatch releases capacity; after dispatch it suppresses delivery only.
A clear state boundary prevents races between timers, cancellation, and GPU launch.

How to read this diagram: State advances from Queued to Compatible, Packed, and finally Dispatched. The labels below each state identify what becomes true at that boundary. The governing invariant is: Cancellation before dispatch releases capacity; after dispatch it suppresses delivery only. Retries and cancellation must preserve the same transition rules.
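The cancellation invariant can be expressed as a small state machine. This is a single-threaded sketch: the `TrackedRequest` class and its attributes are hypothetical, and a real batcher would also need locking around these transitions.

```python
from enum import Enum


class State(Enum):
    QUEUED = 1
    COMPATIBLE = 2
    PACKED = 3
    DISPATCHED = 4


class TrackedRequest:
    def __init__(self):
        self.state = State.QUEUED
        self.cancelled = False
        self.deliver_result = True

    def advance(self):
        # Move one step forward; DISPATCHED is terminal.
        if self.cancelled:
            raise RuntimeError("cancelled requests do not advance")
        if self.state is not State.DISPATCHED:
            self.state = State(self.state.value + 1)

    def cancel(self) -> bool:
        # Returns True when capacity is released (cancel before dispatch).
        self.deliver_result = False
        if self.state is State.DISPATCHED:
            return False  # batch is immutable; only suppress delivery
        self.cancelled = True
        return True


early = TrackedRequest()
early.advance()
early.advance()                    # now PACKED
released_early = early.cancel()    # capacity released

late = TrackedRequest()
for _ in range(3):
    late.advance()                 # now DISPATCHED
released_late = late.cancel()      # delivery suppressed only
```

Keeping the rule inside `cancel` means timers, retries, and client disconnects all go through the same transition check.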

Figure: Four inputs to a batching decision. Deadline: remaining SLO budget, oldest request wait. Compatibility: model, adapter, shape, generation options. Efficiency: token and padding cost, preferred engine sizes. Fairness: tenant and priority, aging and reservations. Export why each batch dispatched: full, deadline, idle, or policy.
Observable dispatch reasons make latency and utilization tradeoffs tunable.

How to read this diagram: The four panels are independent review axes: Deadline, Compatibility, Efficiency, and Fairness. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Export why each batch dispatched: full, deadline, idle, or policy.

Figure: A preferred batch size causes head-of-line delay. Traffic drops: preferred size never fills, timer keeps waiting. Oldest ages: deadline shrinks, small jobs stay blocked. Tail rises: SLO misses grow, retries add arrivals. Control: dispatch on deadline, adapt target size. Preferred sizes are optimization hints, never prerequisites for execution.
The batcher must become more eager as traffic becomes sparse.

How to read this diagram: This is a causal chain, not four unrelated symptoms. The "Traffic drops" stage triggers "Oldest ages", which in turn leads to "Tail rises". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Dynamic batching makes waiting productive. The correct delay is the smallest one that buys a materially better execution shape.