Batch Inference: When Throughput Matters More Than Immediacy

Not every inference request needs a blinking cursor and a token stream. Evaluating a million documents, enriching a catalog, generating embeddings, or running nightly moderation is a job, not a conversation. Batch inference optimizes for completed useful work per hour and per dollar.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Interactive inference is a taxi: leave now, even with one passenger. Batch inference is a train: group compatible work, fill the hardware, and accept a schedule in exchange for efficiency.

Figure: Batch inference request path. Input manifest (validate and bucket, assign stable IDs) → Batch worker (pack compatible items, run large batches) → Result store (write per-item status, retry only failures).
Follow the state and work from left to right.

What actually happens

A robust batch system separates submission, durable job state, scheduling, execution, and result storage. The API should acknowledge the job quickly. Workers claim partitions, process bounded chunks, checkpoint progress, and write results with stable item identifiers.

Length bucketing matters for language workloads. Padding every prompt to the longest item wastes compute and memory. Grouping similar prompt lengths and separating generation limits produces denser batches without silently changing per-item sampling configuration.
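
As a minimal sketch of length bucketing, assume each prompt has already been reduced to a token count. The function names `bucket_by_length` and `padding_ratio` and the bucket boundaries are illustrative choices, not part of any particular serving library.

```python
from bisect import bisect_left
from collections import defaultdict


def bucket_by_length(lengths, boundaries):
    """Group item indices by the smallest boundary that fits each length.

    `boundaries` must be sorted ascending. Lengths above the last boundary
    land in an overflow bucket keyed by None.
    """
    buckets = defaultdict(list)
    for index, length in enumerate(lengths):
        pos = bisect_left(boundaries, length)
        key = boundaries[pos] if pos < len(boundaries) else None
        buckets[key].append(index)
    return dict(buckets)


def padding_ratio(lengths):
    """Fraction of processed tokens that are padding when padding to the longest item."""
    if not lengths:
        return 0.0
    padded = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / padded


# Illustrative token counts: three short prompts and three long ones.
lengths = [180, 210, 2900, 3000, 195, 3100]
buckets = bucket_by_length(lengths, boundaries=[256, 4096])
mixed_ratio = padding_ratio(lengths)
short_ratio = padding_ratio([lengths[i] for i in buckets[256]])
```

In this toy input the mixed batch wastes close to half of its padded tokens, while the short bucket on its own wastes well under a fifth. Generation limits would need a second bucketing key, which this sketch leaves out.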

Retries must be item-aware and idempotent. Replaying an entire 10,000-item job because three calls failed duplicates spend and complicates output ordering. Persist terminal status, attempt count, model revision, prompt version, and output checksum per item.
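
One way to picture item-aware retries is a per-item record that carries the fields listed above. In this sketch the `ItemStatus` class, the `MAX_ATTEMPTS` budget, and the status strings are assumptions made for illustration; a real system would keep this state in a durable store rather than in memory.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

MAX_ATTEMPTS = 3  # assumed per-item retry budget


@dataclass
class ItemStatus:
    item_id: str
    model_revision: str
    prompt_version: str
    status: str = "pending"  # "pending", "succeeded", or "failed"
    attempts: int = 0
    output_checksum: Optional[str] = None


def record_attempt(item, output):
    """Update one item after an attempt; `output` is None when the call failed."""
    if item.status == "succeeded":
        return  # idempotent: replaying a finished item changes nothing
    item.attempts += 1
    if output is not None:
        item.status = "succeeded"
        item.output_checksum = hashlib.sha256(output.encode("utf-8")).hexdigest()
    elif item.attempts >= MAX_ATTEMPTS:
        item.status = "failed"  # terminal: stop spending on this item


def retryable(items):
    """Only unfinished items go back on the queue, never the whole job."""
    return [item.item_id for item in items if item.status == "pending"]


# Four items, one of which fails on its first attempt.
items = [ItemStatus(f"item-{i}", "rev-a", "v1") for i in range(4)]
for item in items:
    record_attempt(item, None if item.item_id == "item-2" else "ok")
to_retry = retryable(items)
```

Here only `item-2` is scheduled again, and it becomes a terminal failure once it exhausts the assumed budget.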

A worked example

A dataset has 100,000 prompts: half near 200 tokens and half near 3,000. Padding a mixed batch to its longest item makes every short prompt cost as much as a long one, so the job processes roughly 300 million padded tokens to cover about 160 million useful ones. Splitting the work into two length buckets removes almost all of that padding. If each item has a stable key, a worker restart resumes from unfinished items rather than regenerating completed outputs.
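
The arithmetic behind this example can be checked directly. The sketch below simplifies by treating "near 200" and "near 3,000" as exactly those lengths and by padding every item in a group to that group's longest item; the helper names are illustrative.

```python
SHORT = (50_000, 200)    # (item count, tokens per item), simplified to exact lengths
LONG = (50_000, 3_000)


def useful_tokens(*groups):
    """Tokens that carry real input, with no padding."""
    return sum(count * length for count, length in groups)


def padded_tokens(*groups):
    """Tokens processed when every item in these groups is padded to the longest."""
    longest = max(length for _, length in groups)
    return sum(count for count, _ in groups) * longest


useful = useful_tokens(SHORT, LONG)                    # 160,000,000
mixed = padded_tokens(SHORT, LONG)                     # 300,000,000
bucketed = padded_tokens(SHORT) + padded_tokens(LONG)  # 160,000,000
mixed_padding_ratio = 1 - useful / mixed               # about 0.47
```

Under these assumptions, mixing the two populations spends nearly half of all processed tokens on padding, and two buckets spend none. Real prompts vary within each bucket, so some padding remains in practice.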

The performance model

Throughput depends on useful tokens per accelerator-second, not raw batch size. Very large batches can exceed memory, increase tail completion time, or reduce flexibility. Optimize batch size, token budget, and worker count jointly.
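
A sketch of what a joint limit on batch size and token budget might look like: items are sorted by length and packed greedily so that the padded cost of each batch (item count times longest item) stays within a budget. `pack_by_token_budget` and its parameters are hypothetical names, and the numbers are illustrative.

```python
def pack_by_token_budget(lengths, max_batch_tokens, max_batch_items):
    """Greedily pack length-sorted items into batches.

    Padded cost of a batch = item count * longest item in the batch.
    Returns a list of batches, each a list of indices into `lengths`.
    An item longer than the budget still gets a batch of its own, so
    oversized inputs should be rejected before this step.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for i in order:
        # Items arrive in ascending order, so the candidate would be the longest.
        padded_cost = (len(current) + 1) * lengths[i]
        over_budget = padded_cost > max_batch_tokens
        over_count = len(current) >= max_batch_items
        if current and (over_budget or over_count):
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches


lengths = [200, 3000, 220, 2900, 180]
batches = pack_by_token_budget(lengths, max_batch_tokens=6000, max_batch_items=8)
# batches == [[4, 0, 2], [3, 1]]
```

The token budget, not the item count, is what splits the short prompts from the long ones here. Worker count is a separate knob that this sketch does not model.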

Expert lens

Offline work can consume every spare GPU cycle and still damage online SLOs if both share a model pool. Use resource classes, quotas, priorities, and preemption boundaries. The cheapest batch run is not cheap if it causes customer-facing latency or cache eviction.

Figure: Batch inference, the tradeoff. Interactive inference: optimize TTFT and TPOT; return one request now; small elastic batches; client holds connection. Batch inference: optimize jobs per hour; durable asynchronous work; length-bucketed large batches; results fetched later.
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

  • Large evaluations and dataset enrichment
  • Embedding, reranking, and moderation pipelines
  • Cost-sensitive work with flexible completion deadlines

Where it disappoints

  • Using an HTTP timeout as a batch-job controller
  • Padding mixed lengths without accounting
  • Retrying whole jobs instead of failed items
  • Running batch work in the same unprotected queue as interactive traffic

Production checklist

  • Define idempotent job and item identifiers
  • Persist model, tokenizer, and prompt versions
  • Bucket by input and output token budgets
  • Expose cancellation, partial results, and retry policy
  • Isolate or prioritize online capacity

What to measure

  • Useful tokens per GPU-second
  • Padding ratio and batch occupancy
  • Job queue age and completion ETA
  • Per-item retries and terminal failures
  • Cost per completed item
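
Several of these metrics reduce to simple ratios over per-job counters. The sketch below assumes the counters are already collected; the function name, field names, and example numbers are illustrative only.

```python
def job_metrics(useful_tokens, padded_tokens, gpu_seconds,
                gpu_cost_per_second, succeeded_items):
    """Derive headline batch metrics from raw per-job counters."""
    return {
        "useful_tokens_per_gpu_second": useful_tokens / gpu_seconds,
        "padding_ratio": 1 - useful_tokens / padded_tokens,
        # Divide by successes only, so retries and failures raise the unit cost.
        "cost_per_completed_item": gpu_seconds * gpu_cost_per_second / succeeded_items,
    }


metrics = job_metrics(
    useful_tokens=160_000_000,
    padded_tokens=200_000_000,
    gpu_seconds=8_000,
    gpu_cost_per_second=0.001,
    succeeded_items=100_000,
)
```

Queue age, completion ETA, and retry counts come from the job system rather than the accelerator, so they are not derived here.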

From one GPU to a production service

A notebook batch becomes a platform when jobs outlive processes, span workers, and must be audited. The control plane stores small durable state; object storage holds manifests and artifacts; the queue carries partition references; workers remain disposable.

Different endpoint types require different result contracts. Chat output is variable text plus usage, embeddings are dense vectors, and rerank produces ordered scores. A mixed batch may be convenient for clients but should still partition execution by compatible model and shape.

Capacity policy should reserve online headroom and assign batch deadlines. When online demand rises, batch workers can drain after a checkpoint rather than being killed mid-item. The job ETA should reflect available quota, not promise a completion time based on idle capacity.
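
Two pieces of this policy are easy to sketch: an ETA computed from the batch quota rather than from the whole pool, and a worker loop that drains only at checkpoint boundaries. All names and numbers here are assumptions for illustration.

```python
def job_eta_seconds(remaining_tokens, pool_tokens_per_second, batch_quota_fraction):
    """Estimate completion time from the share of capacity that batch may use."""
    if not 0 < batch_quota_fraction <= 1:
        raise ValueError("batch_quota_fraction must be in (0, 1]")
    return remaining_tokens / (pool_tokens_per_second * batch_quota_fraction)


def run_partition(chunks, process_chunk, should_drain):
    """Process chunks in order, stopping between chunks when asked to drain.

    Returns the index of the next unprocessed chunk, which acts as the checkpoint.
    """
    for index, chunk in enumerate(chunks):
        if should_drain():
            return index
        process_chunk(chunk)
    return len(chunks)


# With a 25% quota the honest ETA is four times the idle-capacity estimate.
quota_eta = job_eta_seconds(160_000_000, 100_000, 0.25)  # 6400.0 seconds
idle_eta = job_eta_seconds(160_000_000, 100_000, 1.0)    # 1600.0 seconds
```

Checking the drain signal only between chunks means a chunk in flight always finishes, at the cost of a drain delay bounded by one chunk's runtime.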

Design-review questions

  • What makes an item commit idempotent?
  • Where are input, output, and model versions recorded?
  • Can a cancelled job stop already leased partitions?
  • How is online capacity protected from batch backlog?
  • Can operators retrieve partial results and exact failure reasons?

How it connects to the rest of the series

Dynamic batching groups live arrivals; continuous batching changes active decode membership each step. Batch inference is the broader durable workflow that can use either engine-level technique.

From design to implementation

The durable data model should separate Job, Item, Attempt, and Artifact. A job owns policy and version metadata. An item owns stable input identity. An attempt records a worker lease and error. Artifacts point to input and output objects without forcing large payloads into the control database.
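
One possible shape for these four records, sketched as Python dataclasses. The field names, the `store://` URIs, and the checksum strings are hypothetical; a production schema would live in a database and carry more metadata.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Artifact:
    """Pointer to an object in storage; the payload itself stays out of the database."""
    uri: str
    checksum: str


@dataclass
class Attempt:
    """One worker's try at an item, tied to the lease it held."""
    worker_id: str
    lease_generation: int
    error: Optional[str] = None


@dataclass
class Item:
    """Stable input identity plus its attempt history."""
    job_id: str
    item_id: str
    input: Artifact
    output: Optional[Artifact] = None
    attempts: List[Attempt] = field(default_factory=list)


@dataclass
class Job:
    """Policy and version metadata shared by every item in the job."""
    job_id: str
    model_revision: str
    prompt_version: str
    max_attempts: int
    manifest: Artifact


job = Job("job-1", "rev-a", "v1", max_attempts=3,
          manifest=Artifact("store://manifests/job-1.jsonl", "sha256:aaa"))
item = Item(job.job_id, "item-7", Artifact("store://inputs/item-7.json", "sha256:bbb"))
item.attempts.append(Attempt("worker-1", lease_generation=1, error="timeout"))
```

Keeping `Artifact` immutable reflects the idea that a stored object is referenced, never edited in place.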

Exactly-once execution is rarely available across queues, databases, and model calls. Aim for at-least-once delivery with idempotent item commits. A worker obtains a lease, writes output under a deterministic key, and completes the item with a conditional update. Duplicate workers may compute, but only one result becomes authoritative.
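
The "deterministic key, first write wins" idea can be sketched with a dictionary standing in for the result store. This is an assumption-heavy illustration: `result_key` and `put_if_absent` are hypothetical helpers, and a real store would need an atomic conditional write, which a plain dictionary check does not provide across processes.

```python
import hashlib


def result_key(job_id, item_id, model_revision, prompt_version):
    """Same item and versions always map to the same key, so replays collide on purpose."""
    raw = "\x1f".join([job_id, item_id, model_revision, prompt_version])
    return "results/" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


def put_if_absent(store, key, value):
    """Write only when the key is new; return True if this value became authoritative."""
    if key in store:
        return False
    store[key] = value
    return True


store = {}
key = result_key("job-1", "item-7", "rev-a", "v1")
first_won = put_if_absent(store, key, "output from worker A")
second_won = put_if_absent(store, key, "output from worker B")
```

Both workers did the computation, but only worker A's output is kept. Including the model revision and prompt version in the key keeps a rerun under new versions from colliding with old results.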

Implementation sketch

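# Helper names (create_job, split_manifest, queue, engine, and so on) are placeholders, not a real API.
def submit(manifest_uri, model_revision, prompt_version):
    job = create_job(manifest_uri, model_revision, prompt_version)
    for partition in split_manifest(job):
        queue.publish(partition)  # publish a partition reference, not the payload
    return job

def worker(timeout):
    lease = claim_partition(timeout)
    for bucket in group_by_token_shape(lease.items):
        outputs = engine.generate(bucket)
        for item, output in zip(bucket, outputs):
            put_if_absent(result_key(item), output)  # deterministic key makes replays harmless
            mark_complete_if_lease_owner(item, lease)  # conditional update: a stale lease loses
    checkpoint_and_release(lease)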

Capacity planning

Partition size should bound retry cost and result latency. Worker concurrency should be derived from GPU token capacity and storage throughput, not queue depth alone. The result store often becomes the bottleneck after inference is optimized.
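
A rough way to express "derive concurrency from capacity, not queue depth" is to take the minimum over the resources that can saturate. The function and its inputs are hypothetical, the numbers are illustrative, and the sketch assumes one result write per item.

```python
import math


def max_useful_workers(gpu_tokens_per_second, tokens_per_worker_per_second,
                       store_writes_per_second, items_per_worker_per_second):
    """Worker count bounded by whichever resource saturates first."""
    by_gpu = gpu_tokens_per_second / tokens_per_worker_per_second
    by_store = store_writes_per_second / items_per_worker_per_second
    return max(1, math.floor(min(by_gpu, by_store)))


# GPU capacity alone would support 8 workers, but the result store supports only 5.
workers = max_useful_workers(
    gpu_tokens_per_second=200_000,
    tokens_per_worker_per_second=25_000,
    store_writes_per_second=500,
    items_per_worker_per_second=100,
)
```

With these inputs the result store is the binding limit, which matches the observation that storage often becomes the bottleneck once inference itself is tuned.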

Benchmarking without fooling yourself

  • Include object-store reads and result writes in job throughput.
  • Use mixed lengths and failure injection, not uniform synthetic prompts.
  • Measure restart recovery with workers killed mid-partition.
  • Compare cost per successful item, including retries and padding.

A production failure to design for

A worker loses its lease during a long model batch, but still writes outputs after a replacement worker has completed the same items. Without lease-generation checks, late writes overwrite newer results. Store the lease generation with every conditional commit.
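
This failure can be reproduced and prevented in a few lines. The sketch uses an in-memory `ItemStore` class, a hypothetical stand-in for a database row; in a real database the same check would typically be a conditional update that matches on the stored lease generation.

```python
class ItemStore:
    """In-memory stand-in for one item row that supports conditional commits."""

    def __init__(self):
        self.lease_generation = 0
        self.result = None

    def acquire_lease(self):
        """Each new lease gets a strictly larger generation, used as a fencing token."""
        self.lease_generation += 1
        return self.lease_generation

    def commit(self, generation, output):
        """Accept the write only from the holder of the current lease."""
        if generation != self.lease_generation:
            return False  # stale worker: its lease was superseded
        self.result = output
        return True


store = ItemStore()
old_lease = store.acquire_lease()   # first worker, later stalls in a long batch
new_lease = store.acquire_lease()   # replacement worker after the lease expires
new_ok = store.commit(new_lease, "result from replacement")
old_ok = store.commit(old_lease, "late result from stalled worker")
```

The stalled worker's late write is rejected, so the replacement's result stays authoritative.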

Figure: Operational loop. Version (model, prompt, and data; stable item IDs) → Execute (lease and bucket; checkpoint progress) → Recover (retry failed items; reject stale writes) → Account (cost per good item; online capacity impact).
Treat optimization as a measured loop, not a one-time flag.

The takeaway

Batch inference is an operational contract: durable work, explicit versions, bounded retries, and honest accounting. The GPU optimization only pays off when the job system is trustworthy.