Batch Inference: When Throughput Matters More Than Immediacy
Not every inference request needs a blinking cursor and a token stream. Evaluating a million documents, enriching a catalog, generating embeddings, or running nightly moderation is a job, not a conversation. Batch inference optimizes for completed useful work per hour and per dollar.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
Interactive inference is a taxi: leave now, even with one passenger. Batch inference is a train: group compatible work, fill the hardware, and accept a schedule in exchange for efficiency.
What actually happens
A robust batch system separates submission, durable job state, scheduling, execution, and result storage. The API should acknowledge the job quickly. Workers claim partitions, process bounded chunks, checkpoint progress, and write results with stable item identifiers.
Length bucketing matters for language workloads. Padding every prompt to the longest item wastes compute and memory. Grouping prompts of similar length, and keeping items with different generation limits in separate groups, produces denser batches. Bucketing should change only which items share a batch, never an item's sampling configuration.
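The grouping described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the bucket edges, the item fields (`prompt_tokens`, `max_new_tokens`), and the function names are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical bucket edges (upper bounds on prompt tokens); tune per workload.
BUCKET_EDGES = [256, 1024, 4096]


def bucket_for(prompt_tokens: int, edges: List[int] = BUCKET_EDGES) -> int:
    """Return the smallest bucket edge that fits the prompt."""
    for edge in edges:
        if prompt_tokens <= edge:
            return edge
    raise ValueError(f"prompt of {prompt_tokens} tokens exceeds the largest bucket")


def group_items_by_shape(items: List[dict]) -> Dict[Tuple[int, int], List[dict]]:
    """Group items by (prompt bucket, generation limit).

    Items are only regrouped; their own sampling settings are left untouched.
    """
    groups: Dict[Tuple[int, int], List[dict]] = {}
    for item in items:
        key = (bucket_for(item["prompt_tokens"]), item["max_new_tokens"])
        groups.setdefault(key, []).append(item)
    return groups


def padding_ratio(lengths: List[int]) -> float:
    """Fraction of padded slots when every item is padded to the longest one."""
    if not lengths:
        return 0.0
    slots = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / slots
```

Calling `group_items_by_shape(items).values()` yields the buckets to hand to the engine, and `padding_ratio` gives a quick check on how much of a candidate batch would be padding.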
Retries must be item-aware and idempotent. Replaying an entire 10,000-item job because three calls failed duplicates spend and complicates output ordering. Persist terminal status, attempt count, model revision, prompt version, and output checksum per item.
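A minimal sketch of item-aware retries, assuming an in-memory record table and a hypothetical `call_model` callable that returns text. The record fields mirror the ones listed above; `MAX_ATTEMPTS` and the status names are illustrative choices, not a prescribed schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional

MAX_ATTEMPTS = 3  # hypothetical retry limit


@dataclass
class ItemRecord:
    item_id: str
    status: str = "pending"  # pending | succeeded | failed
    attempts: int = 0
    model_revision: str = ""
    prompt_version: str = ""
    output_checksum: Optional[str] = None
    last_error: Optional[str] = None


def run_pass(
    records: Dict[str, ItemRecord],
    inputs: Dict[str, str],
    call_model: Callable[[str], str],
    model_revision: str,
    prompt_version: str,
) -> int:
    """Attempt only non-terminal items; return how many were attempted."""
    attempted = 0
    for rec in records.values():
        if rec.status != "pending":
            continue  # succeeded and failed items are never replayed
        attempted += 1
        rec.attempts += 1
        rec.model_revision = model_revision
        rec.prompt_version = prompt_version
        try:
            output = call_model(inputs[rec.item_id])
        except Exception as exc:
            rec.last_error = repr(exc)
            if rec.attempts >= MAX_ATTEMPTS:
                rec.status = "failed"  # terminal: stop spending on this item
            continue
        rec.output_checksum = hashlib.sha256(output.encode("utf-8")).hexdigest()
        rec.status = "succeeded"
    return attempted
```

Running `run_pass` again after a partial failure touches only the items still marked `pending`, so completed outputs are neither regenerated nor billed twice.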
A worked example
A dataset has 100,000 prompts: half near 200 tokens and half near 3,000. If mixed batches are padded to the longest item, most short prompts are charged as if they were long: roughly 300M token slots are processed to cover about 160M useful tokens, so nearly half the work is padding. Splitting the data into two length buckets removes almost all of it. If each item has a stable key, a worker restart resumes from unfinished items rather than regenerating completed outputs.
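The arithmetic behind this example can be checked directly. The sketch below idealizes the lengths as exactly 200 and 3,000 tokens and assumes every mixed batch contains at least one long prompt; the helper name is illustrative.

```python
def padded_slots(groups):
    """Total token slots when each group is padded to its own longest item.

    `groups` is a list of (item_count, max_len) pairs.
    """
    return sum(count * max_len for count, max_len in groups)


# Useful tokens are the same no matter how items are batched.
USEFUL = 50_000 * 200 + 50_000 * 3_000                     # 160,000,000

# One mixed pool: every item is padded to 3,000 tokens.
MIXED = padded_slots([(100_000, 3_000)])                   # 300,000,000

# Two buckets: each half is padded only to its own length.
BUCKETED = padded_slots([(50_000, 200), (50_000, 3_000)])  # 160,000,000

mixed_padding_ratio = 1 - USEFUL / MIXED         # about 0.467
bucketed_padding_ratio = 1 - USEFUL / BUCKETED   # 0.0 under the idealized lengths
```

Real prompts vary within each bucket, so the bucketed ratio will not be exactly zero, but the gap between the two layouts is the point.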
The performance model
Throughput depends on useful tokens per accelerator-second, not raw batch size. Very large batches can exceed memory, increase tail completion time, or leave fewer points at which work can be checkpointed or preempted. Optimize batch size, token budget, and worker count jointly.
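One way to make the token budget, rather than the item count, the controlling variable is greedy packing over length-sorted items. This is a sketch under assumed names (`pack_by_token_budget`, `token_budget`); it counts padded prompt slots only and ignores generated tokens and KV-cache growth.

```python
from typing import List


def pack_by_token_budget(lengths: List[int], token_budget: int) -> List[List[int]]:
    """Greedily pack prompt lengths into batches whose padded size
    (longest item times item count) stays within `token_budget`."""
    batches: List[List[int]] = []
    current: List[int] = []
    for n in sorted(lengths):
        if n > token_budget:
            raise ValueError(f"item of {n} tokens exceeds budget {token_budget}")
        # Lengths arrive in ascending order, so n is the padded length
        # of the batch if this item joins it.
        if current and n * (len(current) + 1) > token_budget:
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches


# With a 6,000-slot budget, four short prompts share one batch and
# two long prompts share another: batch size differs, token load does not explode.
example = pack_by_token_budget([3000, 200, 200, 3000, 200, 200], 6000)
```

The resulting batches have different item counts, which is expected: the quantity being held steady is the padded token load per batch.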
Expert lens
Offline work can consume every spare GPU cycle and still damage online SLOs if both share a model pool. Use resource classes, quotas, priorities, and preemption boundaries. The cheapest batch run is not cheap if it causes customer-facing latency or cache eviction.
Where it wins
- Large evaluations and dataset enrichment
- Embedding, reranking, and moderation pipelines
- Cost-sensitive work with flexible completion deadlines
Where it disappoints
- Using an HTTP timeout as a batch-job controller
- Padding mixed lengths without accounting
- Retrying whole jobs instead of failed items
- Running batch work in the same unprotected queue as interactive traffic
Production checklist
- Define idempotent job and item identifiers
- Persist model, tokenizer, and prompt versions
- Bucket by input and output token budgets
- Expose cancellation, partial results, and retry policy
- Isolate or prioritize online capacity
What to measure
- Useful tokens per GPU-second
- Padding ratio and batch occupancy
- Job queue age and completion ETA
- Per-item retries and terminal failures
- Cost per completed item
From one GPU to a production service
A notebook batch becomes a platform when jobs outlive processes, span workers, and must be audited. The control plane stores small durable state; object storage holds manifests and artifacts; the queue carries partition references; workers remain disposable.
Different endpoint types require different result contracts. Chat output is variable text plus usage, embeddings are dense vectors, and rerank produces ordered scores. A mixed batch may be convenient for clients but should still partition execution by compatible model and shape.
Capacity policy should reserve online headroom and assign batch deadlines. When online demand rises, batch workers can drain after a checkpoint rather than being killed mid-item. The job ETA should reflect available quota, not promise a completion time based on idle capacity.
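A rough sketch of an ETA that respects quota, assuming capacity is expressed in tokens per second and that online traffic is protected up to the larger of its reservation and its current demand. All names and the policy itself are illustrative assumptions.

```python
from typing import Optional


def batch_tokens_per_second(
    pool_capacity: float, online_reserved: float, online_demand: float
) -> float:
    """Capacity left for batch work after protecting online traffic."""
    protected = max(online_reserved, online_demand)
    return max(0.0, pool_capacity - protected)


def job_eta_seconds(remaining_tokens: float, batch_rate: float) -> Optional[float]:
    """ETA from the batch share of capacity; None when batch work is paused."""
    if batch_rate <= 0:
        return None
    return remaining_tokens / batch_rate
```

Returning `None` instead of a number when batch capacity is zero keeps the job from advertising a completion time it cannot meet.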
Design-review questions
- What makes an item commit idempotent?
- Where are input, output, and model versions recorded?
- Can a cancelled job stop partitions that workers have already leased?
- How is online capacity protected from batch backlog?
- Can operators retrieve partial results and exact failure reasons?
How it connects to the rest of the series
Dynamic batching groups live arrivals; continuous batching changes active decode membership each step. Batch inference is the broader durable workflow that can use either engine-level technique.
From equation to implementation
The durable data model should separate Job, Item, Attempt, and Artifact. A job owns policy and version metadata. An item owns stable input identity. An attempt records a worker lease and error. Artifacts point to input and output objects without forcing large payloads into the control database.
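The four record types might look like the following. Field names here are assumptions chosen to match the prose, not a required schema; a real control plane would map them onto database tables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Artifact:
    """Pointer to an object in external storage; payloads stay out of the database."""
    uri: str
    checksum: str


@dataclass
class Job:
    """Owns policy and version metadata."""
    job_id: str
    model_revision: str
    prompt_version: str
    retry_limit: int
    manifest: Artifact


@dataclass
class Item:
    """Owns stable input identity."""
    job_id: str
    item_key: str  # stable key, e.g. caller-supplied or derived from the input
    input_ref: Artifact
    status: str = "pending"
    output_ref: Optional[Artifact] = None


@dataclass
class Attempt:
    """Records one worker lease on an item and any error it hit."""
    job_id: str
    item_key: str
    worker_id: str
    lease_generation: int
    error: Optional[str] = None
```

Keeping `Attempt` separate from `Item` means retry history accumulates without rewriting the item row, and the item's identity never depends on which worker touched it.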
Exactly-once execution is rarely available across queues, databases, and model calls. Aim for at-least-once delivery with idempotent item commits. A worker obtains a lease, writes output under a deterministic key, and completes the item with a conditional update. Duplicate workers may compute, but only one result becomes authoritative.
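The commit path can be sketched with two in-memory stand-ins: a result store offering put-if-absent and an item table offering a conditional status update. Both classes, `result_key`, and `commit_item` are hypothetical; a production system would rely on the conditional-write primitives of its actual object store and database.

```python
import threading
from typing import Dict, Optional


def result_key(job_id: str, item_key: str) -> str:
    """Deterministic key: every attempt at the same item targets the same object."""
    return f"{job_id}/results/{item_key}"


class ResultStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._objects: Dict[str, str] = {}

    def put_if_absent(self, key: str, value: str) -> bool:
        """Write only if the key is unset; report whether this call wrote it."""
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = value
            return True

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            return self._objects.get(key)


class ItemTable:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._completed_by: Dict[str, str] = {}

    def complete_if_pending(self, item_key: str, worker_id: str) -> bool:
        """Conditional update: only the first completer succeeds."""
        with self._lock:
            if item_key in self._completed_by:
                return False
            self._completed_by[item_key] = worker_id
            return True

    def owner(self, item_key: str) -> Optional[str]:
        with self._lock:
            return self._completed_by.get(item_key)


def commit_item(store: ResultStore, table: ItemTable, job_id: str,
                item_key: str, output: str, worker_id: str) -> bool:
    """At-least-once safe commit: duplicate work is tolerated, duplicate results are not."""
    store.put_if_absent(result_key(job_id, item_key), output)
    return table.complete_if_pending(item_key, worker_id)
```

If a worker writes the object and crashes before the status update, a later worker finds the object already present and simply completes the item, so the first stored output stays authoritative.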
Implementation sketch
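# Submitter: record the job, then enqueue partition references.
job = create_job(manifest_uri, model_revision, prompt_version)
for partition in split_manifest(job):
    queue.publish(partition)
# Worker: claim a partition, process it bucket by bucket, commit per item.
def run_worker(timeout):
    lease = claim_partition(timeout)
    for bucket in group_by_token_shape(lease.items):
        outputs = engine.generate(bucket)
        for item, output in zip(bucket, outputs):
            # Deterministic key plus conditional write keeps the commit idempotent.
            put_if_absent(result_key(item), output)
            mark_complete_if_lease_owner(item, lease)
    checkpoint_and_release(lease)
Capacity planning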
Partition size should bound retry cost and result latency. Worker concurrency should be derived from GPU token capacity and storage throughput, not queue depth alone. The result store often becomes the bottleneck after inference is optimized.
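A back-of-the-envelope sizing sketch under stated assumptions: per-worker rates are measured averages, the result store is modeled by a single writes-per-second limit, and partition size is capped by how many tokens a team is willing to recompute after a failure. Every parameter name is illustrative.

```python
import math


def max_partition_items(retry_token_budget: int, avg_tokens_per_item: int) -> int:
    """Largest partition whose full replay stays within the retry budget."""
    return max(1, retry_token_budget // avg_tokens_per_item)


def worker_concurrency(
    gpu_tokens_per_s: float,
    tokens_per_worker_per_s: float,
    store_writes_per_s: float,
    writes_per_worker_per_s: float,
) -> int:
    """Workers are limited by whichever resource saturates first."""
    by_gpu = math.floor(gpu_tokens_per_s / tokens_per_worker_per_s)
    by_store = math.floor(store_writes_per_s / writes_per_worker_per_s)
    return max(1, min(by_gpu, by_store))
```

With hypothetical inputs of 80,000 GPU tokens per second at 10,000 per worker, and 500 store writes per second at 100 per worker, the store caps concurrency at 5 even though the GPUs could feed 8 workers.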
Benchmarking without fooling yourself
- Include object-store reads and result writes in job throughput.
- Use mixed lengths and failure injection, not uniform synthetic prompts.
- Measure restart recovery with workers killed mid-partition.
- Compare cost per successful item, including retries and padding.
A production failure to design for
A worker loses its lease during a long model batch, but still writes outputs after a replacement worker has completed the same items. Without lease-generation checks, late writes overwrite newer results. Store the lease generation with every conditional commit.
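The lease-generation check can be illustrated with a small in-memory table. `LeaseTable` and its methods are hypothetical; the idea is that each claim increments a counter and every commit must present the current value, so a worker holding a stale lease is rejected.

```python
from typing import Dict, Tuple


class LeaseTable:
    """Tracks the current lease generation per partition and guards commits with it."""

    def __init__(self) -> None:
        self._generation: Dict[str, int] = {}
        self._results: Dict[Tuple[str, str], str] = {}

    def claim(self, partition_id: str) -> int:
        """Issue a new lease; any earlier generation becomes stale."""
        generation = self._generation.get(partition_id, 0) + 1
        self._generation[partition_id] = generation
        return generation

    def commit(self, partition_id: str, item_key: str, output: str, generation: int) -> bool:
        """Accept the write only from the holder of the current generation."""
        if self._generation.get(partition_id) != generation:
            return False  # late write from an expired lease
        self._results[(partition_id, item_key)] = output
        return True

    def result(self, partition_id: str, item_key: str) -> str:
        return self._results[(partition_id, item_key)]


table = LeaseTable()
old_gen = table.claim("p0")   # first worker claims, then stalls in a long batch
new_gen = table.claim("p0")   # lease expires; a replacement worker claims
accepted = table.commit("p0", "item-1", "fresh", new_gen)
rejected = table.commit("p0", "item-1", "stale", old_gen)
```

In this run the replacement worker's commit is accepted and the stalled worker's late write is refused, so the stored result remains the newer one.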
The takeaway
Batch inference is an operational contract: durable work, explicit versions, bounded retries, and honest accounting. The GPU optimization only pays off when the job system is trustworthy.
