5/20 - Batch Inference: When Throughput Matters More Than Immediacy

#batch-inference #offline-inference #throughput #llm-serving #data-pipelines

Not every inference request needs a blinking cursor and a token stream. Evaluating a million documents, enriching a catalog, generating embeddings, or running nightly moderation is a job, not a conversation. Batch inference optimizes for completed useful work per hour and per dollar.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Interactive inference is a taxi: leave now, even with one passenger. Batch inference is a train: group compatible work, fill the hardware, and accept a schedule in exchange for efficiency.

Follow the state and work from left to right.

Description: Start with the Input manifest stage, which validates and buckets items. The middle stage, Batch worker, packs compatible items. The final stage, Result store, shows the observable result: a per-item status is written. The arrows describe dependency order, not necessarily separate services.

What actually happens

A robust batch system separates submission, durable job state, scheduling, execution, and result storage. The API should acknowledge the job quickly. Workers claim partitions, process bounded chunks, checkpoint progress, and write results with stable item identifiers.

Length bucketing matters for language workloads. Padding every prompt to the longest item wastes compute and memory. Grouping similar prompt lengths and separating generation limits produces denser batches without silently changing per-item sampling configuration.
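A minimal sketch of length bucketing, assuming token counts are already known for each item. The function name `bucket_by_length`, the bucket edges, and the item tuples are illustrative and not taken from any serving library.

```python
from bisect import bisect_left
from collections import defaultdict

def bucket_by_length(items, edges):
    """Group (item_id, prompt_tokens, max_new_tokens) tuples into buckets.

    `edges` are ascending, inclusive upper bounds on prompt length. The
    generation limit is part of the bucket key, so items with different
    output budgets are never merged into one batch.
    """
    buckets = defaultdict(list)
    for item_id, prompt_tokens, max_new_tokens in items:
        idx = bisect_left(edges, prompt_tokens)
        if idx == len(edges):
            raise ValueError(f"{item_id}: {prompt_tokens} tokens exceeds the largest bucket")
        buckets[(edges[idx], max_new_tokens)].append(item_id)
    return dict(buckets)

# Hypothetical items: three short prompts and one long prompt.
items = [("a", 180, 64), ("b", 210, 64), ("c", 2900, 64), ("d", 190, 256)]
buckets = bucket_by_length(items, edges=[256, 1024, 4096])
```

Here `a` and `b` share a bucket, `c` lands in the long bucket, and `d` is kept apart because its generation limit differs even though its prompt is short.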

Retries must be item-aware and idempotent. Replaying an entire 10,000-item job because three calls failed duplicates spend and complicates output ordering. Persist terminal status, attempt count, model revision, prompt version, and output checksum per item.
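One way to picture item-aware retries is a per-item record plus a filter that selects only unfinished work. This is a sketch: the `ItemRecord` fields mirror the list above, while the status strings and `max_attempts` policy are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ItemRecord:
    item_id: str
    status: str = "pending"              # assumed values: pending, succeeded, failed
    attempts: int = 0
    model_revision: str = ""
    prompt_version: str = ""
    output_checksum: Optional[str] = None

def items_to_retry(records, max_attempts):
    """Select only items that have not succeeded and still have attempts left."""
    return [r.item_id for r in records
            if r.status != "succeeded" and r.attempts < max_attempts]

records = [
    ItemRecord("doc-1", "succeeded", 1, "model-rev", "prompt-v1", "abc123"),
    ItemRecord("doc-2", "failed", 1, "model-rev", "prompt-v1"),
    ItemRecord("doc-3", "failed", 3, "model-rev", "prompt-v1"),
]
retry = items_to_retry(records, max_attempts=3)
```

Only `doc-2` is retried: `doc-1` already succeeded and `doc-3` has exhausted its attempts, so it stays a terminal failure.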

A worked example

A dataset has 100,000 prompts: half near 200 tokens and half near 3,000. In a mixed padded batch, each short prompt is charged as if it were long, so a large share of the processed tokens is padding. Splitting the work into two length buckets removes most of that padding. If each item has a stable key, a worker restart resumes from unfinished items rather than regenerating completed outputs.
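The arithmetic can be checked with a few lines. This sketch idealizes the example: it assumes prompts of exactly 200 and 3,000 tokens and an engine that pads every item to the longest one in its batch.

```python
def padded_tokens(lengths):
    """Tokens processed when every item is padded to the longest in its batch."""
    return max(lengths) * len(lengths)

short = [200] * 50_000
long = [3_000] * 50_000

useful = sum(short) + sum(long)                         # real prompt tokens
mixed = padded_tokens(short + long)                     # everything padded to 3,000
bucketed = padded_tokens(short) + padded_tokens(long)   # two length buckets

padding_ratio_mixed = (mixed - useful) / mixed
padding_ratio_bucketed = (bucketed - useful) / bucketed
```

Under these assumptions the mixed layout processes 300 million tokens to do 160 million tokens of useful work, so about 47% is padding, while the bucketed layout has none. Real prompts vary within each bucket, so the bucketed ratio will be small rather than zero.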

The performance model

Throughput depends on useful tokens per accelerator-second, not raw batch size. Very large batches can exceed memory, increase tail completion time, or reduce flexibility. Optimize batch size, token budget, and worker count jointly.
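A sketch of sizing batches by token budget rather than item count. It assumes a padded engine where a batch costs `longest * count` tokens. Both `pack_by_token_budget` and `occupancy` are hypothetical helpers.

```python
def pack_by_token_budget(lengths, token_budget):
    """Greedily pack sorted lengths so the padded size (longest * count) fits the budget."""
    batches, current = [], []
    for n in sorted(lengths):
        if n > token_budget:
            raise ValueError(f"item of {n} tokens exceeds budget {token_budget}")
        # After sorting, the incoming item is always the longest in the batch.
        if current and n * (len(current) + 1) > token_budget:
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches

def occupancy(batch):
    """Fraction of padded tokens that are real tokens."""
    return sum(batch) / (max(batch) * len(batch))

batches = pack_by_token_budget([200, 220, 180, 3000, 2800, 210], token_budget=6000)
```

The short prompts form one batch of four and the long prompts form a batch of two, so batch size falls out of the token budget instead of being fixed in advance.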

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how batch inference changes prompt processing and time to first token (TTFT); the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Offline work can consume every spare GPU cycle and still damage online SLOs if both share a model pool. Use resource classes, quotas, priorities, and preemption boundaries. The cheapest batch run is not cheap if it causes customer-facing latency or cache eviction.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, interactive inference, which optimizes TTFT and time per output token (TPOT) and returns one request now. The right panel applies batch inference, which changes the cost profile to optimize jobs per hour and durable asynchronous work. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Large evaluations and dataset enrichment
Embedding, reranking, and moderation pipelines
Cost-sensitive work with flexible completion deadlines

Where it disappoints

Using an HTTP timeout as a batch-job controller
Padding mixed lengths without accounting
Retrying whole jobs instead of failed items
Running batch work in the same unprotected queue as interactive traffic

Production checklist

Define idempotent job and item identifiers
Persist model, tokenizer, and prompt versions
Bucket by input and output token budgets
Expose cancellation, partial results, and retry policy
Isolate or prioritize online capacity

What to measure

Useful tokens per GPU-second
Padding ratio and batch occupancy
Job queue age and completion ETA
Per-item retries and terminal failures
Cost per completed item

From one GPU to a production service

A notebook batch becomes a platform when jobs outlive processes, span workers, and must be audited. The control plane stores small durable state; object storage holds manifests and artifacts; the queue carries partition references; workers remain disposable.

Different endpoint types require different result contracts. Chat output is variable text plus usage, embeddings are dense vectors, and rerank produces ordered scores. A mixed batch may be convenient for clients but should still partition execution by compatible model and shape.

Capacity policy should reserve online headroom and assign batch deadlines. When online demand rises, batch workers can drain after a checkpoint rather than being killed mid-item. The job ETA should reflect available quota, not promise a completion time based on idle capacity.
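A small sketch of a quota-aware ETA. The linear throughput model, the function name `eta_seconds`, and all numbers are assumptions for illustration.

```python
def eta_seconds(remaining_tokens, fleet_tokens_per_s, batch_quota):
    """Estimate completion time from the share of capacity batch work may use."""
    if not 0 < batch_quota <= 1:
        raise ValueError("batch_quota must be in (0, 1]")
    return remaining_tokens / (fleet_tokens_per_s * batch_quota)

# Same job, two promises: one assumes an idle fleet, one respects online headroom.
optimistic = eta_seconds(160_000_000, 100_000, batch_quota=1.0)
honest = eta_seconds(160_000_000, 100_000, batch_quota=0.25)
```

With a quarter of the fleet available to batch work, the honest estimate is four times the idle-fleet figure. That is the number the job API should report.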

Design-review questions

What makes an item commit idempotent?
Where are input, output, and model versions recorded?
Can a cancelled job stop already leased partitions?
How is online capacity protected from batch backlog?
Can operators retrieve partial results and exact failure reasons?

How it connects to the rest of the series

Dynamic batching groups live arrivals; continuous batching changes active decode membership each step. Batch inference is the broader durable workflow that can use either engine-level technique.

From equation to implementation

The durable data model should separate Job, Item, Attempt, and Artifact. A job owns policy and version metadata. An item owns stable input identity. An attempt records a worker lease and error. Artifacts point to input and output objects without forcing large payloads into the control database.
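The four entities could be sketched as plain dataclasses. The field names below are illustrative assumptions. The only point carried over from the text is the split of responsibilities and that artifacts are pointers rather than payloads.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Artifact:
    uri: str                 # pointer into object storage, not the payload itself
    checksum: str

@dataclass(frozen=True)
class Job:
    job_id: str
    model_revision: str
    prompt_version: str
    manifest: Artifact       # the immutable input manifest

@dataclass(frozen=True)
class Item:
    job_id: str
    item_key: str            # stable identity derived from the input, not from ordering
    input_ref: Artifact

@dataclass
class Attempt:
    job_id: str
    item_key: str
    worker_id: str
    lease_generation: int
    error: Optional[str] = None
    output_ref: Optional[Artifact] = None

job = Job("job-1", "model-rev", "prompt-v1", Artifact("store://manifests/job-1", "m-sum"))
item = Item(job.job_id, "doc-7", Artifact("store://inputs/doc-7", "i-sum"))
attempt = Attempt(job.job_id, item.item_key, worker_id="w-1", lease_generation=1)
```

Job and Item are frozen because they describe what was asked for. Attempt is mutable because it records what happened during one lease.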

Exactly-once execution is rarely available across queues, databases, and model calls. Aim for at-least-once delivery with idempotent item commits. A worker obtains a lease, writes output under a deterministic key, and completes the item with a conditional update. Duplicate workers may compute, but only one result becomes authoritative.
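A sketch of the idempotent commit, using an in-memory dictionary as a stand-in for a store with conditional writes. `ResultStore` and `result_key` are hypothetical. A real deployment would rely on the conditional-write primitive of its database or object store, because a plain dictionary is not atomic across processes.

```python
class ResultStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self):
        self._results = {}

    def put_if_absent(self, key, value):
        """Write only if the key is new. Returns True when this call won."""
        if key in self._results:
            return False
        self._results[key] = value
        return True

    def get(self, key):
        return self._results[key]

def result_key(job_id, item_key):
    """Deterministic key: every attempt at the same item targets the same slot."""
    return f"{job_id}/{item_key}"

store = ResultStore()
key = result_key("job-1", "doc-7")
first = store.put_if_absent(key, "output from worker A")
second = store.put_if_absent(key, "output from worker B")
```

Both workers did the computation, but only the first write is kept, so the duplicate attempt costs compute without changing the visible result.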

Implementation sketch

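# Submitter: record the job, then publish one message per partition.
# All helper functions here are placeholders for your own storage and engine calls.
job = create_job(manifest_uri, model_revision, prompt_version)
for partition in split_manifest(job):
    queue.publish(partition)

# Worker: claim a partition, process it bucket by bucket, commit item by item.
def run_worker(timeout):
    lease = claim_partition(timeout)
    for bucket in group_by_token_shape(lease.items):
        outputs = engine.generate(bucket)
        for item, output in zip(bucket, outputs):
            # Deterministic key: a duplicate attempt cannot create a second result.
            put_if_absent(result_key(item), output)
            # Conditional update: only the current lease holder completes the item.
            mark_complete_if_lease_owner(item, lease)
    checkpoint_and_release(lease)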

Capacity planning

Partition size should bound retry cost and result latency. Worker concurrency should be derived from GPU token capacity and storage throughput, not queue depth alone. The result store often becomes the bottleneck after inference is optimized.
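A back-of-the-envelope sketch of deriving worker count from both limits. The linear model, the one-write-per-item assumption, and every number below are illustrative.

```python
import math

def plan_workers(target_tokens_per_s, tokens_per_s_per_worker,
                 items_per_s_per_worker, store_writes_per_s):
    """Return (worker_count, limiting_factor) under a simple linear model."""
    needed_for_target = math.ceil(target_tokens_per_s / tokens_per_s_per_worker)
    # Assumes each completed item costs one write to the result store.
    supported_by_store = math.floor(store_writes_per_s / items_per_s_per_worker)
    if supported_by_store < needed_for_target:
        return supported_by_store, "result store"
    return needed_for_target, "gpu"

plan = plan_workers(
    target_tokens_per_s=200_000,
    tokens_per_s_per_worker=25_000,
    items_per_s_per_worker=50,
    store_writes_per_s=300,
)
```

In this example the token target asks for eight workers, but the store can absorb writes from only six, so adding GPUs would not raise job throughput until the store is scaled.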

Benchmarking without fooling yourself

Include object-store reads and result writes in job throughput.
Use mixed lengths and failure injection, not uniform synthetic prompts.
Measure restart recovery with workers killed mid-partition.
Compare cost per successful item, including retries and padding.

A production failure to design for

A worker loses its lease during a long model batch, but still writes outputs after a replacement worker has completed the same items. Without lease-generation checks, late writes overwrite newer results. Store the lease generation with every conditional commit.
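The failure and its fix can be reproduced in a few lines. This is a sketch with an in-memory table: `ItemTable` is hypothetical, and a production version would make `commit` a single conditional update in the control database.

```python
class ItemTable:
    """In-memory stand-in for item rows guarded by a lease generation."""

    def __init__(self):
        self._rows = {}

    def lease(self, item_key):
        """Grant a new lease and return its generation number."""
        row = self._rows.setdefault(
            item_key, {"generation": 0, "status": "pending", "output": None})
        row["generation"] += 1
        return row["generation"]

    def commit(self, item_key, generation, output):
        """Accept the result only from the current lease, and only once."""
        row = self._rows[item_key]
        if row["status"] == "complete" or generation != row["generation"]:
            return False
        row.update(status="complete", output=output)
        return True

    def output(self, item_key):
        return self._rows[item_key]["output"]

table = ItemTable()
gen_a = table.lease("doc-7")      # worker A claims the item, then stalls
gen_b = table.lease("doc-7")      # lease expires; worker B gets a newer generation
b_committed = table.commit("doc-7", gen_b, "result from B")
a_committed = table.commit("doc-7", gen_a, "late result from A")
```

Worker A's late write is rejected because its generation is stale, so the result from worker B stays authoritative.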

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Version to Execute, then Recover and Account. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Batch inference is a durable data-processing system around model execution. Submission records the immutable input artifact, model revision, generation parameters, tenant, and idempotency key. Workers claim bounded shards, checkpoint progress, and write per-item results before an aggregator finalizes the job manifest. The control plane must survive worker loss without duplicating billable or externally visible effects.

Batch inference combines a state machine, artifact protocol, and throughput scheduler.

Description: Follow the state from Submit through Partition and Execute to Finalize. Each box is an ownership or computation boundary. In particular, job completion is derived from durable item state, never worker memory. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Exactly-once model execution is rarely practical; exactly-once visible results are. Give every item a deterministic identity, write outputs conditionally, and make retries overwrite or ignore the same logical slot. A worker lease has an expiry and heartbeat so abandoned shards return to the queue. Cancellation stops new claims and records what already completed.
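A sketch of lease expiry and heartbeat. Time is passed in explicitly so the behavior is easy to test. The `Lease` class and `reclaimable` helper are hypothetical names.

```python
class Lease:
    """A time-bounded claim on a shard that must be renewed by heartbeat."""

    def __init__(self, shard_id, worker_id, ttl_s, now):
        self.shard_id = shard_id
        self.worker_id = worker_id
        self.ttl_s = ttl_s
        self.expires_at = now + ttl_s

    def expired(self, now):
        return now >= self.expires_at

    def heartbeat(self, now):
        """Extend the lease. Returns False if it had already expired."""
        if self.expired(now):
            return False
        self.expires_at = now + self.ttl_s
        return True

def reclaimable(leases, now):
    """Shards whose leases have lapsed and can return to the queue."""
    return [lease.shard_id for lease in leases if lease.expired(now)]

lease = Lease("shard-1", "w-1", ttl_s=30, now=0)
renewed = lease.heartbeat(now=20)     # extends expiry to t=50
too_late = lease.heartbeat(now=60)    # worker stalled past the expiry
abandoned = reclaimable([lease], now=60)
```

A worker that misses its heartbeat cannot revive the lease, so the shard becomes reclaimable and another worker can pick it up.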

Batch work trades immediate response for cost-efficient, recoverable execution.

Description: The bars compare an online endpoint with a batch fleet on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (large batches and relaxed deadlines raise utilization) remains larger than the risk (manifests, checkpoints, and retries consume IO) under production traffic.

Fairness must use estimated work, not job count. One million long-document requests should not sit beside a ten-item evaluation job in plain FIFO. Token estimates, model class, deadline, tenant quota, and aging form a practical scheduling key. Reserve capacity for small jobs so bulk tenants cannot create multi-hour head-of-line blocking.
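A deliberately simplified sketch of a scheduling key that uses only estimated tokens and aging. Deadline, tenant quota, and model class are left out, and the scoring formula and aging rate are assumptions, not a recommended policy.

```python
from dataclasses import dataclass

@dataclass
class QueuedJob:
    job_id: str
    estimated_tokens: int
    submitted_at: float

def schedule_order(jobs, now, aging_tokens_per_s):
    """Sort by estimated work minus an aging credit. Lowest score runs first."""
    def score(job):
        waited = now - job.submitted_at
        return job.estimated_tokens - aging_tokens_per_s * waited
    return [job.job_id for job in sorted(jobs, key=score)]

bulk = QueuedJob("bulk", 3_000_000_000, submitted_at=0.0)
small = QueuedJob("eval", 5_000, submitted_at=0.0)
early = schedule_order([bulk, small], now=0.0, aging_tokens_per_s=1_000_000)

# Later, a new small job arrives while the bulk job has been waiting.
late_small = QueuedJob("eval-2", 5_000, submitted_at=4_000.0)
later = schedule_order([bulk, late_small], now=4_000.0, aging_tokens_per_s=1_000_000)
```

At submission the ten-item evaluation runs ahead of the bulk job. After the bulk job has waited long enough, its aging credit moves it ahead of newly arrived small jobs, which prevents starvation.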

A job state is a durable summary of item-level truth.

Description: State advances from Submitted to Running, Finalizing, and finally Terminal. The labels below each state identify what becomes true at that boundary. The governing invariant is: Every transition is idempotent and recoverable after process restart. Retries and cancellation must preserve the same transition rules.
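Deriving job state from item state can be written as a pure function, which makes it idempotent by construction. The item status names and the `finalized` flag are assumptions for this sketch.

```python
TERMINAL_ITEM_STATUSES = {"succeeded", "failed", "cancelled"}

def derive_job_state(item_statuses, finalized):
    """Compute job state from durable item statuses, never from worker memory.

    `finalized` is True once the aggregator has written the final job manifest.
    """
    if all(status == "pending" for status in item_statuses):
        return "submitted"          # also covers a job with no items yet
    if any(status not in TERMINAL_ITEM_STATUSES for status in item_statuses):
        return "running"
    return "terminal" if finalized else "finalizing"

in_progress = derive_job_state(["succeeded", "pending", "failed"], finalized=False)
awaiting_manifest = derive_job_state(["succeeded", "failed"], finalized=False)
done = derive_job_state(["succeeded", "failed"], finalized=True)
```

Because the function reads only durable state, calling it again after a process restart gives the same answer, and a job with failed items can still reach a terminal state.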

The batch API is only the front door; these contracts carry the workload.

Description: The four panels are independent review axes: Input, Execution, Output, and Governance. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Treat storage lifecycle and inference lifecycle as one design.

Retries are normal in batch systems, so duplicate work must be harmless.

Description: This is a causal chain, not a set of unrelated symptoms. An expired lease lets two workers run the same items, which creates a race between their outputs. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Batch inference is an operational contract: durable work, explicit versions, bounded retries, and honest accounting. The GPU optimization only pays off when the job system is trustworthy.
