5/20 - Batch Inference: When Throughput Matters More Than Immediacy

Not every inference request needs a blinking cursor and a token stream. Evaluating a million documents, enriching a catalog, generating embeddings, or running nightly moderation is a job, not a conversation. Batch inference optimizes for completed useful work per hour and per dollar.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Interactive inference is a taxi: leave now, even with one passenger. Batch inference is a train: group compatible work, fill the hardware, and accept a schedule in exchange for efficiency.

[Diagram: Batch inference request path. 01 Input manifest (validate and bucket; assign stable IDs) → 02 Batch worker (pack compatible items; run large batches) → 03 Result store (write per-item status; retry only failures). Input → Transform → Outcome.]
Follow the state and work from left to right.

How to read this diagram: Start with the input manifest, where items are validated, bucketed, and assigned stable IDs. The middle stage, the batch worker, packs compatible items and runs them in large batches. The final stage, the result store, shows the observable result: a per-item status is written, so only failures need a retry. The arrows describe dependency order, not necessarily separate services.

What actually happens

A robust batch system separates submission, durable job state, scheduling, execution, and result storage. The API should acknowledge the job quickly. Workers claim partitions, process bounded chunks, checkpoint progress, and write results with stable item identifiers.

Length bucketing matters for language workloads. Padding every prompt to the longest item wastes compute and memory. Grouping prompts of similar length, and separating items by their generation limit, produces denser batches while leaving each item's sampling configuration unchanged.

Retries must be item-aware and idempotent. Replaying an entire 10,000-item job because three calls failed duplicates spend and complicates output ordering. Persist terminal status, attempt count, model revision, prompt version, and output checksum per item.

A worked example

A dataset has 100,000 prompts: half near 200 tokens and half near 3,000. In a mixed padded batch, most short prompts are charged as if they were long: padding every item to 3,000 tokens processes about 300 million tokens to cover roughly 160 million useful ones, so close to half the work is padding. Two length buckets remove most of that waste. If each item has a stable key, a worker restart resumes from unfinished items rather than regenerating completed outputs.
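
A minimal sketch of the padding accounting for this example, assuming idealized exact lengths of 200 and 3,000 tokens, interleaved in the input. The function names, the batch size of 32, and the single bucket boundary at 512 tokens are illustrative choices, not values from a real system.

```python
from collections import defaultdict

def padded_tokens(lengths, batch_size):
    """Total tokens processed when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

def bucketed_padded_tokens(lengths, batch_size, boundaries):
    """Same accounting, but items are grouped by length bucket before batching."""
    buckets = defaultdict(list)
    for n in lengths:
        # First boundary that fits the prompt; anything larger goes to an overflow bucket.
        idx = next((i for i, b in enumerate(boundaries) if n <= b), len(boundaries))
        buckets[idx].append(n)
    return sum(padded_tokens(group, batch_size) for group in buckets.values())

lengths = [200, 3000] * 50_000          # 100,000 prompts, half short and half long
useful = sum(lengths)
mixed = padded_tokens(lengths, batch_size=32)
bucketed = bucketed_padded_tokens(lengths, 32, boundaries=[512])

print(f"mixed padding ratio:    {1 - useful / mixed:.3f}")
print(f"bucketed padding ratio: {1 - useful / bucketed:.3f}")
```

Real prompts vary within each bucket, so the bucketed ratio will be above zero in practice; the point is the size of the gap between the two layouts.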

The performance model

Throughput depends on useful tokens per accelerator-second, not raw batch size. Very large batches can exceed memory, increase tail completion time, or reduce flexibility. Optimize batch size, token budget, and worker count jointly.
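
One way to make the token budget concrete is a greedy packer that closes a batch when its padded cost would exceed the budget. This is a sketch under simple assumptions: cost is modeled as longest item times item count, and `pack_by_token_budget`, `token_budget`, and `max_items` are hypothetical names.

```python
def pack_by_token_budget(lengths, token_budget, max_items):
    """Greedy packing: sort by length, close a batch when padded cost would exceed the budget."""
    batches, current = [], []
    for n in sorted(lengths):
        if n > token_budget:
            raise ValueError(f"item of {n} tokens exceeds budget {token_budget}")
        # Lengths are ascending, so the incoming item is the longest in the candidate batch.
        padded_cost = n * (len(current) + 1)
        if current and (padded_cost > token_budget or len(current) >= max_items):
            batches.append(current)
            current = []
        current.append(n)
    if current:
        batches.append(current)
    return batches

batches = pack_by_token_budget([200, 210, 190, 3000, 2900, 220],
                               token_budget=6000, max_items=4)
for b in batches:
    print(len(b), "items, padded:", max(b) * len(b), "useful:", sum(b))
```

The item cap and the token budget bind in different regimes: short prompts hit the item cap first, long prompts hit the token budget first, which is why the two are tuned together.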

[Diagram: Where batch inference changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; batch inference packs prompt work for throughput. Decode: one new token per iteration; weight and KV bandwidth pressure; batch inference amortizes generation across jobs. Prove it with: goodput, cost, and completion time. Deployment decision: optimize the fleet for durable throughput.]
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how Batch inference changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Offline work can consume every spare GPU cycle and still damage online SLOs if both share a model pool. Use resource classes, quotas, priorities, and preemption boundaries. The cheapest batch run is not cheap if it causes customer-facing latency or cache eviction.

[Diagram: The batch inference trade-off. Baseline, interactive inference: optimize TTFT and TPOT; return one request now; small elastic batches; client holds connection. Optimized, batch inference: optimize jobs per hour; durable asynchronous work; length-bucketed large batches; results fetched later. Measure both sides under the same workload.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, interactive inference, which optimizes TTFT and TPOT and returns each request as soon as possible. The right panel applies batch inference, shifting the cost profile toward jobs per hour and durable asynchronous work. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Large evaluations and dataset enrichment
  • Embedding, reranking, and moderation pipelines
  • Cost-sensitive work with flexible completion deadlines

Where it disappoints

  • Using an HTTP timeout as a batch-job controller
  • Padding mixed lengths without accounting for the wasted tokens
  • Retrying whole jobs instead of failed items
  • Running batch work in the same unprotected queue as interactive traffic

Production checklist

  • Define idempotent job and item identifiers
  • Persist model, tokenizer, and prompt versions
  • Bucket by input and output token budgets
  • Expose cancellation, partial results, and retry policy
  • Isolate or prioritize online capacity

What to measure

  • Useful tokens per GPU-second
  • Padding ratio and batch occupancy
  • Job queue age and completion ETA
  • Per-item retries and terminal failures
  • Cost per completed item
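
Several of these metrics fall out of a handful of counters. The sketch below assumes a hypothetical `RunCounters` record collected per job; the field names and sample numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class RunCounters:
    useful_tokens: int     # prompt and generated tokens that belong to real items
    padded_tokens: int     # tokens the accelerator actually processed, padding included
    gpu_seconds: float
    completed_items: int
    total_attempts: int    # includes retries
    total_cost: float      # includes retried and failed attempts

def summarize(c: RunCounters) -> dict:
    """Derive the headline metrics from raw counters."""
    return {
        "useful_tokens_per_gpu_second": c.useful_tokens / c.gpu_seconds,
        "padding_ratio": 1 - c.useful_tokens / c.padded_tokens,
        "attempts_per_completed_item": c.total_attempts / c.completed_items,
        "cost_per_completed_item": c.total_cost / c.completed_items,
    }

report = summarize(RunCounters(
    useful_tokens=8_000_000, padded_tokens=10_000_000, gpu_seconds=4_000.0,
    completed_items=5_000, total_attempts=5_200, total_cost=25.0,
))
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Dividing cost by completed items rather than attempted items keeps retries and terminal failures visible in the number.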

From one GPU to a production service

A notebook batch becomes a platform when jobs outlive processes, span workers, and must be audited. The control plane stores small durable state; object storage holds manifests and artifacts; the queue carries partition references; workers remain disposable.

Different endpoint types require different result contracts. Chat output is variable text plus usage, embeddings are dense vectors, and rerank produces ordered scores. A mixed batch may be convenient for clients but should still partition execution by compatible model and shape.

Capacity policy should reserve online headroom and assign batch deadlines. When online demand rises, batch workers can drain after a checkpoint rather than being killed mid-item. The job ETA should reflect available quota, not promise a completion time based on idle capacity.
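
A rough sketch of a quota-based ETA, assuming capacity and demand can both be expressed in tokens per second. `batch_quota`, `eta_seconds`, and the 50% headroom figure are assumptions chosen for the example.

```python
def batch_quota(total_tokens_per_s, online_demand_tokens_per_s, online_headroom_fraction):
    """Tokens per second left for batch after online demand plus its reserved headroom."""
    reserved = online_demand_tokens_per_s * (1 + online_headroom_fraction)
    return max(0.0, total_tokens_per_s - reserved)

def eta_seconds(remaining_tokens, quota_tokens_per_s):
    """ETA under the current quota; None when batch currently has no capacity."""
    if quota_tokens_per_s <= 0:
        return None
    return remaining_tokens / quota_tokens_per_s

quota = batch_quota(100_000, 40_000, 0.5)
honest_eta = eta_seconds(144_000_000, quota)        # based on available quota
idle_eta = eta_seconds(144_000_000, 100_000)        # based on idle capacity

print(f"quota: {quota:.0f} tokens/s")
print(f"ETA from quota: {honest_eta:.0f} s, ETA from idle capacity: {idle_eta:.0f} s")
```

In this example the idle-capacity estimate is 2.5 times more optimistic than the quota-based one, which is the gap a client would experience as a missed promise.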

Design-review questions

  • What makes an item commit idempotent?
  • Where are input, output, and model versions recorded?
  • Can a cancelled job stop already leased partitions?
  • How is online capacity protected from batch backlog?
  • Can operators retrieve partial results and exact failure reasons?

How it connects to the rest of the series

Dynamic batching groups live arrivals; continuous batching changes active decode membership each step. Batch inference is the broader durable workflow that can use either engine-level technique.

From equation to implementation

The durable data model should separate Job, Item, Attempt, and Artifact. A job owns policy and version metadata. An item owns stable input identity. An attempt records a worker lease and error. Artifacts point to input and output objects without forcing large payloads into the control database.
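
The four records could be sketched as dataclasses like the following. The field names are assumptions made for illustration; only the separation into Job, Item, Attempt, and Artifact comes from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional
import hashlib

class ItemStatus(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

@dataclass(frozen=True)
class Artifact:
    uri: str       # pointer into object storage, never the payload itself
    sha256: str

@dataclass(frozen=True)
class Job:
    job_id: str
    model_revision: str
    prompt_version: str
    max_attempts: int      # retry policy lives on the job
    manifest: Artifact

@dataclass
class Attempt:
    worker_id: str
    lease_generation: int
    error: Optional[str] = None

@dataclass
class Item:
    job_id: str
    item_key: str
    status: ItemStatus = ItemStatus.PENDING
    attempts: List[Attempt] = field(default_factory=list)
    output: Optional[Artifact] = None

def stable_item_key(job_id: str, input_bytes: bytes) -> str:
    """Derive identity from content so the same input always maps to the same slot."""
    return hashlib.sha256(job_id.encode() + b"\x00" + input_bytes).hexdigest()[:16]

key = stable_item_key("job-1", b"prompt A")
item = Item(job_id="job-1", item_key=key)
item.attempts.append(Attempt(worker_id="worker-a", lease_generation=1, error="timeout"))
print(item.item_key, item.status.value, len(item.attempts))
```

Keeping `Artifact` as a URI plus digest is what lets the control database stay small while outputs grow without bound.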

Exactly-once execution is rarely available across queues, databases, and model calls. Aim for at-least-once delivery with idempotent item commits. A worker obtains a lease, writes output under a deterministic key, and completes the item with a conditional update. Duplicate workers may compute, but only one result becomes authoritative.

Implementation sketch

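# Submission: record the job, then enqueue references to its partitions.
# Every helper below is a placeholder name, not a real library call.
job = create_job(manifest_uri, model_revision, prompt_version)
for partition in split_manifest(job):
    queue.publish(partition)

# Worker: claim one partition, process it bucket by bucket, commit per item.
def run_worker(timeout):
    lease = claim_partition(timeout)
    for bucket in group_by_token_shape(lease.items):
        outputs = engine.generate(bucket)
        for item, output in zip(bucket, outputs):
            # Deterministic key: a duplicate attempt cannot add a second result.
            put_if_absent(result_key(item), output)
            # Conditional update: succeeds only if this worker still holds the lease.
            mark_complete_if_lease_owner(item, lease)
    checkpoint_and_release(lease)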

Capacity planning

Partition size should bound retry cost and result latency. Worker concurrency should be derived from GPU token capacity and storage throughput, not queue depth alone. The result store often becomes the bottleneck after inference is optimized.
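
As a back-of-the-envelope sketch, worker count and partition size can be derived from a few measured rates. The function names and every number below are assumptions; substitute measurements from the actual fleet and result store.

```python
import math

def max_useful_workers(gpu_tokens_per_s_per_worker, avg_tokens_per_item,
                       avg_result_bytes, store_write_bytes_per_s):
    """Worker count beyond which the result store, not the GPUs, limits throughput."""
    items_per_s_per_worker = gpu_tokens_per_s_per_worker / avg_tokens_per_item
    store_items_per_s = store_write_bytes_per_s / avg_result_bytes
    return math.floor(store_items_per_s / items_per_s_per_worker)

def partition_size(target_partition_seconds, gpu_tokens_per_s_per_worker, avg_tokens_per_item):
    """Items per partition so that redoing one lost partition costs about the target time."""
    items = target_partition_seconds * gpu_tokens_per_s_per_worker / avg_tokens_per_item
    return max(1, int(items))

workers = max_useful_workers(20_000, 2_000, 8_192, 4_194_304)
size = partition_size(300, 20_000, 2_000)
print(f"store saturates at about {workers} workers; partition size {size} items")
```

With these inputs, adding a 52nd worker would only lengthen the write queue, which is the bottleneck shift described above.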

Benchmarking without fooling yourself

  • Include object-store reads and result writes in job throughput.
  • Use mixed lengths and failure injection, not uniform synthetic prompts.
  • Measure restart recovery with workers killed mid-partition.
  • Compare cost per successful item, including retries and padding.

A production failure to design for

A worker loses its lease during a long model batch, but still writes outputs after a replacement worker has completed the same items. Without lease-generation checks, late writes overwrite newer results. Store the lease generation with every conditional commit.
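
The check can be sketched with an in-memory stand-in for a store that supports conditional updates. `ItemStore` and its methods are hypothetical; a real system would express the same condition in its database's conditional-write mechanism.

```python
import threading

class ItemStore:
    """In-memory stand-in for a database with conditional updates."""

    def __init__(self):
        self._lock = threading.Lock()
        self._lease_generation = {}   # item_key -> newest lease generation
        self._results = {}            # item_key -> (generation, output)

    def grant_lease(self, item_key):
        """Issue a new lease; each grant bumps the generation and fences older holders."""
        with self._lock:
            gen = self._lease_generation.get(item_key, 0) + 1
            self._lease_generation[item_key] = gen
            return gen

    def commit(self, item_key, generation, output):
        """Accept the write only from the newest lease, and only once."""
        with self._lock:
            if generation != self._lease_generation.get(item_key):
                return False          # stale lease: reject the late write
            if item_key in self._results:
                return False          # already committed: keep the first result
            self._results[item_key] = (generation, output)
            return True

    def result(self, item_key):
        return self._results[item_key][1]

store = ItemStore()
slow = store.grant_lease("item-1")          # generation 1, worker stalls mid-batch
replacement = store.grant_lease("item-1")   # lease reissued as generation 2
assert store.commit("item-1", replacement, "fresh output") is True
assert store.commit("item-1", slow, "late output") is False
print(store.result("item-1"))
```

The slow worker still burned GPU time, so the check protects correctness rather than cost; the lease timeout is what bounds the wasted work.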

[Diagram: Operating loop. 1 Version (model, prompt, and data; stable item IDs) → 2 Execute (lease and bucket; checkpoint progress) → 3 Recover (retry failed items; reject stale writes) → 4 Account (cost per good item; online capacity impact). Measure → Learn → Repeat.]
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Version to Execute, then Recover and Account. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Batch inference is a durable data-processing system around model execution. Submission records the immutable input artifact, model revision, generation parameters, tenant, and idempotency key. Workers claim bounded shards, checkpoint progress, and write per-item results before an aggregator finalizes the job manifest. The control plane must survive worker loss without duplicating billable or externally visible effects.
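
Idempotent submission might look like the sketch below, which assumes the idempotency key is scoped per tenant and that reusing a key with different contents is an error. `JobRegistry` is a hypothetical in-memory stand-in for the control-plane table.

```python
import hashlib
import json

class JobRegistry:
    """In-memory stand-in for a control-plane table keyed by tenant and idempotency key."""

    def __init__(self):
        self._jobs = {}

    def submit(self, tenant, idempotency_key, manifest_digest, model_revision, params):
        # Fingerprint the frozen inputs so a reused key with different content is caught.
        fingerprint = hashlib.sha256(
            json.dumps([manifest_digest, model_revision, params], sort_keys=True).encode()
        ).hexdigest()
        slot = (tenant, idempotency_key)
        existing = self._jobs.get(slot)
        if existing is None:
            job_id = f"job-{len(self._jobs) + 1}"
            self._jobs[slot] = (job_id, fingerprint)
            return job_id
        if existing[1] != fingerprint:
            raise ValueError("idempotency key reused with different job contents")
        return existing[0]            # client retry: hand back the original job

registry = JobRegistry()
first = registry.submit("tenant-a", "key-1", "sha256:abc", "rev-7", {"max_tokens": 256})
retry = registry.submit("tenant-a", "key-1", "sha256:abc", "rev-7", {"max_tokens": 256})
print(first, retry)
```

A client that times out on submission can therefore resend the same request without creating, and paying for, a second job.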

[Diagram: A durable batch inference job. Submit (validate manifest; freeze model revision; store idempotency key) → Partition (create bounded shards; estimate token cost; assign priorities) → Execute (lease shard to worker; checkpoint results; retry failed items) → Finalize (seal output manifest; publish accounting; expire artifacts). Job completion is derived from durable item state, never worker memory.]
Batch inference combines a state machine, artifact protocol, and throughput scheduler.

How to read this diagram: Follow the state from Submit through Partition and Execute to Finalize. Each box is an ownership or computation boundary. In particular, job completion is derived from durable item state, never worker memory. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Exactly-once model execution is rarely practical; exactly-once visible results are. Give every item a deterministic identity, write outputs conditionally, and make retries overwrite or ignore the same logical slot. A worker lease has an expiry and heartbeat so abandoned shards return to the queue. Cancellation stops new claims and records what already completed.
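
The liveness half of that design, a lease with expiry and heartbeat, can be sketched as follows. Time is passed in explicitly to keep the example deterministic; the `Lease` class and the 30-second TTL are assumptions.

```python
class Lease:
    """A shard lease that stays valid only while the worker keeps heartbeating."""

    def __init__(self, shard_id, worker_id, ttl_seconds, now):
        self.shard_id = shard_id
        self.worker_id = worker_id
        self.ttl_seconds = ttl_seconds
        self.expires_at = now + ttl_seconds

    def heartbeat(self, now):
        """Extend the lease; returns False if it already expired and must not be reused."""
        if now >= self.expires_at:
            return False
        self.expires_at = now + self.ttl_seconds
        return True

def reclaimable(leases, now):
    """Shards whose leases lapsed and can return to the queue."""
    return [lease.shard_id for lease in leases if now >= lease.expires_at]

healthy = Lease("shard-1", "worker-a", ttl_seconds=30, now=0)
stalled = Lease("shard-2", "worker-b", ttl_seconds=30, now=0)
healthy.heartbeat(now=20)                       # extends expiry to t=50
print(reclaimable([healthy, stalled], now=35))
```

A stalled worker that wakes up later sees its heartbeat fail and should stop, but the conditional output write remains the actual correctness guarantee.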

[Diagram: Offline scheduling changes the optimization target. Online endpoint: latency first. Batch fleet: goodput first. Durability tax: manifests, checkpoints, and retries consume IO. Fleet efficiency: large batches and relaxed deadlines raise utilization.]
Batch work trades immediate response for cost-efficient, recoverable execution.

How to read this diagram: The bars compare the online endpoint with the batch fleet on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (large batches and relaxed deadlines raise utilization) remains larger than the cost (manifests, checkpoints, and retries consume IO) under production traffic.

Fairness must use estimated work, not job count. One million long-document requests should not sit beside a ten-item evaluation job in plain FIFO. Token estimates, model class, deadline, tenant quota, and aging form a practical scheduling key. Reserve capacity for small jobs so bulk tenants cannot create multi-hour head-of-line blocking.
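
A scheduling key along these lines might combine the terms as below. The weights and units are illustrative assumptions rather than tuned values, and model class and reserved small-job capacity are left out to keep the sketch short.

```python
import heapq

def scheduling_key(estimated_tokens, tenant_usage_fraction,
                   seconds_to_deadline, seconds_waiting):
    """Lower keys run first. The weights are illustrative assumptions, not tuned values."""
    work = estimated_tokens / 1_000_000             # more estimated work sorts later
    quota = tenant_usage_fraction                   # tenants near their quota sort later
    urgency = 1.0 / max(seconds_to_deadline, 1.0)   # close deadlines sort earlier
    aging = seconds_waiting / 3600.0                # waiting time slowly raises priority
    return work + quota - urgency - aging

queue = []
heapq.heappush(queue, (scheduling_key(3_000_000_000, 0.9, 86_400, 7_200), "bulk-documents"))
heapq.heappush(queue, (scheduling_key(20_000, 0.1, 3_600, 60), "ten-item-eval"))

order = [heapq.heappop(queue)[1] for _ in range(2)]
print(order)
```

Even though the bulk job was submitted two hours earlier, the small evaluation runs first; the aging term is what eventually lets the bulk job through if small jobs keep arriving.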

[Diagram: Batch job lifecycle. Submitted (manifest is durable) → Running (shards hold leases) → Finalizing (results reconcile) → Terminal (complete or failed). Every transition is idempotent and recoverable after process restart.]
A job state is a durable summary of item-level truth.

How to read this diagram: State advances from Submitted to Running, Finalizing, and finally Terminal. The labels below each state identify what becomes true at that boundary. The governing invariant is: Every transition is idempotent and recoverable after process restart. Retries and cancellation must preserve the same transition rules.
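
One way to keep transitions recoverable is to recompute the job state from item records instead of storing it in worker memory. In this sketch the item states (`pending`, `leased`, `succeeded`, `failed`) and the rule that any failed item makes the job `failed` are assumptions.

```python
from collections import Counter

def derive_job_state(item_statuses, manifest_sealed):
    """Recompute job state from durable item statuses; safe to call again after a restart."""
    counts = Counter(item_statuses)
    unfinished = counts["pending"] + counts["leased"]
    if unfinished == len(item_statuses) and counts["leased"] == 0:
        return "submitted"            # nothing has been claimed yet
    if unfinished > 0:
        return "running"
    if not manifest_sealed:
        return "finalizing"           # all items terminal, output manifest not yet sealed
    return "failed" if counts["failed"] else "complete"

print(derive_job_state(["pending", "pending"], manifest_sealed=False))
print(derive_job_state(["succeeded", "leased"], manifest_sealed=False))
print(derive_job_state(["succeeded", "failed"], manifest_sealed=False))
print(derive_job_state(["succeeded", "succeeded"], manifest_sealed=True))
```

Because the function is pure, calling it twice on the same records gives the same answer, so a restarted controller cannot skip or repeat a transition.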

[Diagram: Four contracts make batch reliable. Input: immutable artifact digest; schema and model pin. Execution: lease and checkpoint; per-item retry policy. Output: stable item identity; atomic final manifest. Governance: quota and retention; audit and cost ledger. Treat storage lifecycle and inference lifecycle as one design.]
The batch API is only the front door; these contracts carry the workload.

How to read this diagram: The four panels are independent review axes: Input, Execution, Output, and Governance. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Treat storage lifecycle and inference lifecycle as one design.

[Diagram: A duplicate-processing failure chain. Lease expires (worker is only slow; shard is reclaimed) → Two workers run (same items execute; costs duplicate) → Outputs race (manifest disagrees; counts become wrong). Control: conditional item writes; reconcile by identity. Leases provide liveness; idempotent output commits provide correctness.]
Retries are normal in batch systems, so duplicate work must be harmless.

How to read this diagram: This is a causal chain, not four unrelated symptoms. Lease expires triggers Two workers run, which creates Outputs race. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Batch inference is an operational contract: durable work, explicit versions, bounded retries, and honest accounting. The GPU optimization only pays off when the job system is trustworthy.