From H100 to Blackwell: What Actually Changes for Inference Architects

Every GPU generation arrives with a fog machine of numbers. More FLOPS. More bandwidth. More memory. More charts that look like a rocket launch.

For inference architects, the useful question is simpler:

What design decisions change?

H100 made transformer inference a mainstream production platform. H200 pushed the memory story forward with more and faster HBM. Blackwell changes the architecture conversation again: lower precision, bigger scale-up domains, stronger attention and MoE paths, and a software stack increasingly built around cost per token rather than isolated kernel speed.

Let us strip away the fog machine.

H100: the baseline everybody learned

H100 became the reference point for modern LLM serving because it combined strong tensor performance, mature CUDA software, FP8 support, and a broad deployment ecosystem. If you learned vLLM, TensorRT-LLM, batching, KV cache pressure, and multi-GPU serving in production, there is a decent chance H100 was in the room.

Architecturally, H100-era inference taught three lessons:

  1. KV cache memory matters almost as much as weights.
  2. Batching and scheduling determine realized throughput.
  3. Software improvements can extend hardware life.
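
The first lesson is easy to underestimate until you do the arithmetic. A minimal sizing sketch, assuming a Llama-style dense decoder with grouped-query attention (the shapes below are illustrative, not a specific model's spec):

```python
# Rough KV cache sizing for a dense decoder with grouped-query attention.
# Every shape here is an illustrative assumption, not a specific model's spec.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors (hence the factor of 2) per layer, per token, per sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch

# A 70B-class dense model (assumed: 80 layers, 8 KV heads of dim 128),
# 8K context, 32 concurrent sequences, FP16 cache:
total = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=8192, batch=32)
print(f"KV cache ~ {total / 1e9:.0f} GB")   # ~86 GB, on the order of the weights themselves
```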

NVIDIA’s MLPerf posts have repeatedly emphasized software uplift on Hopper. That matters because inference hardware is not static. A GPU bought today can produce more useful tokens next quarter if the engine improves.

H200: memory becomes the headline

H200 is best understood as a memory-forward Hopper evolution. NVIDIA describes H200 as having 141 GB of HBM3e at 4.8 TB/s, compared with H100's 80 GB of HBM3 at roughly 3.35 TB/s on the SXM part. For LLM serving, that changes practical capacity:

  • larger models fit more comfortably
  • larger batches become possible
  • KV cache pressure relaxes
  • fewer tensor-parallel splits may be needed for some workloads

Figure: GPU generation shift for inference — a timeline from H100 (FP8-era baseline, mature software stack) through H200 (more HBM3e memory, better cache headroom) to Blackwell (NVFP4, scale-up domains, MoE and agentic focus), with the architecture moving from single-GPU speed toward fleet-level token economics.
H100 taught the playbook. H200 widened memory. Blackwell changes the system boundary.

Memory is not glamorous, but it is often the difference between “this model fits” and “please enjoy eight-way tensor parallelism for reasons.”
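
A minimal fit check makes the difference concrete. The HBM capacities below match the published parts (about 80 GB for H100 SXM, 141 GB for H200); the weight, cache, and overhead figures are illustrative assumptions:

```python
# Crude fit check: sharded weights + KV cache + runtime overhead vs. HBM per GPU.
# Capacities reflect published specs (H100 SXM ~80 GB, H200 ~141 GB); the 70 GB of
# FP8 weights, 86 GB cache budget, and 6 GB overhead are illustrative assumptions.

def fits(weights_gb, kv_cache_gb, overhead_gb, hbm_gb, tp=1):
    per_gpu = weights_gb / tp + kv_cache_gb / tp + overhead_gb
    return per_gpu <= hbm_gb * 0.9          # keep ~10% headroom for fragmentation

for name, hbm in [("H100 80GB", 80), ("H200 141GB", 141)]:
    for tp in (1, 2, 4):
        print(f"{name}  TP={tp}  fits={fits(70, 86, 6, hbm, tp)}")
# On these assumptions the H100 needs TP=4, while the H200 fits at TP=2.
```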

Blackwell: precision and scale-up change the design

Blackwell is not just “more GPU.” The architectural themes that matter for inference are:

  • fifth-generation Tensor Cores
  • native lower-precision formats such as NVFP4
  • a second-generation Transformer Engine
  • NVLink scale-up domains such as GB200 NVL72
  • better paths for MoE and reasoning models
  • software stacks tuned around throughput per watt and cost per token

NVIDIA describes GB200 NVL72 as a 72-GPU NVLink domain. NVIDIA technical material around MoE inference highlights 1.8 TB/s bidirectional bandwidth per GPU in GB200 NVL72 and emphasizes sparse MoE communication. This matters because MoE models exchange expert traffic frequently. Weak interconnects turn MoE into a networking exercise with a model attached.
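
To get a feel for why the link matters, here is a rough dispatch-cost sketch. The 1.8 TB/s figure is the per-GPU NVLink number cited above; the token count, hidden size, top-k, and the 400G comparison point are illustrative assumptions:

```python
# Rough MoE dispatch/combine cost per layer: routed tokens cross the interconnect
# twice (dispatch to experts, then combine results). Treats the cited 1.8 TB/s as
# the available per-GPU link rate; the workload shapes are illustrative assumptions.

def moe_alltoall_ms(tokens, hidden, top_k, bytes_per_elem, link_gb_per_s,
                    off_gpu_fraction=0.9):
    payload_bytes = tokens * top_k * hidden * bytes_per_elem * off_gpu_fraction
    return 2 * payload_bytes / (link_gb_per_s * 1e9) * 1e3

# 16K prefill tokens, hidden size 7168, top-8 routing, FP8 activations:
for name, bw in [("NVLink scale-up, 1.8 TB/s", 1800), ("400G NIC, ~50 GB/s", 50)]:
    print(f"{name}: {moe_alltoall_ms(16384, 7168, 8, 1, bw):.1f} ms per MoE layer")
# Multiply by 40-60 MoE layers and the slower link is no longer a rounding error.
```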

Blackwell also shifts the precision conversation. NVFP4 is interesting because inference economics love smaller data types, but quality does not love being thrown out a window. The win is not “4-bit everything always.” The win is hardware and software that make low precision usable for the right models and layers with acceptable quality.
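
A toy sketch of that "right layers" idea, using hypothetical layer names and thresholds rather than any real quantization tool's API:

```python
# Toy per-layer precision policy: protect quality-sensitive layers, demote the
# bulk of the GEMMs. The layer tags and error threshold are hypothetical; real
# tooling would drive this from calibration data against a reference model.

SENSITIVE = ("embed", "lm_head", "norm", "router")   # assumed quality-critical layers

def choose_precision(layer_name, calib_rel_error):
    if any(tag in layer_name for tag in SENSITIVE):
        return "bf16"
    if calib_rel_error < 0.01:      # assumed acceptable error vs. reference outputs
        return "nvfp4"
    return "fp8"

print(choose_precision("layers.42.mlp.down_proj", 0.004))   # nvfp4
print(choose_precision("layers.42.moe.router", 0.004))      # bf16
```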

Software becomes a first-class architecture layer

The H100-to-Blackwell story is impossible to tell honestly without software.

TensorRT-LLM improves kernels, batching, quantization, and model execution. NIM packages optimized inference microservices. Dynamo handles distributed serving concerns such as KV-aware routing, disaggregated serving, scheduling, and KV movement. vLLM and SGLang also continue pushing open serving forward.

For architects, this means GPU selection is no longer separable from engine selection.

Figure: The inference architecture stack — product SLO (useful tokens under latency targets), gateway (policy, routing, budgets), distributed serving (Dynamo / engine scheduler / KV movement), engine (TensorRT-LLM / vLLM / SGLang), and GPU (HBM, Tensor Cores, NVLink, precision). Blackwell is a stack conversation.
The GPU matters more when the layers above it know how to use what changed.

If you run Blackwell like a bigger H100, you will get some benefit. If you redesign around larger scale-up domains, lower precision, MoE communication, and distributed cache movement, you get the architectural benefit.

Migration planning questions

Before moving a workload forward a generation, I would ask:

  • Does the model become memory-bound, compute-bound, or communication-bound at target concurrency? (A rough version of this check follows the list.)
  • Which precision formats preserve quality for this model and prompt distribution?
  • Does the workload benefit from larger NVLink scale-up domains, or is it mostly embarrassingly parallel?
  • How much of the current cost is recomputing prefixes that a better cache path could avoid?
  • Can the serving engine expose the metrics needed to prove the new hardware is helping?
  • Is the gateway ready to route by model, precision, cache locality, and SLO class?
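
For that first question, a crude decode-phase check goes a long way. The 4.8 TB/s bandwidth is the H200 figure cited earlier; the FLOP count and workload sizes are illustrative assumptions:

```python
# Crude decode-phase bound check at batch size 1: compare the time to stream the
# weights and hot KV cache once against the tensor-math time for one token.
# Communication-bound cases need a similar check against interconnect bandwidth.

def decode_bound(weights_gb, kv_gb, flops_per_token, hbm_tb_per_s, peak_tflops):
    t_mem = (weights_gb + kv_gb) / (hbm_tb_per_s * 1e3)     # seconds to read HBM once
    t_math = flops_per_token / (peak_tflops * 1e12)         # seconds of matmul work
    return "memory-bound" if t_mem > t_math else "compute-bound"

# 70 GB of FP8 weights, 40 GB of hot KV cache, ~140 GFLOPs per decoded token,
# 4.8 TB/s of HBM bandwidth, ~2000 dense FP8 TFLOPS (all illustrative):
print(decode_bound(70, 40, 140e9, 4.8, 2000))   # memory-bound, as low-batch decode usually is
```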

The best migration plans are not “replace H100 with Blackwell.” They are “move this workload to a different hardware-software operating point and prove the token economics improved.”

What actually changes in design reviews

Ask different questions:

Before: Can the model fit on one GPU?

Now: What model, precision, KV cache size, and concurrency target fit inside the latency SLO?

Before: What is peak tokens per second?

Now: What is cost per useful token at target TTFT and TPOT?

Before: Do we need tensor parallelism?

Now: Which parallelism strategy matches prefill, decode, and MoE communication?

Before: Is the backend healthy?

Now: Is the backend meeting token SLOs, preserving cache locality, and avoiding queue buildup?

Before: Which GPU is fastest?

Now: Which hardware-software stack gives the best realized economics for this workload?
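
The cost question deserves to be literal. A minimal sketch, with placeholder prices and traffic, where only tokens from requests that met their TTFT and TPOT targets count as useful:

```python
# Cost per useful token: GPU spend divided by the tokens that actually met the SLO.
# The hourly rate, token volume, and SLO hit rate are placeholders, not quotes.

def cost_per_useful_token(gpu_hours, dollars_per_gpu_hour, tokens_generated, slo_hit_rate):
    useful_tokens = tokens_generated * slo_hit_rate
    return (gpu_hours * dollars_per_gpu_hour) / useful_tokens

# 8 GPUs for one hour at an assumed $4/GPU-hour, 30M output tokens, 92% within SLO:
per_million = cost_per_useful_token(8, 4.0, 30_000_000, 0.92) * 1e6
print(f"${per_million:.2f} per million useful tokens")
```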

Caveats, because reality is rude

Blackwell will not fix:

  • bad prompts
  • poor batching
  • cache-hostile routing
  • bad model choice
  • missing cancellation
  • low-quality quantization
  • unmeasured workload distributions

Also, vendor benchmarks are not your workload. MLPerf is useful because it is standardized and public, but production traffic has different sequence lengths, tenant behavior, cache reuse, retries, and quality constraints. Use public benchmarks to choose what to test. Use your own workload to choose what to buy.

Closing

The move from H100 to H200 to Blackwell is not just a chart moving up and to the right. It is a shift from GPU-centric inference toward stack-centric inference.

H100 made the modern playbook real. H200 gave memory-heavy workloads more room. Blackwell pushes architects toward lower precision, larger NVLink domains, MoE-aware design, and software that treats tokens as the economic unit.

That is the real change: the winning architecture is no longer the one with the biggest accelerator number. It is the one that turns the full stack into useful tokens, predictably and cheaply.

The spreadsheet may still have FLOPS in it. Just do not let FLOPS drive.

Sources and receipts