13/20 - Graph Optimization: Teaching ONNX and TensorRT to See the Whole Model

#onnx #tensorrt #graph-optimization #kernel-fusion #inference

A framework executes the model you wrote. An optimizing runtime tries to execute the model you meant. By seeing the graph as a whole, it can erase redundant work, fold constants, fuse operators, select layouts, and build kernels that no individual layer requested.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Editing a travel itinerary one trip at a time misses that three errands share the same route. Graph optimization lays the whole map on the table and combines compatible journeys.

Follow the state and work from left to right.

Description: Start with the exported graph, where shapes are inferred. In the middle stage, the graph optimizer fuses patterns. The final stage, the engine plan, shows the observable result: selected kernels. The arrows describe dependency order, not necessarily separate services.

What actually happens

Basic rewrites remove identities, unused nodes, redundant casts, and constant subgraphs. Extended passes recognize patterns such as MatMul plus bias plus GELU, attention blocks, or skip-layer normalization and replace them with fused implementations.
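To make the basic rewrites concrete, here is a minimal sketch on a toy graph. The node format (output name, op, input names) and the `optimize` helper are assumptions made for illustration; they are not the ONNX schema or any runtime's API.

```python
import operator

# Toy graph format (hypothetical): each node is (output_name, op, input_names).
OPS = {"Add": operator.add, "Mul": operator.mul}

def optimize(nodes, constants):
    """Fold nodes whose inputs are all build-time constants and drop Identity nodes."""
    constants = dict(constants)   # values known at build time
    alias = {}                    # Identity output -> the tensor it forwards
    kept = []
    for out, op, inputs in nodes:
        inputs = [alias.get(name, name) for name in inputs]
        if op == "Identity":
            alias[out] = inputs[0]
        elif op in OPS and all(name in constants for name in inputs):
            constants[out] = OPS[op](*(constants[name] for name in inputs))
        else:
            kept.append((out, op, inputs))
    return kept, constants

nodes = [
    ("scale", "Mul", ["two", "three"]),   # constant subgraph: 2 * 3
    ("x_id", "Identity", ["x"]),          # no-op
    ("y", "Mul", ["x_id", "scale"]),      # depends on the runtime input x
]
kept, constants = optimize(nodes, {"two": 2.0, "three": 3.0})
print(len(nodes), "->", len(kept), "nodes;", "scale =", constants["scale"])
```

Three nodes become one: the constant product is computed once at build time, and the remaining Mul reads `x` directly.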

ONNX Runtime can optimize online when a session starts or serialize an offline-optimized model. TensorRT builds an engine for target hardware, shapes, and precision constraints, timing candidate tactics before selecting implementations.

Graph partitioning matters when multiple execution providers are present. Unsupported operators create boundaries, device copies, and lost fusion opportunities. A graph that is mostly on GPU can still be slow if a small CPU island forces synchronization.
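The cost of a small CPU island can be seen in a rough sketch that groups a topologically ordered op list into per-device segments. The op names (including `CustomRotary`) and the supported set are hypothetical, and real partitioners work on graphs rather than flat lists.

```python
from itertools import groupby

def partition(ops, gpu_supported):
    """Group an ordered op list into contiguous per-device segments."""
    devices = ["gpu" if op in gpu_supported else "cpu" for op in ops]
    segments = [(device, len(list(run))) for device, run in groupby(devices)]
    transitions = len(segments) - 1   # each boundary implies a copy or a sync
    return segments, transitions

ops = ["MatMul", "Add", "CustomRotary", "MatMul", "Softmax"]
segments, transitions = partition(ops, gpu_supported={"MatMul", "Add", "Softmax"})
print(segments, transitions)
```

One unsupported operator out of five splits the graph into three segments and adds two device transitions, and no fusion can span those boundaries.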

A worked example

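Suppose a transformer MLP appears as MatMul, Add, GELU, MatMul, Add. Executed operator by operator, that is five kernel launches, each writing its intermediate tensor to memory for the next to read back. A supported fusion can fold the bias and activation work into the matrix multiplications, so fewer kernels run, with less launch overhead and fewer HBM round trips, while preserving semantics.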
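The MLP example can be sketched as greedy pattern matching over a linear op chain. The fused op names are illustrative labels, not kernels from ONNX Runtime or TensorRT, and which patterns actually fuse depends on the runtime and hardware.

```python
# Fusion patterns, longest first. Fused names are illustrative labels only.
PATTERNS = [
    (("MatMul", "Add", "Gelu"), "FusedMatMulBiasGelu"),
    (("MatMul", "Add"), "FusedMatMulBias"),
]

def fuse(ops):
    """Greedily replace matching runs in a linear op chain with fused ops."""
    fused, i = [], 0
    while i < len(ops):
        for pattern, name in PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append(name)
                i += len(pattern)
                break
        else:                      # no pattern matched at position i
            fused.append(ops[i])
            i += 1
    return fused

mlp = ["MatMul", "Add", "Gelu", "MatMul", "Add"]
fused_mlp = fuse(mlp)
print(len(mlp), "launches ->", len(fused_mlp), ":", fused_mlp)
```

Under these assumed patterns the five launches collapse to two, and the intermediates between MatMul, Add, and GELU never leave the kernel.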

The performance model

A useful model is: optimized latency = kernel work + launches + memory traffic + device transitions. Graph rewrites attack all four terms, but engine build time and hardware specificity shift cost from request time to build and deployment.
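The additive model can be written down directly. Every number below (launch cost, bandwidth, transition cost, byte counts) is a made-up placeholder chosen for round arithmetic, and the model ignores overlap between compute and memory traffic.

```python
def estimated_latency_us(kernel_us, launches, bytes_moved, transitions,
                         launch_us=5.0, bandwidth_gb_per_s=2000.0, transition_us=50.0):
    """Additive latency estimate in microseconds; all defaults are placeholders."""
    memory_us = bytes_moved / (bandwidth_gb_per_s * 1e3)   # 1 GB/s = 1e3 bytes per microsecond
    return kernel_us + launches * launch_us + memory_us + transitions * transition_us

baseline = estimated_latency_us(kernel_us=400.0, launches=5, bytes_moved=64e6, transitions=0)
fused = estimated_latency_us(kernel_us=400.0, launches=2, bytes_moved=32e6, transitions=0)
cpu_island = estimated_latency_us(kernel_us=400.0, launches=2, bytes_moved=32e6, transitions=2)
print(baseline, fused, cpu_island)
```

With these placeholders, fusion saves about 31 microseconds, while two device transitions cost 100 and erase the gain. The point is the shape of the trade-off, not the specific values.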

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how graph optimization changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Dynamic shapes widen the range of inputs an engine accepts but limit how tightly its kernels can be specialized. Production engines often define optimization profiles with minimum, optimum (typical), and maximum shapes. Profiles should reflect real prompt and batch distributions, not arbitrary extremes.
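One way to derive a profile from traffic rather than from guesses is to take percentiles of observed sequence lengths. This is a sketch under assumptions: the percentile cut-offs and the synthetic sample are hypothetical, and `profile_from_samples` is not part of any runtime.

```python
import statistics

def profile_from_samples(seq_lens, low_pct=1, high_pct=99):
    """Derive min/opt/max sequence lengths from observed traffic percentiles."""
    cuts = statistics.quantiles(seq_lens, n=100, method="inclusive")  # 99 cut points
    return {
        "min": int(cuts[low_pct - 1]),
        "opt": int(statistics.median(seq_lens)),
        "max": int(cuts[high_pct - 1]),
    }

# Synthetic traffic: lengths 100..199 plus a single 4096-token outlier.
observed = list(range(100, 200)) + [4096]
profile = profile_from_samples(observed)
print(profile)
```

The single outlier does not stretch the profile. Requests beyond `max` then need an explicit policy, such as a separate long-context profile or rejection, rather than one enormous range.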

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, the framework graph, characterized by operator-by-operator execution and generic layouts. The right panel is the optimized engine, whose cost profile is shaped by constant folding, fusion, and hardware-selected tactics. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Stable production models and shape ranges
Graphs with fusible transformer patterns
Deployments that can cache target-specific engines

Where it disappoints

Exporting unsupported custom operators
Using one enormous dynamic-shape profile
Assuming optimization preserved quality without checks
Rebuilding engines on every startup

Production checklist

Inspect unsupported nodes and device partitions
Define realistic shape profiles
Serialize optimized artifacts when appropriate
Compare outputs within numerical tolerances
Profile layer and fusion reports

What to measure

Node count before and after optimization
Kernel launches per request
CPU-GPU transfer count
Engine build and load time
Latency by optimization profile

From one GPU to a production service

A developer can rebuild an engine whenever code starts. A production fleet should build once in a controlled pipeline, sign the artifact, scan plugins, attach provenance, and distribute an immutable engine compatible with the target GPU class.

Model validation must compare both semantics and performance. An output can be numerically correct while a lost fusion doubles latency. Conversely, a fast engine can violate task quality through an unsafe precision or approximation choice.
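The semantic half of that comparison can be sketched as an element-wise tolerance check. The tolerances and sample values are placeholders; appropriate bounds depend on the precision mode and the task, and this check does not replace task-level evaluation.

```python
def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-5):
    """Return (worst absolute error, number of elements outside tolerance)."""
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length")
    worst, violations = 0.0, 0
    for ref, got in zip(reference, candidate):
        err = abs(ref - got)
        worst = max(worst, err)
        if err > atol + rtol * abs(ref):   # same form as a typical allclose check
            violations += 1
    return worst, violations

reference = [1.0, -2.0, 0.5]        # e.g. FP32 framework output (made-up values)
candidate = [1.0004, -2.001, 0.7]   # e.g. reduced-precision engine output
worst, violations = compare_outputs(reference, candidate)
print(worst, violations)
```

Here two elements drift within tolerance and one does not. A release gate would pair this check with a latency comparison, since either half can fail independently.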

Plan for rollback at artifact level. Keep the previous engine loadable, route canary traffic to the new build, and compare layer reports plus request traces before broad promotion.

Design-review questions

Is the engine built for the exact runtime and GPU class?
Which graph partitions or fusions changed from the prior version?
Are optimization profiles derived from production shapes?
Can plugins be reproduced and security-reviewed?
What automatic signal triggers rollback?

How it connects to the rest of the series

Mixed precision and quantized kernels give the optimizer more tactic choices. FlashAttention may appear as a specialized fused operator. Dynamic batching determines the shapes the engine sees.

From equation to implementation

Graph optimization is constrained by semantics, shapes, and side effects. Constant folding is safe only for values known at build time. Fusion requires compatible layouts, precision, and consumers. A single extra consumer of an intermediate can prevent an otherwise obvious fusion.
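The extra-consumer constraint can be illustrated with a small guard on the same toy node format of (output name, op, input names). This is a hypothetical helper; real optimizers also check layouts, precision, and side effects.

```python
from collections import Counter

def can_fuse_away(tensor, nodes, graph_outputs=()):
    """An intermediate can disappear into a fused kernel only if exactly one
    node reads it and it is not itself a graph output."""
    uses = Counter(name for _, _, inputs in nodes for name in inputs)
    return uses[tensor] == 1 and tensor not in graph_outputs

nodes = [
    ("mm", "MatMul", ["x", "w"]),
    ("biased", "Add", ["mm", "b"]),
    ("act", "Gelu", ["biased"]),
]
print(can_fuse_away("biased", nodes))      # True: only Gelu reads it

# A residual branch that also reads "biased" forces it to be materialized.
with_branch = nodes + [("skip", "Add", ["biased", "x"])]
print(can_fuse_away("biased", with_branch))   # False
```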

TensorRT tactic selection benchmarks candidate kernels within a workspace and precision budget. The resulting engine encodes choices for a GPU family and optimization profiles. Reproducible deployment therefore versions the source model, builder version, flags, plugins, calibration data, and target compute capability.
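One lightweight way to version those inputs together is to hash a canonical description of the build. The field names and values below are hypothetical placeholders, not a TensorRT manifest format.

```python
import hashlib
import json

def engine_fingerprint(build_inputs):
    """Hash a canonical JSON form of everything that influenced the build."""
    canonical = json.dumps(build_inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# All values are illustrative placeholders.
build = {
    "model_sha256": "placeholder-model-hash",
    "builder_version": "placeholder-builder-version",
    "flags": ["fp16"],
    "plugins": {"rotary_plugin": "placeholder-plugin-version"},
    "calibration_data_sha256": None,
    "compute_capability": "8.0",
}
other_gpu = dict(build, compute_capability="9.0")
print(engine_fingerprint(build) != engine_fingerprint(other_gpu))   # True
```

Storing the fingerprint with the engine lets a loader refuse an artifact whose recorded build inputs do not match the target it is about to serve.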

Implementation sketch

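# Pseudocode: each name is a pipeline stage, not a real library function.
onnx_model = export_model_to_onnx(opset, dynamic_axes)
run_onnx_checker_and_shape_inference(onnx_model)
optimized_graph = optimize_basic_and_extended_graph(onnx_model)
partition_by_execution_provider(optimized_graph)   # record partitions for later diffs
for profile in production_shape_profiles:
    optimized_engine = build_tensorrt_engine(optimized_graph, profile, precision_constraints)
    inspect_fusions_and_fallbacks(optimized_engine)
    # Validate and publish per profile, so no engine ships uncertified.
    validate_outputs(reference_runtime, optimized_engine)
    publish_versioned_engine_artifact(optimized_engine)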

Capacity planning

Builder workspace can be much larger than runtime workspace and should not be confused with steady inference memory. Each shape profile may add tactics or buffers. Limit profiles to useful ranges and build separate engines when workloads are materially different.

Benchmarking without fooling yourself

Compare cold build, warm load, and steady request latency.
Dump layer information and count fused versus fallback regions.
Test every optimization profile boundary and typical shape.
Use tolerance plus task-level validation for lower precision engines.

A production failure to design for

An ONNX export upgrades the opset and now represents rotary embedding through a pattern the optimizer does not recognize. The TensorRT execution provider partitions the graph around the gap where a plugin would be needed, adding device copies. The engine builds successfully but runs slower. Gate releases on partition and fusion diffs, not build success.
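A release gate of that kind can be as small as a diff between two build reports. The report structure (a partition count and a list of fusion names) is an assumption for illustration; real layer reports would need to be parsed into this form first.

```python
def release_gate(previous, candidate):
    """Return reasons to block a release; an empty list means the gate passes."""
    reasons = []
    if candidate["partitions"] > previous["partitions"]:
        reasons.append(
            f"partitions grew from {previous['partitions']} to {candidate['partitions']}"
        )
    lost = sorted(set(previous["fusions"]) - set(candidate["fusions"]))
    if lost:
        reasons.append("lost fusions: " + ", ".join(lost))
    return reasons

# Hypothetical reports from the prior and the new build.
previous = {"partitions": 1, "fusions": ["attention", "mlp_bias_gelu", "rotary"]}
candidate = {"partitions": 3, "fusions": ["attention", "mlp_bias_gelu"]}
print(release_gate(previous, candidate))
```

Both builds "succeed", but the gate reports the new partitions and the lost rotary fusion before any traffic reaches the slower engine.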

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Export to Build, then Inspect and Publish. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Graph optimization converts framework-level operations into an execution plan for known shapes, dtypes, and hardware. Constant folding removes fixed work, algebraic simplification collapses redundant paths, fusion reduces launches and intermediate tensors, and memory planning reuses buffers whose lifetimes do not overlap.
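The buffer-reuse idea can be sketched as a greedy plan over tensor lifetimes. This is a simplified first-fit illustration with hypothetical sizes and step indices, not the memory planner of any particular runtime.

```python
def plan_buffers(tensors):
    """tensors: list of (name, size_bytes, first_step, last_step).
    Reuse a buffer once its previous tenant's lifetime has ended."""
    buffers = []       # each entry: {"size": int, "busy_until": int}
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for idx, buf in enumerate(buffers):
            if buf["busy_until"] < first and buf["size"] >= size:
                buf["busy_until"] = last
                assignment[name] = idx
                break
        else:          # no free buffer is large enough: allocate a new one
            buffers.append({"size": size, "busy_until": last})
            assignment[name] = len(buffers) - 1
    return assignment, sum(buf["size"] for buf in buffers)

# h1 is produced at step 0 and last read at step 1, and so on.
tensors = [("h1", 4096, 0, 1), ("h2", 4096, 1, 2), ("h3", 4096, 2, 3)]
assignment, total_bytes = plan_buffers(tensors)
print(assignment, total_bytes)
```

Because `h1` is dead by the time `h3` is produced, the two share a buffer, and three intermediates fit in 8192 bytes instead of 12288.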

Optimization is a chain of semantic rewrites followed by hardware specialization.

Description: Follow the state from Export through Rewrite and Compile to Validate. Each box is an ownership or computation boundary. In particular, the engine is valid only for its model, profiles, plugins, and hardware target. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Dynamic shapes are bounded contracts, not unlimited flexibility. TensorRT profiles define minimum, optimum, and maximum shapes; tactics selected near the optimum may perform poorly at extremes. Multiple profiles can preserve performance but increase build time and engine size. Route requests to an explicit profile and observe fallback or reformat layers.
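Explicit routing can be sketched as choosing the tightest profile that contains the request shape. The profile names and bounds are hypothetical, and shapes are reduced to (batch, sequence length) for brevity.

```python
def select_profile(profiles, batch, seq_len):
    """Return the name of the tightest profile containing the shape, or None."""
    fits = [
        p for p in profiles
        if p["min"][0] <= batch <= p["max"][0] and p["min"][1] <= seq_len <= p["max"][1]
    ]
    if not fits:
        return None    # caller decides: reject, queue, or route elsewhere
    return min(fits, key=lambda p: p["max"][0] * p["max"][1])["name"]

profiles = [
    {"name": "short", "min": (1, 1), "max": (8, 512)},
    {"name": "long", "min": (1, 1), "max": (4, 4096)},
]
print(select_profile(profiles, batch=2, seq_len=128))    # short
print(select_profile(profiles, batch=2, seq_len=2048))   # long
print(select_profile(profiles, batch=8, seq_len=2048))   # None
```

Returning `None` rather than silently picking the nearest profile keeps out-of-contract shapes visible in metrics.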

Graph speedup depends on actual fusion coverage and profile selection.

Description: The bars compare Unfused graph with Fused engine on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain ("less launch overhead and intermediate memory traffic") remains larger than the risk ("a shape outside profiles may fail or take a fallback path") under production traffic.

Custom plugins extend unsupported operators but become part of the trusted runtime. Version their ABI, numerical contract, workspace requirements, and supported formats. A plugin that silently accepts an unsupported dtype can produce plausible but incorrect output.
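The dtype failure mode is cheap to guard against at the plugin boundary. The following is a minimal sketch of such a check, with a hypothetical supported set; it is not TensorRT's plugin interface.

```python
# Hypothetical contract: the dtypes this plugin was actually validated against.
SUPPORTED_DTYPES = frozenset({"float16", "float32"})

def check_plugin_input(name, dtype, supported=SUPPORTED_DTYPES):
    """Raise instead of silently computing on a dtype the plugin was never tested with."""
    if dtype not in supported:
        raise TypeError(
            f"plugin input {name!r} has unsupported dtype {dtype}; "
            f"expected one of {sorted(supported)}"
        )
    return dtype

print(check_plugin_input("q", "float16"))
try:
    check_plugin_input("q", "int8")
except TypeError as exc:
    print("rejected:", exc)
```

Failing loudly at build or load time turns a plausible-but-wrong output into an error that a pipeline can catch.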

Treat engines as immutable release artifacts with provenance.

Description: State advances from Exported to Optimized, Compiled, and finally Certified. The labels below each state identify what becomes true at that boundary. The governing invariant is: Any dependency, GPU, profile, or plugin change requires compatibility review. Retries and cancellation must preserve the same transition rules.

Optimization is production-ready only when every supported profile is certified.

Description: The four panels are independent review axes: Semantic, Coverage, Performance, and Operations. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: A successful engine build is the start of validation, not the end.

Shape governance prevents dynamic inputs from bypassing compile-time assumptions.

Description: This is a causal chain, not four unrelated symptoms. An out-of-profile shape arrives, the runtime adapts to it, and that adaptation breaks the SLO. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Graph optimization converts a correct model into a hardware plan. The gain comes from global visibility, but so does the responsibility to validate shapes, precision, and portability.
