13/20 - Graph Optimization: Teaching ONNX and TensorRT to See the Whole Model
A framework executes the model you wrote. An optimizing runtime tries to execute the model you meant. By seeing the graph as a whole, it can erase redundant work, fold constants, fuse operators, select layouts, and build kernels that no individual layer requested.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
Editing a travel itinerary one trip at a time misses that three errands share the same route. Graph optimization lays the whole map on the table and combines compatible journeys.
How to read this diagram: Start with the Exported graph, where shapes are inferred. The middle stage, the Graph optimizer, fuses patterns. The final stage, the Engine plan, shows the observable result: selected kernels. The arrows describe dependency order, not necessarily separate services.
What actually happens
Basic rewrites remove identities, unused nodes, redundant casts, and constant subgraphs. Extended passes recognize patterns such as MatMul plus bias plus GELU, attention blocks, or skip-layer normalization and replace them with fused implementations.
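The basic rewrites are easy to see on a toy graph. The following is a minimal sketch, not how ONNX Runtime or TensorRT implement these passes: the `Node` class, the `optimize` function, and the tiny op set are all hypothetical, and the node list is assumed to be in topological order.

```python
from dataclasses import dataclass
from typing import Optional
import operator

@dataclass
class Node:
    name: str                      # name of the tensor this node produces
    op: str                        # "Input", "Const", "Identity", "Add", or "Mul"
    inputs: tuple = ()
    value: Optional[float] = None  # set only for Const nodes

FOLDABLE = {"Add": operator.add, "Mul": operator.mul}

def optimize(nodes, graph_outputs):
    """Remove Identity nodes, fold constant subgraphs, and drop unused nodes."""
    alias = {}   # output of a removed Identity -> the tensor it forwarded
    consts = {}  # tensor name -> value known at build time
    rewritten = []
    for n in nodes:
        inputs = tuple(alias.get(i, i) for i in n.inputs)
        if n.op == "Identity":
            alias[n.name] = inputs[0]
        elif n.op == "Const":
            consts[n.name] = n.value
            rewritten.append(n)
        elif n.op in FOLDABLE and all(i in consts for i in inputs):
            # Every input is known at build time, so compute the result now.
            value = FOLDABLE[n.op](*(consts[i] for i in inputs))
            consts[n.name] = value
            rewritten.append(Node(n.name, "Const", (), value))
        else:
            rewritten.append(Node(n.name, n.op, inputs, n.value))
    # Dead-node elimination: walk backwards, keeping only what outputs need.
    needed = {alias.get(o, o) for o in graph_outputs}
    kept = []
    for n in reversed(rewritten):
        if n.name in needed:
            kept.append(n)
            needed.update(n.inputs)
    return list(reversed(kept))

graph = [
    Node("x", "Input"),
    Node("two", "Const", value=2.0),
    Node("three", "Const", value=3.0),
    Node("six", "Mul", ("two", "three")),  # constant subgraph
    Node("x_id", "Identity", ("x",)),      # redundant identity
    Node("y", "Add", ("x_id", "six")),
]
optimized = optimize(graph, ["y"])
```

In this sketch six nodes shrink to three: the input, one folded constant, and the `Add` that actually depends on runtime data.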
ONNX Runtime can optimize online when a session starts or serialize an offline-optimized model. TensorRT builds an engine for target hardware, shapes, and precision constraints, timing candidate tactics before selecting implementations.
Graph partitioning matters when multiple execution providers are present. Unsupported operators create boundaries, device copies, and lost fusion opportunities. A graph that is mostly on GPU can still be slow if a small CPU island forces synchronization.
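A rough way to reason about partition boundaries is to count them. The sketch below assumes a simplified linear chain of nodes, each already assigned to a provider; the node names and the `partitions` and `device_transitions` helpers are hypothetical, and real partitioners work on general graphs.

```python
from itertools import groupby

def partitions(assignments):
    """Group a topologically ordered list of (node, provider) pairs
    into contiguous runs that share a provider."""
    return [
        (provider, [name for name, _ in run])
        for provider, run in groupby(assignments, key=lambda pair: pair[1])
    ]

def device_transitions(assignments):
    """Count boundaries between consecutive partitions."""
    return max(len(partitions(assignments)) - 1, 0)

# Hypothetical assignment: one unsupported operator falls back to CPU.
chain = [
    ("MatMul_1", "GPU"),
    ("Add_1", "GPU"),
    ("CustomRotary", "CPU"),
    ("MatMul_2", "GPU"),
    ("Softmax", "GPU"),
]
parts = partitions(chain)
transitions = device_transitions(chain)
```

Four of the five nodes run on the GPU, yet the single CPU node splits the graph into three partitions with two transitions, and no fusion can span either boundary.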
A worked example
Suppose a transformer MLP appears as MatMul, Add, GELU, MatMul, Add. Executed one operator at a time, that is five kernel launches, with an intermediate tensor materialized between each pair. A supported fusion can combine bias and activation work into fewer kernels, reducing launch overhead and HBM round trips while preserving semantics.
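The MLP example can be sketched as pattern matching over a linear chain of op types. This is an illustration under stated assumptions: the fused op names are hypothetical, and which fusions are actually available depends on the runtime, precision, and hardware.

```python
# Longer patterns come first so the greedy match prefers them.
FUSION_PATTERNS = [
    (("MatMul", "Add", "GELU"), "FusedMatMulBiasGelu"),  # hypothetical fused op
    (("MatMul", "Add"), "FusedMatMulBias"),              # hypothetical fused op
]

def fuse_chain(ops):
    """Greedily replace known op sequences in a linear chain with fused ops."""
    fused, i = [], 0
    while i < len(ops):
        for pattern, replacement in FUSION_PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append(replacement)
                i += len(pattern)
                break
        else:
            # No pattern starts here, so keep the op unchanged.
            fused.append(ops[i])
            i += 1
    return fused

mlp = ["MatMul", "Add", "GELU", "MatMul", "Add"]
fused_mlp = fuse_chain(mlp)
```

Under these assumed patterns the five-node chain collapses to two fused nodes, so three launches and the intermediates between them disappear.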
The performance model
A useful model: optimized latency = kernel work + launch overhead + memory traffic + device transitions. Graph rewrites attack all four terms, but engine build time and hardware specificity move cost to deployment.
How to read this diagram: The left panel asks how graph optimization changes prompt processing and time to first token (TTFT); the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
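The four-term model can be written down directly. The sketch below assumes the terms simply add, with no overlap between compute and memory traffic, which real hardware does not guarantee. Every number is illustrative rather than measured, and the `GraphCost` fields are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class GraphCost:
    kernel_work_us: float    # arithmetic time inside kernels
    launches: int            # number of kernel launches
    bytes_moved: float       # HBM bytes read plus written
    device_transitions: int  # CPU/GPU boundaries crossed

def estimated_latency_us(cost, launch_overhead_us, hbm_bytes_per_us, transition_us):
    """Additive latency estimate: work + launches + memory traffic + transitions."""
    return (
        cost.kernel_work_us
        + cost.launches * launch_overhead_us
        + cost.bytes_moved / hbm_bytes_per_us
        + cost.device_transitions * transition_us
    )

# Illustrative inputs only: same arithmetic, fewer launches, less traffic.
unfused = GraphCost(kernel_work_us=100.0, launches=5, bytes_moved=8e6, device_transitions=0)
fused = GraphCost(kernel_work_us=100.0, launches=2, bytes_moved=4e6, device_transitions=0)
hardware = dict(launch_overhead_us=5.0, hbm_bytes_per_us=1e6, transition_us=50.0)

unfused_us = estimated_latency_us(unfused, **hardware)
fused_us = estimated_latency_us(fused, **hardware)
```

The value of the model is in separating the terms: if the kernel work term dominates, fusion changes little, and the measurement effort belongs elsewhere.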
Expert lens
Dynamic shapes widen the range of inputs an engine accepts but can prevent specialization. Production engines often define optimization profiles with minimum, optimum, and maximum shapes. Profiles should reflect real prompt and batch distributions, not arbitrary extremes.
How to read this diagram: The left panel is the baseline, Framework graph, characterized by operator-by-operator execution and generic layouts. The right panel is the Optimized engine, characterized by constant folding, fusion, and hardware-selected tactics. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.
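One way to ground a profile in traffic is to derive its bounds from observed sequence lengths. This is a minimal sketch that assumes percentile cutoffs are an acceptable policy; the 1st and 99th percentile defaults and the `derive_profile` helper are hypothetical choices, not a TensorRT API.

```python
import math
import statistics

def derive_profile(observed_lengths, low_pct=1, high_pct=99):
    """Return (min, opt, max) sequence lengths from observed traffic.

    Uses percentiles so a single outlier does not stretch the profile.
    """
    cuts = statistics.quantiles(observed_lengths, n=100, method="inclusive")
    lo = max(1, math.floor(cuts[low_pct - 1]))
    opt = int(statistics.median(observed_lengths))
    hi = math.ceil(cuts[high_pct - 1])
    return lo, opt, hi

# Illustrative traffic: mostly short prompts plus one extreme outlier.
lengths = list(range(1, 101)) + [8192]
lo, opt, hi = derive_profile(lengths)
```

Requests that fall above the derived maximum still need an explicit policy, such as a separate long-context engine or a rejection path.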
Where it wins
- Stable production models and shape ranges
- Graphs with fusible transformer patterns
- Deployments that can cache target-specific engines
Where it disappoints
- Exporting unsupported custom operators
- Using one enormous dynamic-shape profile
- Assuming optimization preserved quality without checks
- Rebuilding engines on every startup
Production checklist
- Inspect unsupported nodes and device partitions
- Define realistic shape profiles
- Serialize optimized artifacts when appropriate
- Compare outputs within numerical tolerances
- Profile layer and fusion reports
What to measure
- Node count before and after optimization
- Kernel launches per request
- CPU-GPU transfer count
- Engine build and load time
- Latency by optimization profile
From one GPU to a production service
A developer can rebuild an engine every time the process starts. A production fleet should build once in a controlled pipeline, sign the artifact, scan plugins, attach provenance, and distribute an immutable engine compatible with the target GPU class.
Model validation must compare both semantics and performance. An output can be numerically correct while a lost fusion doubles latency. Conversely, a fast engine can violate task quality through an unsafe precision or approximation choice.
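A release gate that checks both sides might look like the sketch below. The tolerances and the 10% latency budget are hypothetical placeholders, outputs are flattened to plain lists of floats for simplicity, and element-wise tolerance does not replace task-level quality checks.

```python
import math

def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Element-wise comparison of two flat output lists within tolerance."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )

def release_gate(reference, candidate, baseline_ms, candidate_ms, max_regression=1.10):
    """Return the reasons to block a release; an empty list means pass."""
    failures = []
    if not outputs_match(reference, candidate):
        failures.append("numerical mismatch")
    if candidate_ms > baseline_ms * max_regression:
        failures.append("latency regression")
    return failures

ok = release_gate([1.0, 2.0], [1.0005, 2.0], baseline_ms=10.0, candidate_ms=10.5)
slow = release_gate([1.0], [1.0], baseline_ms=10.0, candidate_ms=20.0)
wrong = release_gate([1.0], [1.5], baseline_ms=10.0, candidate_ms=9.0)
```

The `slow` case is the lost-fusion scenario: numerically correct, still blocked. The `wrong` case is the opposite: faster, but outside tolerance.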
Plan for rollback at artifact level. Keep the previous engine loadable, route canary traffic to the new build, and compare layer reports plus request traces before broad promotion.
Design-review questions
- Is the engine built for the exact runtime and GPU class?
- Which graph partitions or fusions changed from the prior version?
- Are optimization profiles derived from production shapes?
- Can plugins be reproduced and security-reviewed?
- What automatic signal triggers rollback?
How it connects to the rest of the series
Mixed precision and quantized kernels give the optimizer more tactic choices. FlashAttention may appear as a specialized fused operator. Dynamic batching determines the shapes the engine sees.
From equation to implementation
Graph optimization is constrained by semantics, shapes, and side effects. Constant folding is safe only for values known at build time. Fusion requires compatible layouts, precision, and consumers. A single extra consumer of an intermediate can prevent an otherwise obvious fusion.
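The extra-consumer rule can be checked mechanically. The sketch below is a deliberate simplification that treats an intermediate as fusible only when exactly one node reads it and it is not a graph output; real optimizers may instead duplicate cheap work. The tuple layout and helper names are hypothetical.

```python
from collections import Counter

def consumer_counts(nodes):
    """Map each tensor name to the number of nodes that read it.

    Each node is an (output_name, op_type, input_names) tuple.
    """
    return Counter(t for _, _, inputs in nodes for t in inputs)

def can_fuse_away(tensor, nodes, graph_outputs=()):
    """True if removing `tensor` as a materialized intermediate is safe here."""
    return consumer_counts(nodes)[tensor] == 1 and tensor not in graph_outputs

mlp = [
    ("h", "MatMul", ("x", "w")),
    ("hb", "Add", ("h", "b")),
    ("y", "GELU", ("hb",)),
]
# Same graph, plus a debugging probe that also reads the biased activation.
mlp_with_probe = mlp + [("stats", "ReduceMean", ("hb",))]

fusible = can_fuse_away("hb", mlp)
blocked = can_fuse_away("hb", mlp_with_probe)
```

The probe node is the kind of innocuous addition that silently turns one fused kernel back into three.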
TensorRT tactic selection benchmarks candidate kernels within a workspace and precision budget. The resulting engine encodes choices for a GPU family and optimization profiles. Reproducible deployment therefore versions the source model, builder version, flags, plugins, calibration data, and target compute capability.
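Versioning those inputs can be as simple as hashing them into one artifact key. The sketch below uses only the Python standard library; the field names and example values are hypothetical, and a real pipeline would also record them as readable provenance rather than only as a hash.

```python
import hashlib
import json

def engine_cache_key(model_sha256, builder_version, flags, plugin_versions,
                     calibration_sha256, compute_capability):
    """Derive a deterministic key from everything that shapes the engine."""
    payload = json.dumps(
        {
            "model": model_sha256,
            "builder": builder_version,
            "flags": sorted(flags),  # flag order must not change the key
            "plugins": plugin_versions,
            "calibration": calibration_sha256,
            "compute_capability": compute_capability,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical values for illustration only.
base = dict(
    model_sha256="aaaa",
    builder_version="builder-x.y",
    plugin_versions={"rotary_plugin": "1.2.0"},
    calibration_sha256="bbbb",
)
key_a = engine_cache_key(flags=["fp16", "sparse"], compute_capability="8.0", **base)
key_b = engine_cache_key(flags=["sparse", "fp16"], compute_capability="8.0", **base)
key_c = engine_cache_key(flags=["fp16", "sparse"], compute_capability="9.0", **base)
```

A change to any input produces a new key, so a cached engine is never silently reused on a different GPU class or with different plugins.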
Implementation sketch
export_model_to_onnx(opset, dynamic_axes)
run_onnx_checker_and_shape_inference()
optimize_basic_and_extended_graph()
partition_by_execution_provider()
for profile in production_shape_profiles:
    build_tensorrt_engine(profile, precision_constraints)
inspect_fusions_and_fallbacks()
validate_outputs(reference_runtime, optimized_engine)
publish_versioned_engine_artifact()
Capacity planning
Builder workspace can be much larger than runtime workspace and should not be confused with steady inference memory. Each shape profile may add tactics or buffers. Limit profiles to useful ranges and build separate engines when workloads are materially different.
Benchmarking without fooling yourself
- Compare cold build, warm load, and steady request latency.
- Dump layer information and count fused versus fallback regions.
- Test every optimization profile boundary and typical shape.
- Use tolerance plus task-level validation for lower precision engines.
A production failure to design for
An ONNX export upgrades the opset and now represents rotary embedding through a pattern the optimized path does not support. The runtime partitions the graph around the resulting plugin gap, adding device copies. The build succeeds, but the deployed model is slower. Gate releases on partition and fusion diffs, not on build success.
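A gate on partition and fusion diffs could be sketched as follows. The report format is a hypothetical dictionary that a pipeline would have to produce from its own layer and partition dumps; nothing here is an ONNX Runtime or TensorRT API.

```python
def fusion_diff(previous, current):
    """Compare two build reports of the form
    {"fusions": [...], "fallbacks": [...], "partitions": int}."""
    return {
        "lost_fusions": sorted(set(previous["fusions"]) - set(current["fusions"])),
        "new_fallbacks": sorted(set(current["fallbacks"]) - set(previous["fallbacks"])),
        "extra_partitions": current["partitions"] - previous["partitions"],
    }

def should_block(diff):
    """Block the release if coverage got worse in any dimension."""
    return bool(
        diff["lost_fusions"] or diff["new_fallbacks"] or diff["extra_partitions"] > 0
    )

# Hypothetical reports before and after the opset upgrade.
previous_report = {"fusions": ["attention", "mlp_gelu"], "fallbacks": [], "partitions": 1}
current_report = {"fusions": ["mlp_gelu"], "fallbacks": ["rotary_embedding"], "partitions": 3}

diff = fusion_diff(previous_report, current_report)
blocked = should_block(diff)
```

Both builds succeed, so only the diff reveals the lost attention fusion, the new fallback node, and the two extra partitions.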
How to read this diagram: The operating cycle moves from Export to Build, then Inspect and Publish. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
Deeper engineering guide
Graph optimization converts framework-level operations into an execution plan for known shapes, dtypes, and hardware. Constant folding removes fixed work, algebraic simplification collapses redundant paths, fusion reduces launches and intermediate tensors, and memory planning reuses buffers whose lifetimes do not overlap.
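Buffer reuse by lifetime is the least visible of these steps, so a small sketch helps. It assumes every tensor has the same size and that lifetimes are given as inclusive (first use, last use) step indices; real planners also account for sizes and alignment. The `plan_buffers` helper is hypothetical.

```python
def plan_buffers(lifetimes):
    """Assign each tensor a buffer id, reusing a buffer once its
    previous tenant's lifetime has ended.

    `lifetimes` maps tensor name -> (first_step, last_step), inclusive.
    """
    busy_until = []  # busy_until[b] = last step at which buffer b is in use
    plan = {}
    for tensor, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for buffer_id, last_step in enumerate(busy_until):
            if last_step < start:  # lifetimes do not overlap
                busy_until[buffer_id] = end
                plan[tensor] = buffer_id
                break
        else:
            # Every existing buffer is still live, so allocate a new one.
            busy_until.append(end)
            plan[tensor] = len(busy_until) - 1
    return plan

# A chain where each intermediate is read only by the next step.
chain_lifetimes = {"a": (0, 1), "b": (1, 2), "c": (2, 3), "d": (3, 4)}
plan = plan_buffers(chain_lifetimes)
buffers_needed = len(set(plan.values()))
```

Four intermediates fit in two buffers here because each tensor dies as soon as its single consumer has run.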
How to read this diagram: Follow the state from Export through Rewrite and Compile to Validate. Each box is an ownership or computation boundary. In particular, the engine is valid only for its model, profiles, plugins, and hardware target. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.
Dynamic shapes are bounded contracts, not unlimited flexibility. TensorRT profiles define minimum, optimum, and maximum shapes; tactics selected near the optimum may perform poorly at extremes. Multiple profiles can preserve performance but increase build time and engine size. Route requests to an explicit profile and observe fallback or reformat layers.
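Explicit routing can be sketched as a lookup that refuses shapes outside every profile instead of letting them fall through. The `Profile` fields, the two example profiles, and the choice to pick the profile whose optimum is closest to the request are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str
    min_len: int
    opt_len: int
    max_len: int

def select_profile(profiles, seq_len):
    """Pick the covering profile whose optimum is closest to the request."""
    candidates = [p for p in profiles if p.min_len <= seq_len <= p.max_len]
    if not candidates:
        raise ValueError(f"sequence length {seq_len} is outside every profile")
    return min(candidates, key=lambda p: abs(p.opt_len - seq_len))

profiles = [
    Profile("short", min_len=1, opt_len=128, max_len=512),
    Profile("long", min_len=256, opt_len=2048, max_len=4096),
]
near_short = select_profile(profiles, 300)   # covered by both profiles
only_long = select_profile(profiles, 1000)   # covered only by "long"
```

Raising on an uncovered shape turns a silent slow path into an explicit error that can be counted and alerted on.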
How to read this diagram: The bars compare Unfused graph with Fused engine on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain ("less launch overhead and intermediate memory traffic") remains larger than the risk ("a shape outside profiles may fail or take a fallback path") under production traffic.
Custom plugins extend unsupported operators but become part of the trusted runtime. Version their ABI, numerical contract, workspace requirements, and supported formats. A plugin that silently accepts an unsupported dtype can produce plausible but incorrect output.
How to read this diagram: State advances from Exported to Optimized, Compiled, and finally Certified. The labels below each state identify what becomes true at that boundary. The governing invariant is: Any dependency, GPU, profile, or plugin change requires compatibility review. Retries and cancellation must preserve the same transition rules.
How to read this diagram: The four panels are independent review axes: Semantic, Coverage, Performance, and Operations. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: A successful engine build is the start of validation, not the end.
How to read this diagram: This is a causal chain, not four unrelated symptoms. "Shape arrives" triggers "Runtime adapts", which leads to "SLO breaks". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
The takeaway
Graph optimization converts a correct model into a hardware plan. The gain comes from global visibility, but so does the responsibility to validate shapes, precision, and portability.
