Graph Optimization: Teaching ONNX and TensorRT to See the Whole Model
A framework executes the model you wrote. An optimizing runtime tries to execute the model you meant. By seeing the graph as a whole, it can erase redundant work, fold constants, fuse operators, select layouts, and build kernels that no individual layer requested.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
Plan a day of errands one trip at a time and you never notice that three of them share the same route. Graph optimization lays the whole map on the table and combines compatible journeys.
What actually happens
Basic rewrites remove identities, unused nodes, redundant casts, and constant subgraphs. Extended passes recognize patterns such as MatMul plus bias plus GELU, attention blocks, or skip-layer normalization and replace them with fused implementations.
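The basic rewrites can be sketched on a toy graph. The block below is a minimal illustration, not ONNX Runtime's implementation: the node dictionaries, the `FOLDABLE` table, and the `optimize` function are hypothetical names chosen for this example, and the graph is assumed to be a topologically ordered list in which every node writes one named tensor.

```python
import operator

# Toy graph format (hypothetical, not the ONNX protobuf schema):
# nodes are listed in topological order and each writes one named tensor.
FOLDABLE = {"Add": operator.add, "Mul": operator.mul}


def optimize(nodes, graph_outputs):
    consts = {}  # tensor name -> value known at build time
    alias = {}   # output of a removed Identity -> the tensor it copies
    kept = []
    for node in nodes:
        inputs = [alias.get(name, name) for name in node["inputs"]]
        op, out = node["op"], node["output"]
        if op == "Constant":
            consts[out] = node["value"]
            kept.append(node)
        elif op == "Identity":
            alias[out] = inputs[0]  # drop the node and rewire its consumers
        elif op in FOLDABLE and all(name in consts for name in inputs):
            value = FOLDABLE[op](*(consts[name] for name in inputs))
            consts[out] = value     # constant folding
            kept.append({"op": "Constant", "inputs": [], "output": out, "value": value})
        else:
            kept.append({**node, "inputs": inputs})
    # Dead-node elimination: keep only what the graph outputs depend on.
    live = {alias.get(name, name) for name in graph_outputs}
    result = []
    for node in reversed(kept):
        if node["output"] in live:
            live.update(node["inputs"])
            result.append(node)
    return result[::-1]


graph = [
    {"op": "Constant", "inputs": [], "output": "two", "value": 2.0},
    {"op": "Constant", "inputs": [], "output": "three", "value": 3.0},
    {"op": "Mul", "inputs": ["two", "three"], "output": "scale"},
    {"op": "Identity", "inputs": ["x"], "output": "x_copy"},
    {"op": "Mul", "inputs": ["x_copy", "scale"], "output": "y"},
    {"op": "Relu", "inputs": ["x"], "output": "unused"},
]
optimized = optimize(graph, ["y"])
print(len(graph), "->", len(optimized), "nodes")
```

Six nodes shrink to two: a folded constant and the one multiplication that depends on the runtime input. Neither rewrite is visible from inside any single node, which is the point of whole-graph visibility.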
ONNX Runtime can optimize online, each time a session starts, or offline, by serializing the optimized model once and loading it later. TensorRT builds an engine for target hardware, shapes, and precision constraints, timing candidate tactics before selecting implementations.
Graph partitioning matters when multiple execution providers are present. Unsupported operators create boundaries, device copies, and lost fusion opportunities. A graph that is mostly on GPU can still be slow if a small CPU island forces synchronization.
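To make the cost of a CPU island concrete, here is a small sketch that groups a node sequence into contiguous runs per execution provider and counts the boundaries between them. It assumes a linear chain for simplicity (real partitioners work on a DAG), and the node names and the `partition` helper are hypothetical.

```python
def partition(assignments):
    """Group a topologically ordered list of (node, provider) pairs
    into contiguous runs that stay on the same provider."""
    parts = []
    for name, provider in assignments:
        if parts and parts[-1][0] == provider:
            parts[-1][1].append(name)
        else:
            parts.append((provider, [name]))
    return parts


# One unsupported operator in the middle of an otherwise GPU-only chain.
plan = [
    ("MatMul_0", "GPU"),
    ("Add_0", "GPU"),
    ("CustomOp", "CPU"),
    ("MatMul_1", "GPU"),
    ("Add_1", "GPU"),
]
parts = partition(plan)
transitions = len(parts) - 1  # each boundary implies a copy and a sync point

for provider, names in parts:
    print(provider, names)
print("device transitions:", transitions)
```

A single CPU node turns one partition into three and adds two transitions, and no fusion can span either boundary.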
A worked example
Suppose a transformer MLP appears as MatMul, Add, GELU, MatMul, Add. Run unfused, that is five kernel launches, each writing its result to memory for the next one to read back. A supported fusion can fold the bias and activation work into fewer kernels, reducing launch overhead and HBM round trips while preserving semantics.
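The same example can be written as a greedy pattern match over the operator sequence. This is a sketch under two assumptions: each unfused operator costs one kernel launch, and the runtime offers the two fused implementations listed in `PATTERNS`. The fused operator names are made up for illustration and do not refer to a specific runtime's kernels.

```python
# Longest pattern first, so MatMul+Add+Gelu wins over MatMul+Add.
PATTERNS = [
    (("MatMul", "Add", "Gelu"), "MatMulBiasGelu"),
    (("MatMul", "Add"), "MatMulBias"),
]


def fuse(ops):
    fused, i = [], 0
    while i < len(ops):
        for pattern, name in PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append(name)
                i += len(pattern)
                break
        else:
            # No pattern starts here: keep the operator as-is.
            fused.append(ops[i])
            i += 1
    return fused


mlp = ["MatMul", "Add", "Gelu", "MatMul", "Add"]
fused_mlp = fuse(mlp)
print(fused_mlp)
print("launches:", len(mlp), "->", len(fused_mlp))
```

Five launches become two, and three of the four intermediate tensors no longer need to be written out and read back.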
The performance model
A useful first-order model: optimized latency ≈ kernel work + launch overhead + memory traffic + device transitions. Graph rewrites attack all four terms, but engine build time and hardware specificity shift cost from serving to deployment.
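The model is simple enough to write down directly. In the sketch below every constant (launch cost, bandwidth, transition cost) and every plan value is an illustrative placeholder, not a measurement, and the additive form ignores any overlap between compute and memory traffic.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    kernel_work_us: float    # time spent doing arithmetic
    launches: int            # kernel launches per request
    bytes_moved: float       # traffic to and from device memory
    device_transitions: int  # CPU<->GPU boundaries crossed


def latency_us(plan, launch_us=5.0, bandwidth_bytes_per_us=1e6, transition_us=50.0):
    # All default costs are assumptions chosen to keep the arithmetic readable.
    return (
        plan.kernel_work_us
        + plan.launches * launch_us
        + plan.bytes_moved / bandwidth_bytes_per_us
        + plan.device_transitions * transition_us
    )


unfused = Plan(kernel_work_us=100.0, launches=5, bytes_moved=40e6, device_transitions=2)
fused = Plan(kernel_work_us=100.0, launches=2, bytes_moved=16e6, device_transitions=0)

print("unfused:", latency_us(unfused), "us")
print("fused:  ", latency_us(fused), "us")
```

With these placeholder numbers the arithmetic is unchanged, yet estimated latency drops from 265 to 126 microseconds because the other three terms shrink.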
Expert lens
Dynamic shapes widen the range of inputs an engine must serve, and a tactic that has to work across a wide range can be slower than one specialized for a single shape. Production engines often define optimization profiles with minimum, typical, and maximum shapes. Profiles should reflect real prompt and batch distributions, not arbitrary extremes.
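One way to ground a profile in real traffic is to derive it from logged request shapes. The sketch below assumes you have per-request batch sizes and sequence lengths, takes the median as the typical shape, and caps the maximum at the 99th percentile. The percentile choices, the sample data, and the `derive_profile` helper are assumptions for illustration; requests above the cap would need a separate path.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def derive_profile(batch_sizes, seq_lengths):
    batches, seqs = sorted(batch_sizes), sorted(seq_lengths)
    return {
        "min": (batches[0], seqs[0]),
        "typical": (percentile(batches, 50), percentile(seqs, 50)),
        "max": (percentile(batches, 99), percentile(seqs, 99)),
    }


# Hypothetical traffic log: 100 requests with one extreme outlier each.
observed_batches = [1] * 60 + [4] * 30 + [8] * 9 + [64]
observed_seqs = [16] * 10 + [64] * 40 + [128] * 30 + [512] * 19 + [8192]

profile = derive_profile(observed_batches, observed_seqs)
print(profile)
```

The outliers (batch 64, length 8192) fall outside the profile, so the engine is not forced to cover a range that almost no request uses.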
Where it wins
- Stable production models and shape ranges
- Graphs with fusible transformer patterns
- Deployments that can cache target-specific engines
Where it disappoints
- Exporting unsupported custom operators
- Using one enormous dynamic-shape profile
- Assuming optimization preserved quality without checks
- Rebuilding engines on every startup
Production checklist
- Inspect unsupported nodes and device partitions
- Define realistic shape profiles
- Serialize optimized artifacts when appropriate
- Compare outputs within numerical tolerances
- Profile layer and fusion reports
What to measure
- Node count before and after optimization
- Kernel launches per request
- CPU-GPU transfer count
- Engine build and load time
- Latency by optimization profile
From one GPU to a production service
A developer can rebuild an engine whenever code starts. A production fleet should build once in a controlled pipeline, sign the artifact, scan plugins, attach provenance, and distribute an immutable engine compatible with the target GPU class.
Model validation must compare both semantics and performance. An output can be numerically correct while a lost fusion doubles latency. Conversely, a fast engine can violate task quality through an unsafe precision or approximation choice.
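A validation step that checks both sides might look like the sketch below. The tolerances, the 10% slowdown budget, and the sample numbers are assumptions; task-level quality checks would sit alongside this element-wise comparison rather than be replaced by it.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Element-wise comparison of two flat output lists within tolerances."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


def validate(reference, candidate, baseline_ms, candidate_ms, max_slowdown=1.10):
    return {
        "numerics_ok": outputs_match(reference, candidate),
        "latency_ok": candidate_ms <= baseline_ms * max_slowdown,
    }


# Hypothetical case: outputs agree, but a lost fusion doubled latency.
report = validate(
    reference=[0.1, 2.0, -3.5],
    candidate=[0.10004, 2.0005, -3.5002],
    baseline_ms=4.0,
    candidate_ms=8.1,
)
print(report)
```

Here the numerical check passes while the latency check fails, which is exactly the regression an output-only comparison would miss.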
Plan for rollback at artifact level. Keep the previous engine loadable, route canary traffic to the new build, and compare layer reports plus request traces before broad promotion.
Design-review questions
- Is the engine built for the exact runtime and GPU class?
- Which graph partitions or fusions changed from the prior version?
- Are optimization profiles derived from production shapes?
- Can plugins be reproduced and security-reviewed?
- What automatic signal triggers rollback?
How it connects to the rest of the series
Mixed precision and quantized kernels give the optimizer more tactic choices. FlashAttention may appear as a specialized fused operator. Dynamic batching determines the shapes the engine sees.
From equation to implementation
Graph optimization is constrained by semantics, shapes, and side effects. Constant folding is safe only for values known at build time. Fusion requires compatible layouts, precision, and consumers. A single extra consumer of an intermediate can prevent an otherwise obvious fusion.
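The extra-consumer rule can be expressed as a small legality check. This is a deliberately conservative sketch: it refuses to fuse whenever the intermediate has more than one reader, whereas a real optimizer might instead keep the intermediate as an additional output of the fused kernel. The node format and the `make_node` and `can_fuse` helpers are hypothetical.

```python
from collections import Counter


def make_node(op, inputs, output, dtype="fp16", layout="row_major"):
    return {"op": op, "inputs": inputs, "output": output,
            "dtype": dtype, "layout": layout}


def can_fuse(producer, consumer, nodes, graph_outputs=()):
    # Count every reader of every tensor, including graph outputs.
    uses = Counter(name for n in nodes for name in n["inputs"])
    uses.update(graph_outputs)
    out = producer["output"]
    return (
        out in consumer["inputs"]
        and uses[out] == 1  # the consumer must be the only reader
        and producer["dtype"] == consumer["dtype"]
        and producer["layout"] == consumer["layout"]
    )


matmul = make_node("MatMul", ["x", "w"], "h")
add = make_node("Add", ["h", "b"], "h_bias")
gelu = make_node("Gelu", ["h_bias"], "act")
chain = [matmul, add, gelu]

# A second reader of the intermediate, e.g. a debugging statistic.
probe = make_node("ReduceMean", ["h_bias"], "stats")

print(can_fuse(add, gelu, chain))            # single consumer
print(can_fuse(add, gelu, chain + [probe]))  # extra consumer blocks fusion
```

Adding one innocuous-looking reader flips the answer from True to False, which is why small export changes can silently remove a fusion.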
TensorRT tactic selection benchmarks candidate kernels within a workspace and precision budget. The resulting engine encodes choices for a GPU family and optimization profiles. Reproducible deployment therefore versions the source model, builder version, flags, plugins, calibration data, and target compute capability.
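One way to enforce that versioning is to derive the engine's cache key from a manifest of exactly those inputs, so that changing any of them yields a different artifact name. The manifest fields follow the list above; the field values, the 16-character truncation, and the `engine_key` helper are placeholders invented for this sketch.

```python
import hashlib
import json


def engine_key(manifest):
    """Stable short hash of everything that influences the built engine."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# All values below are hypothetical placeholders.
manifest = {
    "model_sha256": "placeholder-model-digest",
    "builder_version": "placeholder-builder-version",
    "flags": ["fp16"],
    "plugins": ["placeholder-plugin@1"],
    "calibration_sha256": "placeholder-calibration-digest",
    "compute_capability": "8.0",
}

key_a = engine_key(manifest)
key_b = engine_key({**manifest, "compute_capability": "9.0"})
print(key_a != key_b)  # a different target GPU class gets a different key
```

Serializing with sorted keys keeps the hash independent of dictionary ordering, so two pipelines that assemble the same manifest in a different order still agree on the artifact.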
Implementation sketch
export_model_to_onnx(opset, dynamic_axes)
run_onnx_checker_and_shape_inference()
optimize_basic_and_extended_graph()
partition_by_execution_provider()
for profile in production_shape_profiles:
    build_tensorrt_engine(profile, precision_constraints)
inspect_fusions_and_fallbacks()
validate_outputs(reference_runtime, optimized_engine)
publish_versioned_engine_artifact()
Capacity planning
Builder workspace can be much larger than runtime workspace and should not be confused with steady inference memory. Each shape profile may add tactics or buffers. Limit profiles to useful ranges and build separate engines when workloads are materially different.
Benchmarking without fooling yourself
- Compare cold build, warm load, and steady request latency.
- Dump layer information and count fused versus fallback regions.
- Test every optimization profile boundary and typical shape.
- Use tolerance plus task-level validation for lower precision engines.
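A harness that keeps load time and steady-state latency apart could start from the sketch below. The `load` and `run` callables are stand-ins for whatever loads and executes your engine; the warmup and iteration counts are arbitrary. Note the assumption in the comment: with an asynchronous GPU runtime, `run` must block until the device has finished, otherwise the timer only measures launch submission.

```python
import statistics
import time


def benchmark(load, run, warmup=3, iterations=20):
    start = time.perf_counter()
    engine = load()
    load_s = time.perf_counter() - start

    for _ in range(warmup):  # excluded from the reported numbers
        run(engine)

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run(engine)  # assumed to block until the result is ready
        samples.append(time.perf_counter() - start)

    return {
        "load_s": load_s,
        "p50_s": statistics.median(samples),
        "max_s": max(samples),
    }


# Stand-ins so the sketch runs without a GPU or an engine file.
calls = {"run": 0}


def fake_load():
    return {"name": "stand-in engine"}


def fake_run(engine):
    calls["run"] += 1
    return sum(range(1000))


timing = benchmark(fake_load, fake_run)
print(sorted(timing))
```

Cold build time would be measured the same way around the builder invocation, and reported separately from both numbers above.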
A production failure to design for
An ONNX export upgrades the opset and now represents rotary embedding through a pattern the runtime does not support. TensorRT partitions around the resulting plugin gap, adding device copies. The engine builds successfully but runs slower. Gate releases on partition and fusion diffs, not on build success.
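A gate of that kind can be a short comparison between two build reports. The report format here (a partition count plus a list of fusion names) is hypothetical; in practice it would be extracted from whatever layer or fusion report your build pipeline already emits.

```python
def gate(previous, candidate, max_new_partitions=0):
    """Return (ok, reasons) comparing a candidate build report to the last good one."""
    lost = sorted(set(previous["fusions"]) - set(candidate["fusions"]))
    extra = candidate["partitions"] - previous["partitions"]
    reasons = []
    if lost:
        reasons.append("lost fusions: " + ", ".join(lost))
    if extra > max_new_partitions:
        reasons.append(f"partitions grew by {extra}")
    return (not reasons, reasons)


# Hypothetical reports for the failure described above.
last_good = {
    "partitions": 1,
    "fusions": ["attention", "rotary_embedding", "mlp_bias_gelu"],
}
new_build = {
    "partitions": 3,
    "fusions": ["attention", "mlp_bias_gelu"],
}

ok, reasons = gate(last_good, new_build)
print(ok, reasons)
```

The build succeeded in both cases, but the gate blocks the release and names the lost fusion and the two new partitions.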
The takeaway
Graph optimization converts a correct model into a hardware plan. The gain comes from global visibility, but so does the responsibility to validate shapes, precision, and portability.
