Graph Optimization: Teaching ONNX and TensorRT to See the Whole Model

#onnx #tensorrt #graph-optimization #kernel-fusion #inference

A framework executes the model you wrote. An optimizing runtime tries to execute the model you meant. By seeing the graph as a whole, it can erase redundant work, fold constants, fuse operators, select layouts, and build kernels that no individual layer requested.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Plan a day of travel one trip at a time and you will miss that three errands share the same route. Graph optimization lays the whole map on the table and combines compatible journeys.

What actually happens

Basic rewrites remove identities, unused nodes, redundant casts, and constant subgraphs. Extended passes recognize patterns such as MatMul plus bias plus GELU, attention blocks, or skip-layer normalization and replace them with fused implementations.
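To make the basic rewrites concrete, here is a minimal sketch of identity removal, constant folding, and dead-node elimination on a toy graph. The `Node` class, the op names, and the `optimize` function are all hypothetical and exist only for this illustration; real runtimes apply the same ideas to full ONNX graphs with many more rules.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    op: str              # "Input", "Const", "Identity", "Add", or "Mul"
    inputs: tuple = ()   # names of producer nodes
    value: float = 0.0   # only meaningful when op == "Const"


FOLDABLE = {"Add": lambda a, b: a + b, "Mul": lambda a, b: a * b}


def optimize(nodes, outputs):
    """Apply three basic rewrites to a topologically ordered node list."""
    alias = {}    # removed Identity name -> the name it forwarded
    result = {}   # surviving nodes, keyed by name
    for node in nodes:
        ins = tuple(alias.get(i, i) for i in node.inputs)
        if node.op == "Identity":
            # Identity removal: consumers read the producer directly.
            alias[node.name] = ins[0]
            continue
        if node.op in FOLDABLE and all(result[i].op == "Const" for i in ins):
            # Constant folding: every input is known at build time.
            folded = FOLDABLE[node.op](*(result[i].value for i in ins))
            result[node.name] = Node(node.name, "Const", (), folded)
            continue
        result[node.name] = Node(node.name, node.op, ins, node.value)

    # Dead-node elimination: keep only what the outputs can reach.
    live, stack = set(), [alias.get(o, o) for o in outputs]
    while stack:
        name = stack.pop()
        if name not in live:
            live.add(name)
            stack.extend(result[name].inputs)
    return [n for n in result.values() if n.name in live]


graph = [
    Node("x", "Input"),
    Node("two", "Const", value=2.0),
    Node("three", "Const", value=3.0),
    Node("six", "Mul", ("two", "three")),   # constant subgraph
    Node("x_id", "Identity", ("x",)),       # no-op
    Node("unused", "Add", ("x", "two")),    # never consumed
    Node("y", "Mul", ("x_id", "six")),
]
optimized = optimize(graph, outputs=["y"])
print([(n.name, n.op) for n in optimized])
# -> [('x', 'Input'), ('six', 'Const'), ('y', 'Mul')]
```

Seven nodes shrink to three without changing what `y` computes, which is the property every rewrite has to preserve.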

ONNX Runtime can optimize online, each time a session starts, or offline, serializing the optimized model so later sessions skip that work. TensorRT builds an engine for a target GPU, shape range, and set of precision constraints, timing candidate tactics before selecting implementations.
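As a rough sketch of the offline path in ONNX Runtime, the functions below optimize a model once, save the rewritten graph, and later load it with optimization turned off. The function names and file paths are placeholders, the CPU provider is chosen only to keep the example small, and `onnxruntime` is imported lazily so the module loads even where it is not installed.

```python
def optimize_offline(source_path, optimized_path):
    """Run ONNX Runtime graph optimizations once and save the rewritten model."""
    import onnxruntime as ort  # lazy import: only needed when actually called

    options = ort.SessionOptions()
    # ORT_ENABLE_EXTENDED adds fusion passes on top of the basic rewrites.
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
    # Creating the session triggers optimization and writes the result here.
    options.optimized_model_filepath = optimized_path
    ort.InferenceSession(source_path, options, providers=["CPUExecutionProvider"])
    return optimized_path


def load_preoptimized(optimized_path):
    """Load an already optimized model without repeating the passes."""
    import onnxruntime as ort

    options = ort.SessionOptions()
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
    return ort.InferenceSession(
        optimized_path, options, providers=["CPUExecutionProvider"]
    )
```

Extended fusions can be specific to an execution provider, so a model optimized offline should be built with the same providers and options it will later run with.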

Graph partitioning matters when multiple execution providers are present. Unsupported operators create boundaries, device copies, and lost fusion opportunities. A graph that is mostly on GPU can still be slow if a small CPU island forces synchronization.
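To see what a small CPU island costs, here is a minimal sketch that counts partitions and device transitions for a linear chain of nodes. The node names and provider labels are made up for illustration, and real partitioners work on DAGs rather than simple chains.

```python
def count_partitions(assignments):
    """Group consecutive nodes by provider.

    assignments: list of (node_name, provider) in execution order.
    Returns (partitions, transitions), where each partition is
    (provider, [node names]) and a transition is a provider change.
    """
    partitions = []
    for name, provider in assignments:
        if partitions and partitions[-1][0] == provider:
            partitions[-1][1].append(name)
        else:
            partitions.append((provider, [name]))
    transitions = max(len(partitions) - 1, 0)
    return partitions, transitions


assignments = [
    ("MatMul_0", "CUDA"),
    ("Add_0", "CUDA"),
    ("CustomRotary", "CPU"),   # unsupported on the GPU provider in this example
    ("MatMul_1", "CUDA"),
    ("Softmax_0", "CUDA"),
]
partitions, transitions = count_partitions(assignments)
print(len(partitions), transitions)  # -> 3 2
```

One unsupported node out of five splits the graph into three partitions and adds two device transitions, each of which is also a boundary that fusion cannot cross.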

A worked example

Suppose a transformer MLP appears as MatMul, Add, GELU, MatMul, Add. Executed node by node, that is five kernel launches and four intermediate tensors written to HBM and read back. A supported fusion can combine bias and activation work into fewer kernels, reducing launch overhead and HBM round trips while preserving semantics.
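The arithmetic behind this example can be sketched in a few lines. The dimensions, the FP16 element size, and the specific fused plan (bias and GELU folded into the first MatMul, bias folded into the second) are assumptions chosen for illustration; whether a given runtime produces exactly this plan depends on the hardware and version.

```python
T, H, F = 128, 1024, 4096   # tokens, hidden size, feed-forward size (assumed)
BYTES_PER_ELEM = 2          # FP16

# Each entry is (kernel name, number of output elements).
unfused = [
    ("MatMul", T * F),
    ("Add", T * F),
    ("GELU", T * F),
    ("MatMul", T * H),
    ("Add", T * H),
]
fused = [
    ("MatMul+Add+GELU", T * F),
    ("MatMul+Add", T * H),
]


def summarize(kernels):
    """Return (launches, intermediate bytes moved) for a kernel sequence."""
    launches = len(kernels)
    # Every output except the last is an intermediate: written once, read once.
    intermediate_elems = sum(elems for _, elems in kernels[:-1])
    return launches, intermediate_elems * BYTES_PER_ELEM * 2


print(summarize(unfused))  # -> (5, 6815744)
print(summarize(fused))    # -> (2, 2097152)
```

Under these assumptions the fused plan makes three fewer launches and moves about a third of the intermediate bytes. The MatMul arithmetic itself is unchanged.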

The performance model

A useful model: optimized latency ≈ kernel work + launch overhead + memory traffic + device transitions. Graph rewrites attack all four terms, but engine build time and hardware specificity shift cost into the build and deployment pipeline.
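The model can be written down as a small function. This is a back-of-the-envelope sketch: the per-launch cost, bandwidth, and transition cost are illustrative assumptions rather than measurements, and a purely additive model ignores overlap between compute and memory traffic.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    kernel_work_us: float     # time spent doing arithmetic
    launches: int             # kernel launches per request
    bytes_moved: float        # intermediate traffic to and from device memory
    device_transitions: int   # CPU-GPU hand-offs per request


def estimate_latency_us(plan, launch_us=5.0, bandwidth_gb_per_s=1000.0,
                        transition_us=50.0):
    """Additive latency estimate in microseconds (all constants are assumed)."""
    memory_us = plan.bytes_moved / (bandwidth_gb_per_s * 1e9) * 1e6
    return (plan.kernel_work_us
            + plan.launches * launch_us
            + memory_us
            + plan.device_transitions * transition_us)


before = Plan(kernel_work_us=100.0, launches=5, bytes_moved=8e6, device_transitions=1)
after = Plan(kernel_work_us=100.0, launches=2, bytes_moved=2e6, device_transitions=0)
print(round(estimate_latency_us(before), 1), round(estimate_latency_us(after), 1))
```

With these made-up inputs the estimate drops from roughly 183 to 112 microseconds even though kernel work is unchanged, which is why counting launches, bytes, and transitions is worth doing before touching the kernels themselves.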

Expert lens

Dynamic shapes widen the range of inputs one engine can serve but can prevent specialization, because each kernel must stay valid across the whole range. Production engines often define optimization profiles with minimum, optimal (typical), and maximum shapes. Profiles should reflect real prompt and batch distributions, not arbitrary extremes.
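One way to derive a profile from real traffic is to take percentiles of observed shapes. The sketch below is a hypothetical helper, not part of any library: it uses the median as the optimal shape and a high percentile as the maximum, and it treats batch and sequence length independently for simplicity.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def derive_profile(observed, max_q=99.0):
    """observed: list of (batch, seq_len) pairs seen in production.

    Returns (min_shape, opt_shape, max_shape).
    """
    batches = sorted(b for b, _ in observed)
    seqs = sorted(s for _, s in observed)
    min_shape = (batches[0], seqs[0])
    opt_shape = (percentile(batches, 50), percentile(seqs, 50))
    max_shape = (percentile(batches, max_q), percentile(seqs, max_q))
    return min_shape, opt_shape, max_shape


observed = [(1, 32), (1, 64), (2, 128), (4, 128),
            (4, 256), (8, 512), (8, 512), (16, 1024)]
print(derive_profile(observed))
# -> ((1, 32), (4, 128), (16, 1024))
```

In TensorRT the three shapes would be passed to `IOptimizationProfile.set_shape` for each dynamic input. Requests that exceed the chosen maximum need an explicit policy, such as rejection, truncation, or routing to a separate engine.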

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

Stable production models and shape ranges
Graphs with fusible transformer patterns
Deployments that can cache target-specific engines

Where it disappoints

Exporting unsupported custom operators
Using one enormous dynamic-shape profile
Assuming optimization preserved quality without checks
Rebuilding engines on every startup

Production checklist

Inspect unsupported nodes and device partitions
Define realistic shape profiles
Serialize optimized artifacts when appropriate
Compare outputs within numerical tolerances
Profile layer and fusion reports

What to measure

Node count before and after optimization
Kernel launches per request
CPU-GPU transfer count
Engine build and load time
Latency by optimization profile

From one GPU to a production service

A developer can rebuild an engine whenever the process starts. A production fleet should build once in a controlled pipeline, sign the artifact, scan plugins, attach provenance, and distribute an immutable engine compatible with the target GPU class.

Model validation must compare both semantics and performance. An output can be numerically correct while a lost fusion doubles latency. Conversely, a fast engine can violate task quality through an unsafe precision or approximation choice.
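For the numerical half of that comparison, a minimal sketch of a tolerance check follows. The function name and default tolerances are assumptions to tune per model and precision, and the check says nothing about task quality or latency, which need their own tests.

```python
def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-5):
    """Compare two flat sequences of floats.

    Returns (passed, max_abs_err, worst_index). A NaN in either input
    fails the check instead of slipping through the comparison.
    """
    if len(reference) != len(candidate):
        raise ValueError("output lengths differ")
    passed, max_err, worst = True, 0.0, -1
    for i, (r, c) in enumerate(zip(reference, candidate)):
        err = abs(r - c)
        # Written as "not <=" so that a NaN error counts as a failure.
        if not (err <= atol + rtol * abs(r)):
            passed = False
        if err > max_err:
            max_err, worst = err, i
    return passed, max_err, worst


reference = [1.0, -2.0, 0.0]
print(compare_outputs(reference, [1.0005, -2.001, 0.000001])[0])  # -> True
print(compare_outputs(reference, [1.0, -2.1, 0.0])[0])            # -> False
```

Reporting the worst index alongside the pass flag makes it easier to trace a failure back to the layer or precision choice that caused it.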

Plan for rollback at artifact level. Keep the previous engine loadable, route canary traffic to the new build, and compare layer reports plus request traces before broad promotion.

Design-review questions

Is the engine built for the exact runtime and GPU class?
Which graph partitions or fusions changed from the prior version?
Are optimization profiles derived from production shapes?
Can plugins be reproduced and security-reviewed?
What automatic signal triggers rollback?

How it connects to the rest of the series

Mixed precision and quantized kernels give the optimizer more tactic choices. FlashAttention may appear as a specialized fused operator. Dynamic batching determines the shapes the engine sees.

From equation to implementation

Graph optimization is constrained by semantics, shapes, and side effects. Constant folding is safe only for values known at build time. Fusion requires compatible layouts, precision, and consumers. A single extra consumer of an intermediate can prevent an otherwise obvious fusion.
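The single-consumer rule is easy to demonstrate on a toy graph. The sketch below uses a hypothetical dictionary representation and a simplified matcher; production fusers check layouts and precision as well, and some can still fuse by emitting the intermediate as an extra output.

```python
from collections import defaultdict


def find_fusible_chains(graph, pattern=("MatMul", "Add", "Gelu")):
    """graph: dict of name -> (op, [input names]). Returns matching chains."""
    consumers = defaultdict(list)
    for name, (_, inputs) in graph.items():
        for src in inputs:
            consumers[src].append(name)

    chains = []
    for name, (op, _) in graph.items():
        if op != pattern[0]:
            continue
        chain = [name]
        for next_op in pattern[1:]:
            users = consumers[chain[-1]]
            # The intermediate must feed exactly one node of the expected
            # type; otherwise it still has to be materialized for the others.
            if len(users) != 1 or graph[users[0]][0] != next_op:
                break
            chain.append(users[0])
        else:
            chains.append(tuple(chain))
    return chains


clean = {
    "x": ("Input", []), "w": ("Const", []), "b": ("Const", []),
    "mm": ("MatMul", ["x", "w"]),
    "add": ("Add", ["mm", "b"]),
    "act": ("Gelu", ["add"]),
}
# Same graph, but a second node also reads the Add output.
tapped = dict(clean, debug=("Identity", ["add"]))

print(find_fusible_chains(clean))   # -> [('mm', 'add', 'act')]
print(find_fusible_chains(tapped))  # -> []
```

A single debugging tap on the intermediate is enough to make the matcher give up, which is a common reason an export that looks unchanged loses a fusion.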

TensorRT tactic selection benchmarks candidate kernels within a workspace and precision budget. The resulting engine encodes choices for a GPU family and optimization profiles. Reproducible deployment therefore versions the source model, builder version, flags, plugins, calibration data, and target compute capability.
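One way to enforce that versioning is to derive a cache key from every input that influences the build. The sketch below is a hypothetical helper: the field names and example values are assumptions, and a real pipeline would hash the actual model and calibration files.

```python
import hashlib
import json


def engine_fingerprint(*, model_sha256, builder_version, flags, plugins,
                       calibration_sha256, compute_capability, profiles):
    """Return a stable SHA-256 key over everything that shapes the engine."""
    manifest = {
        "model_sha256": model_sha256,
        "builder_version": builder_version,
        "flags": sorted(flags),                     # order must not change the key
        "plugins": dict(sorted(plugins.items())),   # plugin name -> version
        "calibration_sha256": calibration_sha256,
        "compute_capability": compute_capability,
        "profiles": [list(map(list, p)) for p in profiles],  # (min, opt, max)
    }
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


common = dict(
    model_sha256="0" * 64,            # placeholder digest
    builder_version="10.0.1",         # hypothetical version string
    plugins={"rotary_plugin": "1.2"},
    calibration_sha256="1" * 64,      # placeholder digest
    profiles=[((1, 32), (4, 128), (16, 1024))],
)
key_a = engine_fingerprint(flags=["fp16", "sparse"], compute_capability="8.6", **common)
key_b = engine_fingerprint(flags=["sparse", "fp16"], compute_capability="8.6", **common)
key_c = engine_fingerprint(flags=["fp16", "sparse"], compute_capability="9.0", **common)
print(key_a == key_b, key_a == key_c)  # -> True False
```

Using the key as the artifact name means an engine built for one compute capability or flag set cannot be loaded by accident where another is expected.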

Implementation sketch

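# Pseudocode: each function is a placeholder for a pipeline stage.
model = export_model_to_onnx(opset, dynamic_axes)
run_onnx_checker_and_shape_inference(model)
model = optimize_basic_and_extended_graph(model)
partitions = partition_by_execution_provider(model)
engines = []
for profile in production_shape_profiles:
    engine = build_tensorrt_engine(model, profile, precision_constraints)
    inspect_fusions_and_fallbacks(engine, partitions)
    engines.append(engine)
# Validate every engine against the reference before anything is published.
validate_outputs(reference_runtime, engines)
publish_versioned_engine_artifact(engines)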

Capacity planning

Builder workspace can be much larger than runtime workspace and should not be confused with steady inference memory. Each shape profile may add tactics or buffers. Limit profiles to useful ranges and build separate engines when workloads are materially different.

Benchmarking without fooling yourself

Compare cold build, warm load, and steady request latency.
Dump layer information and count fused versus fallback regions.
Test every optimization profile boundary and typical shape.
Use tolerance plus task-level validation for lower precision engines.
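A small timing harness for the steady-state part of that comparison might look like the sketch below. It is an assumption-laden illustration: `measure` is a hypothetical helper, and for GPU inference the callable must block until the device has finished, otherwise the numbers only capture launch time.

```python
import math
import time


def measure(fn, warmup=10, iters=100):
    """Time a blocking callable and return latency percentiles in milliseconds."""
    for _ in range(warmup):
        fn()   # warm caches, lazy initialization, and clocks before timing

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()

    def pick(q):
        # Nearest-rank percentile over the sorted samples.
        return samples[max(0, math.ceil(q / 100 * len(samples)) - 1)]

    return {"p50_ms": pick(50), "p99_ms": pick(99), "max_ms": samples[-1]}


# Stand-in workload; replace with one synchronous inference call per shape.
stats = measure(lambda: sum(range(10_000)), warmup=5, iters=50)
print(sorted(stats))  # -> ['max_ms', 'p50_ms', 'p99_ms']
```

Running the harness once per profile boundary and once at the typical shape gives a table that can be compared across engine builds. Cold build and warm load times should be recorded separately because they happen outside this loop.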

A production failure to design for

An ONNX export upgrades the opset and now represents rotary embedding through a pattern the TensorRT path does not support. The runtime partitions the graph around the plugin gap and falls back to another provider for those nodes, adding device copies. The engine builds successfully but the service slows down. Gate releases on partition and fusion diffs, not on build success.
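A release gate of this kind can be sketched as a diff between two build reports. The report format here is hypothetical, a plain dictionary that a pipeline would fill from whatever layer and partition information its runtime exposes.

```python
def diff_reports(baseline, candidate):
    """Compare two build reports.

    Each report is a dict with 'partitions' (int), 'fused' (set of fused
    region names), and 'fallback_ops' (set of ops running off the engine).
    """
    return {
        "extra_partitions": candidate["partitions"] - baseline["partitions"],
        "lost_fusions": sorted(baseline["fused"] - candidate["fused"]),
        "new_fallbacks": sorted(candidate["fallback_ops"] - baseline["fallback_ops"]),
    }


def release_gate(baseline, candidate):
    """Return (ok, diff); ok is False if the candidate regressed structurally."""
    diff = diff_reports(baseline, candidate)
    ok = (diff["extra_partitions"] <= 0
          and not diff["lost_fusions"]
          and not diff["new_fallbacks"])
    return ok, diff


baseline = {"partitions": 1, "fused": {"mlp_0", "attn_0"}, "fallback_ops": set()}
candidate = {"partitions": 3, "fused": {"mlp_0"}, "fallback_ops": {"RotaryPattern"}}

ok, diff = release_gate(baseline, candidate)
print(ok, diff["extra_partitions"], diff["lost_fusions"], diff["new_fallbacks"])
# -> False 2 ['attn_0'] ['RotaryPattern']
```

The candidate here would have built without errors, and the gate still blocks it because the structure of the plan changed.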

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Graph optimization converts a correct model into a hardware plan. The gain comes from global visibility, but so does the responsibility to validate shapes, precision, and portability.
