1/21 - Inside the GPU: From SMs to HBM Without the Hand-Waving

A GPU is not a faster CPU. It is a throughput machine whose many streaming multiprocessors hide latency by keeping many warps eligible while a hierarchy of registers, shared memory, caches, and device DRAM feeds them.

The quickest way into this topic is to follow one unit of work and ask where its bytes, authority, and time go. We will build the intuition first, then keep going into capacity math, placement, failure behavior, and the measurements worth putting on an operator dashboard.

The mental model

CUDA’s programming model maps a grid to thread blocks, and each block to one SM for its lifetime. Threads execute in warps of 32, and a warp scheduler can issue only from warps whose next instruction has its operands ready. Occupancy is therefore a resource-capacity measure, not a direct performance score: more resident warps help hide latency until instruction dependencies, bandwidth, or cache behavior become the real limit.
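To make "occupancy is a capacity measure" concrete, here is a minimal sketch of the arithmetic. The per-SM limits in `SmLimits` are hypothetical placeholders, not values for any specific GPU, and the calculation ignores register and shared-memory allocation granularity, so treat the result as an upper bound to check against a profiler.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass(frozen=True)
class SmLimits:
    # Hypothetical per-SM limits; look up the real values for your architecture.
    registers: int = 65536
    max_warps: int = 64
    max_blocks: int = 32
    shared_bytes: int = 96 * 1024


def resident_blocks(threads_per_block, regs_per_thread, shared_per_block, sm=SmLimits()):
    """Blocks that fit on one SM at once; ignores allocation granularity."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    by_warps = sm.max_warps // warps_per_block
    by_regs = sm.registers // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_shared = sm.shared_bytes // shared_per_block if shared_per_block else sm.max_blocks
    return min(sm.max_blocks, by_warps, by_regs, by_shared)


def theoretical_occupancy(threads_per_block, regs_per_thread, shared_per_block, sm=SmLimits()):
    """Resident warps divided by the SM's warp limit."""
    blocks = resident_blocks(threads_per_block, regs_per_thread, shared_per_block, sm)
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return blocks * warps_per_block / sm.max_warps
```

Under these assumed limits, a 256-thread block at 32 registers per thread reaches full occupancy, while the same block at 64 registers per thread drops to half. Whether that drop costs any throughput is a separate question that only measurement answers.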

Figure 1/21 · System anatomy: The four ownership layers that make this part of the AI platform operable. Read from the external contract down to the mechanism that performs the work.

  • Host and interconnect: CPU process, pinned memory, PCIe or NVLink
  • Device memory: HBM/GDDR holds weights, activations, and KV state
  • GPU fabric: L2 cache and memory controllers serve all SMs
  • Streaming multiprocessor: warp schedulers, registers, shared memory, tensor cores

Engineering invariant: never optimize arithmetic while unmeasured data movement dominates.

Description: The diagram separates the user-visible contract from state placement, execution, and control. Read it top to bottom. A tuning change in a lower layer is safe only when the upper-layer contract remains true.

What actually happens

Registers are private to a thread and extremely fast, but a kernel that uses too many registers per thread reduces the number of blocks that can be resident on an SM. Shared memory is explicitly managed at block scope and is useful for tiled reuse; when threads of a warp hit different addresses in the same bank, the accesses serialize. L1 is per-SM, L2 is device-wide, and global memory is the large, high-latency device DRAM. Local memory, despite the name, is compiler-managed per-thread storage backed by device memory, most often used when registers spill.
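The bank-conflict point is easy to check on paper. The sketch below models shared memory as 32 banks of 4-byte words, which is the common NVIDIA layout but is stated here as an assumption; the function names are illustrative. It reports how many passes a warp's access needs in the worst bank.

```python
BANKS = 32       # assumed number of shared-memory banks
BANK_WIDTH = 4   # assumed bytes per bank word
WARP_SIZE = 32   # threads per warp


def conflict_degree(byte_addresses):
    """Largest number of distinct words that map to one bank for this warp.

    Threads reading the same word are a broadcast, not a conflict,
    so words are collected in a set per bank.
    """
    words_per_bank = {}
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank.setdefault(word % BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


def strided_access(stride_words):
    """Byte addresses touched when lane i reads the 4-byte word i * stride."""
    return [lane * stride_words * BANK_WIDTH for lane in range(WARP_SIZE)]
```

In this model a stride of 32 words puts every lane in bank 0 and serializes into 32 passes, while a stride of 33 spreads the lanes across all banks. That is why padding a 32-wide tile to 33 columns is a common fix for column-wise access.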

A field note

A profiler often reports a busy GPU while useful throughput remains disappointing. That is not a contradiction. Warps can be active while waiting on long dependency chains, replaying memory transactions, or executing work that padding later discards. Start with the roofline question (is the kernel limited by bytes or by math?), then use stall reasons and memory counters to explain why the kernel sits where it does.
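The roofline question reduces to comparing a kernel's arithmetic intensity with the device's ridge point. A minimal sketch follows; the function name and the peak figures used in the note afterwards are assumptions for illustration, not specifications of any device.

```python
def roofline(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Classify a kernel as memory- or compute-bound on a simple roofline."""
    intensity = flops / bytes_moved                 # FLOP per byte of DRAM traffic
    ridge = peak_flops_per_s / peak_bytes_per_s     # intensity where the two roofs meet
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    return {
        "intensity": intensity,
        "ridge": ridge,
        "attainable_flops_per_s": attainable,
        "bound": "memory" if intensity < ridge else "compute",
    }
```

With hypothetical peaks of 100 TFLOP/s and 2 TB/s the ridge sits at 50 FLOP per byte. An FP32 elementwise add performs 1 FLOP per 12 bytes moved, so it lands far on the memory-bound side, and no amount of arithmetic tuning changes that.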

Figure: The end-to-end critical path. A production request path with the work and evidence carried by each stage. Every arrow is latency, state transfer, or an authority boundary.

  • 1 · Allocate: reserve device pages; choose dtype; check headroom
  • 2 · Transfer: pin host buffers; DMA over link; overlap streams
  • 3 · Schedule: launch thread grid; place blocks on SMs; track dependencies
  • 4 · Execute: issue ready warps; tensor/core math; hide stalls
  • 5 · Retire: write global memory; signal event; return result

Critical-path accounting: VRAM ~= weights + KV cache + activations + runtime workspace + fragmentation reserve. Optimize measured exposed time; preserve identity, deadlines, and error semantics across every stage.

Description: Follow one unit of work from left to right. The lower panel is the accounting model. It is intentionally explicit because unmeasured queueing and data movement are the most common reasons that component benchmarks fail to predict production behavior.

The capacity equation

VRAM ~= weights + KV cache + activations + runtime workspace + fragmentation reserve

Treat this as a model to validate, not a constant to copy. Measure each term on the exact hardware, model revision, input distribution, and concurrency regime. Capacity planning should reserve failure headroom; running permanently at the cliff makes recovery impossible when a replica, link, or dependency disappears.
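One way to keep the equation honest is to hold each term as a measured number and check the sum against the device with headroom left over. This is a sketch only: the `VramLedger` type, the `fits` helper, and the 10% default headroom are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass

GB = 10 ** 9  # decimal gigabytes, matching the figures in the text


@dataclass(frozen=True)
class VramLedger:
    """One measured byte count per term of the capacity equation."""
    weights: int
    kv_cache: int
    activations: int
    workspace: int
    fragmentation_reserve: int

    def total(self):
        return (self.weights + self.kv_cache + self.activations
                + self.workspace + self.fragmentation_reserve)


def fits(ledger, device_bytes, failure_headroom=0.10):
    """True if the ledger leaves the requested fraction of the device free."""
    budget = device_bytes * (1.0 - failure_headroom)
    return ledger.total() <= budget
```

The headroom parameter is where the "do not run at the cliff" rule becomes a number that a deployment check can enforce.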

A worked production example

For a 7B-parameter model, BF16 weights alone require roughly 14 GB (7 billion parameters at 2 bytes each) before KV cache, activations, allocator reserve, and engine workspaces. An 80 GB accelerator therefore does not imply five independent 7B replicas: context length, concurrent sequences, cache dtype, and graph workspaces decide the actual packing. Build a byte ledger first, then validate it with allocator and device telemetry.
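The packing argument can be worked through numerically. The sketch below uses a hypothetical 7B shape (32 layers, 32 KV heads, head dimension 128, 2-byte cache entries) that does not come from the text; substitute the real architecture, and remember that activations and workspaces are left out here.

```python
GB = 10 ** 9


def weight_bytes(params, bytes_per_param=2):
    """Weight footprint; 2 bytes per parameter for BF16."""
    return params * bytes_per_param


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values: two tensors per layer for every cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def replicas_that_fit(device_bytes, per_replica_bytes, reserve_fraction=0.10):
    """Whole replicas that fit after holding back a reserve fraction."""
    usable = int(device_bytes * (1 - reserve_fraction))
    return usable // per_replica_bytes


# Hypothetical workload: 16 concurrent sequences of 4096 tokens each.
weights = weight_bytes(7_000_000_000)
kv = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128) * 16 * 4096
naive = replicas_that_fit(80 * GB, weights)          # weights only
realistic = replicas_that_fit(80 * GB, weights + kv)  # weights plus KV cache
```

Under these assumptions the weights-only count is five replicas, but adding about 34 GB of KV cache per replica leaves room for one. The specific numbers matter less than the habit of computing them before choosing a packing.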

Run the experiment in at least three regimes: one request for floor latency, a realistic concurrency distribution for normal operation, and controlled overload for backpressure and recovery. A system is not healthy merely because it eventually completes every request. Queue age, deadline misses, quality, and resource recovery all belong in the acceptance criteria.

Figure: Execution timeline and measurement points. Measure the transition between stages, not only the total duration. Throughput improvements are useful only when queueing, quality, and recovery remain bounded.

  • Host submit: CPU prepares descriptors; copies are asynchronous; events express order
  • Kernel launch: grid is admitted; blocks claim SM resources; warps become eligible
  • Steady state: math overlaps memory; warps hide latency; caches reduce HBM traffic
  • Completion: stores become visible; event is recorded; consumer may proceed

Measure at every boundary: SM active and achieved occupancy; HBM and L2 throughput; warp stall reasons and register spills; kernel launch and synchronization gaps.

Description: The timeline identifies where work waits and where it executes. Instrument both sides of every transition so queue time cannot be mistaken for compute time. Compare steady state with the warm-up and recovery periods rather than deleting them from the report.

Placement, topology, and scale

Logical architecture hides physical asymmetry. Two workers can have the same configuration while differing in accelerator generation, NUMA path, network hops, cache warmth, storage locality, or noisy-neighbor pressure. Placement must therefore be expressed as constraints and verified through telemetry.

Figure: Placement and failure-domain topology. Topology determines bandwidth, fault containment, and which state can be recovered locally. Logical parallelism must be mapped to physical capacity and independent recovery boundaries.

  • GPU 0: SM groups (primary work); local HBM (resident state); L2 slices (backpressure); copy engines (evidence)
  • GPU 1: SM groups (primary work); local HBM (resident state); L2 slices (backpressure); copy engines (evidence)
  • Inter-domain fabric: PCIe / NVLink / NVSwitch

Placement rule: Keep frequently reused data local; measure transfers before assuming compute is the bottleneck.

Description: The two domains are intentionally independent. Local queues contain transient pressure; durable identity lets work move; the fabric is treated as a finite resource. A cross-domain design should say what happens when the fabric is slow, partitioned, or only partially available.

Failure analysis

The triggering event is rarely the entire incident. Cascades occur when a local failure creates retries, retries create more load, and overloaded dependencies become less responsive. Bound attempts, preserve the original deadline, add jitter, and open circuits by route or failure domain rather than disabling an entire platform.
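The retry guidance above (bounded attempts, the original deadline preserved, jitter) can be sketched as follows. The names `call_with_retries` and `RetryBudgetExhausted` are illustrative, the clock, sleep, and random source are injectable only so the behavior can be tested, and a real system would retry only errors it knows to be retryable.

```python
import random
import time


class RetryBudgetExhausted(Exception):
    """Raised when attempts or the caller's deadline run out."""


def call_with_retries(op, deadline, max_attempts=3, base_delay=0.05, max_delay=1.0,
                      clock=time.monotonic, sleep=time.sleep, rng=random.random):
    """Retry op() without ever extending the caller's absolute deadline."""
    last_error = None
    for attempt in range(max_attempts):
        if clock() >= deadline:
            break                          # the original deadline is never reset
        try:
            return op()
        except Exception as exc:           # assumption: every error is retryable
            last_error = exc
        if attempt + 1 == max_attempts:
            break                          # attempt budget is spent
        # Full jitter: a random fraction of the capped exponential backoff.
        backoff = min(max_delay, base_delay * 2 ** attempt) * rng()
        remaining = deadline - clock()
        if remaining <= 0:
            break
        sleep(min(backoff, remaining))     # never sleep past the deadline
    raise RetryBudgetExhausted("no success within attempts and deadline") from last_error
```

Passing the deadline as an absolute time, rather than a per-attempt timeout, is what keeps a chain of retries from quietly multiplying the latency the caller agreed to.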

Figure: Failure propagation and containment. One initiating condition can become a correctness, performance, and operational incident unless boundaries contain it. Design the recovery path before increasing concurrency or autonomy.

  • Trigger: working set exceeds VRAM; allocator cannot find a suitable contiguous/block allocation
  • OOM: request fails before launch or during cache growth
  • Thrash: pages or tensors move; latency becomes unstable
  • Low occupancy: register/shared use caps blocks; SMs cannot hide stalls

Containment and recovery: Budget every resident byte, bound sequence and batch growth, preallocate stable pools, and alert on allocator fragmentation before OOM.

Description: Trace the trigger downward into three distinct consequences. Correctness, performance, and operability require different detection and recovery controls; one generic health check cannot represent all three.

The control loop

Production optimization is a feedback system. Signals must be fresh and correctly scoped; decisions need hysteresis or cooldown; actions need bounds; verification must compare the intended metric without hiding regressions elsewhere. If a controller can add load faster than the system can observe the result, it will oscillate.
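A small sketch shows how hysteresis, cooldown, and bounds fit together. The `BatchController` class, its thresholds, and the choice of batch size as the actuated input are assumptions for illustration; the same shape applies to replica counts or route weights.

```python
from dataclasses import dataclass


@dataclass
class BatchController:
    low: float             # grow only while p99 latency is below this (seconds)
    high: float            # shrink once p99 latency exceeds this (seconds)
    cooldown_s: float      # minimum time between actions
    min_batch: int = 1
    max_batch: int = 64
    batch: int = 8
    last_change: float = float("-inf")

    def step(self, p99_latency, now):
        """Return the batch size to use after observing one p99 sample."""
        if now - self.last_change < self.cooldown_s:
            return self.batch                              # still observing the last action
        if p99_latency > self.high:
            target = max(self.min_batch, self.batch // 2)  # back off quickly
        elif p99_latency < self.low:
            target = min(self.max_batch, self.batch + 1)   # probe slowly
        else:
            target = self.batch                            # inside the dead band: hold
        if target != self.batch:
            self.batch, self.last_change = target, now
        return self.batch
```

The gap between `low` and `high` is the hysteresis band, and the cooldown should be at least as long as the time the signal needs to reflect the previous action. Otherwise the controller acts on stale data, which is the oscillation described above.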

Figure: The production control loop. A stable control loop changes bounded inputs and verifies the result against a baseline. Observe, decide, actuate, and verify without letting the controller oscillate.

  • SLO controller: policy + state
  • Signals: HBM bandwidth; SM active; warp stall reasons
  • Decision: memory-bound or compute-bound; admit, tile, or fuse
  • Actuation: change batch/precision; move or fuse operations
  • Verification: profile representative shapes; compare p50 and p99

Safety invariant: never optimize arithmetic while unmeasured data movement dominates.

Description: A safe controller closes the loop. It does not stop after changing a batch size, replica count, route weight, or precision. It checks quality and SLOs, attributes the outcome, and rolls back when the invariant is violated.

What to measure

  • SM active and achieved occupancy
  • HBM and L2 throughput
  • warp stall reasons and register spills
  • kernel launch and synchronization gaps
  • allocated, reserved, and fragmented VRAM

Always segment these measurements by model revision, workload class, hardware type, and outcome. A fleet-wide average can look healthy while one tenant, long-context bucket, adapter, or accelerator generation is failing.
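The effect of segmentation is easy to demonstrate. In this sketch the sample field names (`model`, `workload`, `hardware`, `latency_s`) are hypothetical; map them to whatever labels the telemetry pipeline actually carries.

```python
from collections import defaultdict


def fleet_violation_rate(samples, slo_seconds):
    """Fraction of all samples that exceed the latency SLO."""
    return sum(s["latency_s"] > slo_seconds for s in samples) / len(samples)


def violation_rate_by_segment(samples, slo_seconds, keys=("model", "workload", "hardware")):
    """Same fraction, computed separately for each combination of segment keys."""
    counts = defaultdict(lambda: [0, 0])   # segment -> [violations, total]
    for s in samples:
        segment = tuple(s[k] for k in keys)
        counts[segment][0] += int(s["latency_s"] > slo_seconds)
        counts[segment][1] += 1
    return {segment: bad / total for segment, (bad, total) in counts.items()}
```

If 95 short-context requests meet the SLO and 5 long-context requests all miss it, the fleet-wide rate is 5% while the long-context segment is failing completely. An alert on the fleet average would stay quiet.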

From laboratory result to production capability

A laboratory result proves that one configuration worked once. A production capability proves that the same contract survives concurrency, skew, partial failure, deployment, and rollback. Record the complete experiment envelope: hardware SKU and topology, driver and runtime versions, model and tokenizer digests, request distribution, warm-up policy, concurrency, precision, and every non-default control. Without that envelope, a performance number is not reproducible evidence.

Separate floor latency, sustainable throughput, and recovery capacity. Floor latency is measured with no queue. Sustainable throughput is the highest rate that keeps queue age and SLO violations bounded over a long run. Recovery capacity is the spare capacity left to absorb work after a replica, link, node, or dependency is lost. These are different numbers. Peak throughput is usually above the sustainable point and says little about safe production capacity.
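Two of these numbers can be derived mechanically from load-test results. The sketch assumes each run is summarized as a dict with hypothetical fields `rate`, `queue_age_p99_s`, and `violation_rate`, and that replicas are interchangeable, which real fleets often are not.

```python
def sustainable_rate(runs, max_queue_age_s, max_violation_rate):
    """Highest tested rate whose queue age and SLO violations stayed in bounds."""
    ok = [r["rate"] for r in runs
          if r["queue_age_p99_s"] <= max_queue_age_s
          and r["violation_rate"] <= max_violation_rate]
    return max(ok, default=None)


def recovery_capacity(sustainable_per_replica, replicas, offered_rate, lost=1):
    """Spare rate left after losing `lost` replicas; negative means overload."""
    return sustainable_per_replica * (replicas - lost) - offered_rate
```

A negative recovery capacity means the fleet is already planned past the point where it can lose a replica, even if every dashboard is green today.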

Roll out in stages. First shadow inputs where policy permits, then canary a narrow workload slice, then increase traffic while comparing quality and operational distributions with the baseline. Make the rollback trigger machine-readable before rollout begins. A rollback that requires an operator to rediscover the previous model, state schema, or runtime image is not a rollback plan.

Debugging order

Debug from the outside inward. Confirm the request identity and deadline, then measure admission and queueing, then state lookup or transfer, then execution, then serialization and downstream delivery. Correlate all five with one trace identity. This order prevents a common mistake: optimizing the most visible kernel while the actual delay is a queue, a copy, a collective, a storage read, or a retry outside the profiler window.
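Once every stage boundary is stamped under one trace identity, finding the dominant stage is a subtraction. The stage names and the `<stage>_done` timestamp convention below are assumptions for this sketch; they cover the four timed stages that follow the identity and deadline check.

```python
STAGES = ("admission_queue", "state", "execute", "deliver")


def stage_durations(timestamps):
    """Per-stage durations from a 'start' time and one '<stage>_done' time each."""
    durations, previous = {}, timestamps["start"]
    for stage in STAGES:
        done = timestamps[f"{stage}_done"]
        durations[stage] = done - previous
        previous = done
    return durations


def dominant_stage(timestamps):
    """Name of the stage that consumed the most time for this trace."""
    durations = stage_durations(timestamps)
    return max(durations, key=durations.get)
```

For a trace where admission and queueing take 800 ms of a 1000 ms request and execution takes 100 ms, `dominant_stage` returns `"admission_queue"`. Tuning the kernel in that case could recover at most a tenth of the latency.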

Change one independent variable at a time and retain the raw samples. If a change improves the median but damages the p99, quality, or recovery time, it is not an unconditional improvement. Explain which workload segment benefits and encode that scope in routing or policy instead of applying the change globally.
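Retaining raw samples makes the median-versus-tail check mechanical. This sketch uses a nearest-rank percentile for simplicity; the function names are illustrative, and other percentile definitions will give slightly different values on small samples.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compare(baseline, candidate, quantiles=(50, 99)):
    """Candidate minus baseline at each quantile; negative means faster."""
    return {q: percentile(candidate, q) - percentile(baseline, q) for q in quantiles}


def unconditional_improvement(baseline, candidate, quantiles=(50, 99)):
    """True only if the candidate is no worse at every quantile checked."""
    return all(delta <= 0 for delta in compare(baseline, candidate, quantiles).values())
```

A candidate that lowers the median but raises the p99 fails `unconditional_improvement`. That is the cue to scope the change to the workload segment that benefits rather than apply it everywhere.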

Design-review checklist

  • Is every artifact and state transition bound to a stable version or digest?
  • Where does work wait, what bounds that queue, and what happens at the bound?
  • Which failures are retryable, and how are deadline and idempotency preserved?
  • Which resource saturates first under representative load?
  • Can operators distinguish correctness failure from overload and dependency failure?
  • Does rollback restore both code and state compatibility?
  • Are sensitive inputs, outputs, credentials, and telemetry scoped and redacted?
  • Has the recovery path been tested under partial failure rather than described only on paper?

The takeaway

A GPU is not a faster CPU. It is a throughput machine whose many streaming multiprocessors hide latency by keeping many warps eligible while a hierarchy of registers, shared memory, caches, and device DRAM feeds them. The engineering discipline is to make that claim measurable: define the contract, map state and work to real resources, test the failure boundary, and operate a feedback loop that protects correctness before chasing peak throughput.