Memory Offloading: Trading Bandwidth for Capacity
GPU memory is fast and scarce. CPU memory is larger and slower. NVMe is larger again and slower still. Memory offloading lets a model or cache exceed HBM capacity by turning the memory hierarchy into an explicit part of inference.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A chef keeps today’s ingredients on the counter, tomorrow’s in the refrigerator, and bulk stock in the storeroom. Capacity grows, but every trip away from the counter adds delay unless the next ingredient is fetched early.
What actually happens
Weight offloading streams model layers from CPU RAM or NVMe into GPU memory, computes the layer, and releases or replaces it. Prefetching overlaps transfer of the next layer with compute on the current layer.
KV offloading moves cold or reusable cache blocks out of HBM while active decode blocks remain local. Reloading a prefix can still beat recomputing it, but only when transfer latency and bandwidth are favorable.
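The reload-versus-recompute comparison can be sketched as a small timing check. This is a minimal illustration, not a production policy: the function name, the fixed per-transfer latency term, and the sustained FLOP rate are all assumptions, and a real system would use measured values for each.

```python
def reload_beats_recompute(block_bytes, link_bytes_per_s, link_latency_s,
                           recompute_flops, gpu_flops_per_s):
    """Return True when reloading a KV block is estimated to be faster
    than recomputing it. All inputs are caller-supplied estimates."""
    # Reload cost: fixed latency plus bytes over the effective link bandwidth.
    reload_s = link_latency_s + block_bytes / link_bytes_per_s
    # Recompute cost: prefill FLOPs over the sustained (not peak) GPU rate.
    recompute_s = recompute_flops / gpu_flops_per_s
    return reload_s < recompute_s


if __name__ == "__main__":
    # Hypothetical numbers: 64 MiB block, 20 GB/s link, 50 us latency.
    print(reload_beats_recompute(64 * 2**20, 20e9, 50e-6, 5e12, 200e12))
```

The same block can flip from "reload" to "recompute" when the link is shared or the block sits on a slower tier, so the inputs should come from measurements under load.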
Pinned host memory enables faster DMA but consumes a limited operating-system resource. NUMA placement, PCIe generation, Grace Hopper coherent links, NVMe queue depth, and concurrent traffic all change the result.
A worked example
A 70B-parameter BF16 model needs about 140 GB for weights (2 bytes per parameter). A single 80 GB GPU cannot hold that, let alone the KV cache on top. Offloading can stream weights from 256 GB of host memory, but every decode step reads every layer, so the non-resident weights cross the link once per step. Without prefetch and a batch large enough to amortize those transfers, time per output token (TPOT) is dominated by PCIe transfer time.
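The arithmetic behind this example can be checked in a few lines. The sketch below assumes the worst case in which all weights are streamed each step, and an effective link bandwidth of 25 GB/s; that bandwidth is an illustrative assumption, not a figure from this article, and should be replaced with a measured value.

```python
PARAMS = 70e9                      # 70B parameters
BYTES_PER_PARAM = 2                # BF16
ASSUMED_LINK_BYTES_PER_S = 25e9    # assumption: effective host-to-GPU bandwidth


def weight_bytes(params, bytes_per_param):
    """Total weight storage in bytes."""
    return params * bytes_per_param


def streamed_step_seconds(total_bytes, link_bytes_per_s, resident_bytes=0.0):
    """Transfer time for one decode step if non-resident weights are streamed once."""
    return max(0.0, total_bytes - resident_bytes) / link_bytes_per_s


def transfer_seconds_per_token(step_seconds, batch_size):
    """One step produces one token per sequence, so transfer cost is shared by the batch."""
    return step_seconds / batch_size


if __name__ == "__main__":
    total = weight_bytes(PARAMS, BYTES_PER_PARAM)
    step = streamed_step_seconds(total, ASSUMED_LINK_BYTES_PER_S)
    print(f"weights: {total / 1e9:.0f} GB, transfer per step: {step:.2f} s")
    for batch in (1, 8, 32):
        print(batch, transfer_seconds_per_token(step, batch))
```

Under these assumptions an unhidden step costs several seconds of transfer, which is why keeping some layers resident and batching requests matter so much.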
The performance model
Offloading is a capacity technique first. Throughput depends on whether transfer is hidden behind compute. A roofline-style check compares bytes transferred per step with effective link bandwidth and compute time available for overlap.
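A roofline-style check of this kind might look like the following sketch. It assumes ideal overlap, so the step time is the larger of transfer time and compute time; the function names are hypothetical, and real overlap is usually worse because of launch overhead and contention.

```python
def transfer_seconds(bytes_per_step, effective_bytes_per_s):
    """Time to move one step's bytes at the measured effective bandwidth."""
    return bytes_per_step / effective_bytes_per_s


def exposed_transfer_seconds(bytes_per_step, effective_bytes_per_s, compute_seconds):
    """Transfer time that cannot be hidden behind compute (0.0 if fully hidden)."""
    return max(0.0, transfer_seconds(bytes_per_step, effective_bytes_per_s) - compute_seconds)


def ideal_step_seconds(bytes_per_step, effective_bytes_per_s, compute_seconds):
    """Best-case step time under perfect overlap of transfer and compute."""
    return max(transfer_seconds(bytes_per_step, effective_bytes_per_s), compute_seconds)


def bound(bytes_per_step, effective_bytes_per_s, compute_seconds):
    """Label the step by which resource sets its duration."""
    exposed = exposed_transfer_seconds(bytes_per_step, effective_bytes_per_s, compute_seconds)
    return "transfer-bound" if exposed > 0 else "compute-bound"
```

A step labeled transfer-bound gains nothing from faster kernels; only fewer bytes, a faster link, or more compute per byte (a larger batch) will help.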
Expert lens
Weights and KV cache have different reuse patterns. Weights are read in a deterministic layer order every step; KV blocks are request-specific and may be cold for long periods. They need different prefetch and eviction policies.
Where it wins
- Models or contexts that otherwise cannot fit
- Batch throughput that amortizes weight transfer
- Reusable KV prefixes on fast CPU-GPU links
Where it disappoints
- Assuming capacity gain implies speed gain
- Over-allocating pinned host memory
- Ignoring NUMA and PCIe topology
- Reloading KV blocks slower than recomputation
Production checklist
- Measure effective transfer bandwidth
- Separate weight and KV policies
- Pin and place host memory deliberately
- Prefetch against profiled compute windows
- Test eviction storms under burst load
What to measure
- HBM, host, and NVMe occupancy
- Offload and reload bytes per second
- Transfer overlap percentage
- Cache reload versus recompute latency
- Page faults, pinned bytes, and queue depth
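Transfer overlap percentage can be derived from profiler timestamps. The sketch below is one possible definition, assuming intervals are (start, end) pairs in seconds and that compute intervals do not overlap one another; the function name is hypothetical.

```python
def overlap_percentage(transfer_intervals, compute_intervals):
    """Percent of total transfer time that coincides with GPU compute.

    Assumes compute intervals are mutually non-overlapping, otherwise
    hidden time would be counted more than once.
    """
    total = sum(end - start for start, end in transfer_intervals)
    if total == 0:
        return 100.0  # no transfer happened, so nothing is exposed
    hidden = 0.0
    for t_start, t_end in transfer_intervals:
        for c_start, c_end in compute_intervals:
            # Length of the intersection of the two intervals, if any.
            hidden += max(0.0, min(t_end, c_end) - max(t_start, c_start))
    return 100.0 * hidden / total
```

A low percentage with a busy link points at prefetch scheduling; a low percentage with an idle link points at transfers being issued too late.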
From one GPU to a production service
One process can treat host RAM as private. A production node may run multiple model replicas, networking agents, and storage daemons. A node-level allocator should assign pinned and pageable host budgets rather than letting each engine reserve optimistically.
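The bookkeeping such an allocator needs can be sketched as follows. This is an in-process illustration only: the class and method names are hypothetical, and a real node-level coordinator would have to live in a shared daemon or use inter-process coordination, which is not shown here.

```python
class NodeHostBudget:
    """Tracks pinned and pageable host-memory grants against fixed node limits."""

    def __init__(self, pinned_limit_bytes, pageable_limit_bytes):
        self._limits = {"pinned": pinned_limit_bytes, "pageable": pageable_limit_bytes}
        self._used = {"pinned": 0, "pageable": 0}
        self._grants = {}  # (engine_id, kind) -> granted bytes

    def available(self, kind):
        """Bytes still grantable for the given kind ('pinned' or 'pageable')."""
        return self._limits[kind] - self._used[kind]

    def reserve(self, engine_id, kind, nbytes):
        """Grant nbytes to an engine, or refuse if the node budget would be exceeded."""
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        if nbytes > self.available(kind):
            return False  # engine must shrink its cache or fall back to another tier
        self._used[kind] += nbytes
        key = (engine_id, kind)
        self._grants[key] = self._grants.get(key, 0) + nbytes
        return True

    def release(self, engine_id, kind):
        """Return everything an engine holds of one kind; reports bytes freed."""
        nbytes = self._grants.pop((engine_id, kind), 0)
        self._used[kind] -= nbytes
        return nbytes
```

The design point is that a refusal is an explicit, early signal to the engine, rather than a late allocation failure somewhere else on the node.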
Remote or distributed KV tiers add consistency and security requirements. Each block needs a model identity, tenant salt, checksum, and expiry. A cache tier outage should cause recomputation or a controlled miss, not make inference unavailable.
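One way to carry those four fields is sketched below, using only the Python standard library. The names `block_key`, `CachedBlock`, `make_block`, and `load_or_miss` are hypothetical, and the SHA-256 keying scheme is an assumption chosen for illustration. Any failed check returns None so the caller can recompute instead of failing the request.

```python
import hashlib
from dataclasses import dataclass


def block_key(model_id, tenant_salt, token_ids):
    """Derive a cache key bound to the model, the tenant, and the token prefix."""
    h = hashlib.sha256()
    for part in (model_id.encode("utf-8"), tenant_salt):
        h.update(len(part).to_bytes(4, "little"))  # length prefix avoids ambiguity
        h.update(part)
    for token in token_ids:
        h.update(token.to_bytes(8, "little"))
    return h.hexdigest()


@dataclass(frozen=True)
class CachedBlock:
    key: str
    payload: bytes
    checksum: str
    expires_at: float  # seconds on the caller's clock


def make_block(key, payload, ttl_s, now):
    """Wrap a payload with its checksum and expiry time."""
    return CachedBlock(key, payload, hashlib.sha256(payload).hexdigest(), now + ttl_s)


def load_or_miss(block, expected_key, now):
    """Return the payload, or None for a controlled miss (caller recomputes)."""
    if block is None or block.key != expected_key:
        return None  # tier outage, absent entry, or wrong model/tenant
    if now >= block.expires_at:
        return None  # expired
    if hashlib.sha256(block.payload).hexdigest() != block.checksum:
        return None  # corrupted in storage or transit
    return block.payload
```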
Autoscaling offloaded models is slower because warmup includes weight movement and cache initialization. Keep warm capacity, pre-stage artifacts, and expose readiness only after the first production-shaped path succeeds.
Design-review questions
- Who coordinates offload memory across processes?
- What traffic is allowed to depend on remote cache?
- Is reload faster than recompute for each block class?
- How long does a cold replica take to become useful?
- What happens when CPU or NVMe bandwidth is saturated?
How it connects to the rest of the series
KV caching supplies offloadable state. PagedAttention defines cache blocks. Prefill-decode disaggregation transfers KV between GPU pools, while mixed precision reduces every tier’s byte count.
From equation to implementation
For weight streaming, the necessary condition for fully hidden transfer is transfer_time(next layer) <= compute_time(current layer). If a layer has W bytes and effective link bandwidth B, transfer takes at least W/B before software overhead. Small batches shorten compute and make hiding harder.
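The hiding condition can be turned into a check against profiled compute times. In this sketch, `compute_s_by_batch` is a hypothetical mapping from batch size to measured per-layer compute time, and `overhead_s` stands in for software overhead; both are assumptions about how profiling data is supplied.

```python
def transfer_is_hidden(layer_bytes, link_bytes_per_s, compute_s, overhead_s=0.0):
    """True when W/B plus overhead fits inside the current layer's compute time."""
    return layer_bytes / link_bytes_per_s + overhead_s <= compute_s


def smallest_hiding_batch(layer_bytes, link_bytes_per_s, compute_s_by_batch, overhead_s=0.0):
    """Smallest profiled batch size whose compute time hides the next layer's transfer.

    Returns None when no profiled batch size is large enough.
    """
    for batch in sorted(compute_s_by_batch):
        if transfer_is_hidden(layer_bytes, link_bytes_per_s,
                              compute_s_by_batch[batch], overhead_s):
            return batch
    return None


if __name__ == "__main__":
    # Hypothetical profile: 1.75 GB per layer, 25 GB/s link, made-up compute times.
    profile = {1: 0.002, 64: 0.010, 512: 0.080}
    print(smallest_hiding_batch(1.75e9, 25e9, profile))
```

When the result is None, the transfer sits on the token path at every profiled batch size, and the options are fewer streamed bytes or a faster link.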
KV offload has a different decision: reload cost versus recompute cost versus cache miss penalty. A retention score can combine expected reuse probability, reload bytes, recompute FLOPs, tenant priority, and recency instead of using plain LRU.
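A retention score along these lines could be sketched as below. The formula is an illustrative assumption, not a recommended weighting: it estimates the cost avoided by keeping a block (reuse probability times the cheaper of reload and recompute), scaled by tenant priority and an exponential recency decay. Reload bytes and recompute FLOPs are assumed to have been converted to seconds by the caller.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    reuse_probability: float   # estimated chance the block is needed again
    reload_seconds: float      # reload bytes / effective bandwidth
    recompute_seconds: float   # recompute FLOPs / sustained GPU rate
    tenant_priority: float     # 1.0 = default weight
    seconds_since_use: float


def retention_score(block, recency_half_life_s=60.0):
    """Higher score means the block is more valuable to keep in fast memory."""
    # If evicted and needed again, the system pays the cheaper recovery path.
    recovery_s = min(block.reload_seconds, block.recompute_seconds)
    decay = 0.5 ** (block.seconds_since_use / recency_half_life_s)
    return block.tenant_priority * block.reuse_probability * decay * recovery_s


def eviction_order(blocks):
    """Block ids sorted from first-to-evict (lowest score) to last."""
    return sorted(blocks, key=lambda block_id: retention_score(blocks[block_id]))
```

Unlike plain LRU, a block that is old but expensive to recover can outrank a recent block that is cheap to rebuild.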
Implementation sketch
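    # Weight streaming with double buffering (pseudocode, lookahead of one layer)
    launch_prefetch(0, target=active_buffer)          # prime the pipeline
    for layer in range(num_layers):
        await prefetched[layer]                       # weights for this layer are resident
        if layer + 1 < num_layers:
            launch_prefetch(layer + 1, target=inactive_buffer)   # overlaps with compute below
        input = run_layer(layer, input, active_buffer)
        release_or_demote(layer, active_buffer)
        swap(active_buffer, inactive_buffer)

    # KV eviction: offload only blocks worth reloading later
    for kv_block in eviction_candidates:
        if expected_reload_value(kv_block) > offload_cost(kv_block):
            copy_async(kv_block, host_cache)
        else:
            evict(kv_block)

Capacity planning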
Reserve host memory for the operating system, page cache, network buffers, and other workers. Pinned memory is not ordinary RAM: it is page-locked, so the kernel cannot swap or reclaim it, and excessive pinning starves everything else on the node. NVMe endurance and shared bandwidth also matter for sustained batch workloads.
Benchmarking without fooling yourself
- Measure effective bandwidth with concurrent production-like transfers.
- Compare no offload, CPU offload, and NVMe offload at equal quality.
- Sweep batch size to reveal transfer amortization.
- Record compute-transfer overlap and GPU idle gaps.
A production failure to design for
Several replicas on one node independently allocate huge pinned host caches. The node stays below nominal RAM capacity, but network and storage buffers can no longer be allocated, causing broad instability. Coordinate offload budgets at the node level.
The takeaway
Offloading lets capacity exceed HBM, but bandwidth becomes the new budget. A good design knows exactly which transfer is hidden and which one sits on the token path.
