15/20 - Memory Offloading: Trading Bandwidth for Capacity
GPU memory is fast and scarce. CPU memory is larger and slower. NVMe is larger again and slower still. Memory offloading lets a model or cache exceed HBM capacity by turning the memory hierarchy into an explicit part of inference.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A chef keeps today’s ingredients on the counter, tomorrow’s in the refrigerator, and bulk stock in the storeroom. Capacity grows, but every trip away from the counter adds delay unless the next ingredient is fetched early.
How to read this diagram: Start with GPU HBM, which holds the active weights. The middle stage, the offload manager, prefetches the next data. The final stage, CPU memory or NVMe, shows the observable result: larger state is stored off the GPU. The arrows describe dependency order, not necessarily separate services.
What actually happens
Weight offloading streams model layers from CPU RAM or NVMe into GPU memory, computes the layer, and releases or replaces it. Prefetching overlaps transfer of the next layer with compute on the current layer.
KV offloading moves cold or reusable cache blocks out of HBM while active decode blocks remain local. Reloading a prefix can still beat recomputing it, but only when transfer latency and bandwidth are favorable.
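The reload-versus-recompute comparison can be framed as a rough break-even check. The sketch below is illustrative only: the function names are hypothetical, the numbers are assumed rather than measured, and the recompute estimate ignores attention cost that grows with context length.

```python
def reload_seconds(block_bytes: float, link_bytes_per_s: float, latency_s: float = 0.0) -> float:
    """Lower bound on the time to bring a KV block back into HBM."""
    return latency_s + block_bytes / link_bytes_per_s


def recompute_seconds(prefix_tokens: int, flops_per_token: float, gpu_flops_per_s: float) -> float:
    """Rough prefill time to rebuild the same KV state from its tokens."""
    return prefix_tokens * flops_per_token / gpu_flops_per_s


def should_reload(block_bytes, link_bytes_per_s, latency_s,
                  prefix_tokens, flops_per_token, gpu_flops_per_s) -> bool:
    """Reload only when it is expected to be faster than recomputing."""
    return (reload_seconds(block_bytes, link_bytes_per_s, latency_s)
            < recompute_seconds(prefix_tokens, flops_per_token, gpu_flops_per_s))


# Assumed numbers: a 4096-token prefix, 320 KiB of KV per token,
# about 140 GFLOP of prefill work per token, 400 TFLOP/s achieved on the GPU.
PREFIX_TOKENS = 4096
KV_BYTES = PREFIX_TOKENS * 320 * 1024

# A fast CPU-GPU link (assumed 25 GB/s effective, 50 microseconds latency).
fast_link = should_reload(KV_BYTES, 25e9, 50e-6, PREFIX_TOKENS, 140e9, 400e12)
# A slow, contended tier (assumed 0.5 GB/s effective, 1 ms latency).
slow_tier = should_reload(KV_BYTES, 0.5e9, 1e-3, PREFIX_TOKENS, 140e9, 400e12)
print(fast_link, slow_tier)
```

Under these assumptions the fast link favors reloading and the slow tier favors recomputing, which is why the decision has to be made per tier with measured bandwidth.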
Pinned host memory enables faster DMA but consumes a limited operating-system resource. NUMA placement, PCIe generation, Grace Hopper coherent links, NVMe queue depth, and concurrent traffic all change the result.
A worked example
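A 70B-parameter BF16 model needs about 140 GB of weight storage (70 billion parameters at 2 bytes each). A single 80 GB GPU cannot hold it, let alone the KV cache alongside it. Offloading can stream weights from 256 GB of host memory, but every generated token revisits all layers, so the weights that are not resident in HBM cross the link on every decode step. Without prefetch and a batch large enough to amortize those transfers, time per output token (TPOT) is dominated by PCIe transfer time.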
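The arithmetic behind this example can be checked in a few lines. This is a back-of-the-envelope sketch: the 60 GB of weights kept resident in HBM and the 25 GB/s effective link bandwidth are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Storage needed for the weights alone."""
    return num_params * bytes_per_param


def streamed_bytes_per_step(total_bytes: float, resident_bytes: float) -> float:
    """Bytes that must cross the link on every decode step."""
    return max(0.0, total_bytes - resident_bytes)


def transfer_bound_step_seconds(streamed_bytes: float, link_bytes_per_s: float) -> float:
    """Lower bound on one decode step when the transfer is not hidden."""
    return streamed_bytes / link_bytes_per_s


total = weight_bytes(70e9, 2)                          # BF16: 2 bytes per parameter
streamed = streamed_bytes_per_step(total, 60e9)        # assumed: 60 GB stays in HBM
step_s = transfer_bound_step_seconds(streamed, 25e9)   # assumed: 25 GB/s effective

for batch in (1, 8, 64):
    # One step yields one token per sequence, so batching raises aggregate
    # throughput while the per-sequence step time stays the same.
    print(f"batch={batch:3d}  step={step_s:.2f}s  tokens/s={batch / step_s:.2f}")
```

The step time is the same for every batch size in this model; only the aggregate token rate improves, which is the sense in which batching amortizes the transfer.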
The performance model
Offloading is a capacity technique first. Throughput depends on whether transfer is hidden behind compute. A roofline-style check compares bytes transferred per step with effective link bandwidth and compute time available for overlap.
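One way to make the overlap question concrete is a small timeline model. The sketch below assumes two device buffers, a single serialized link, and per-layer times supplied by the caller; `step_time` is a hypothetical helper, not part of any framework.

```python
from typing import Sequence


def step_time(compute_s: Sequence[float], transfer_s: Sequence[float], prefetch: bool) -> float:
    """Time for one pass over all layers when weights are streamed.

    With prefetch, the copy for layer i may start once the previous copy has
    finished and the buffer used by layer i - 2 is free (two buffers assumed).
    Without prefetch, each copy starts only after the previous layer computed.
    """
    xfer_done = 0.0        # when the most recent copy finished
    comp_done = 0.0        # when the most recent layer finished computing
    prev_comp_done = 0.0   # when the layer before that finished
    for c, t in zip(compute_s, transfer_s):
        start = max(xfer_done, prev_comp_done) if prefetch else comp_done
        xfer_done = start + t
        prev_comp_done = comp_done
        comp_done = max(xfer_done, comp_done) + c
    return comp_done


# Compute-bound layers: transfers hide behind compute after the first copy.
hidden = step_time([2.0] * 4, [1.0] * 4, prefetch=True)
serial = step_time([2.0] * 4, [1.0] * 4, prefetch=False)
# Transfer-bound layers: the link sets the pace even with prefetch.
link_bound = step_time([1.0] * 4, [2.0] * 4, prefetch=True)
print(hidden, serial, link_bound)
```

In the compute-bound case only the first copy is exposed; in the transfer-bound case the step time approaches the sum of the transfers, so prefetching helps but cannot remove the link from the critical path.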
How to read this diagram: The left panel asks how memory offloading changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
Expert lens
Weights and KV cache have different reuse patterns. Weights are read in a deterministic layer order every step; KV blocks are request-specific and may be cold for long periods. They need different prefetch and eviction policies.
How to read this diagram: The left panel is the baseline, HBM-only serving, characterized by the lowest access latency and a strict capacity ceiling. The right panel applies tiered offloading, which changes the cost profile: larger effective capacity, paid for with transfer and prefetch cost. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.
Where it wins
- Models or contexts that otherwise cannot fit
- Batch throughput that amortizes weight transfer
- Reusable KV prefixes on fast CPU-GPU links
Where it disappoints
- Assuming capacity gain implies speed gain
- Over-allocating pinned host memory
- Ignoring NUMA and PCIe topology
- Reloading KV blocks slower than recomputation
Production checklist
- Measure effective transfer bandwidth
- Separate weight and KV policies
- Pin and place host memory deliberately
- Prefetch against profiled compute windows
- Test eviction storms under burst load
What to measure
- HBM, host, and NVMe occupancy
- Offload and reload bytes per second
- Transfer overlap percentage
- Cache reload versus recompute latency
- Page faults, pinned bytes, and queue depth
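Of the metrics above, transfer overlap percentage is the least obvious to compute. A minimal sketch, assuming the profiler exports (start, end) timestamps for copies and for compute kernels and that compute intervals do not overlap one another; `overlap_fraction` is a hypothetical helper.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_s, end_s)


def overlap_fraction(transfers: Sequence[Interval], computes: Sequence[Interval]) -> float:
    """Fraction of total transfer time that ran concurrently with compute."""
    total = sum(end - start for start, end in transfers)
    if total == 0:
        return 1.0  # nothing was transferred, so nothing was exposed
    hidden = 0.0
    for t_start, t_end in transfers:
        for c_start, c_end in computes:
            # Length of the intersection of the two intervals, if any.
            hidden += max(0.0, min(t_end, c_end) - max(t_start, c_start))
    return hidden / total


transfers = [(0.0, 4.0), (6.0, 8.0)]
computes = [(2.0, 5.0), (5.0, 7.0)]
print(f"{overlap_fraction(transfers, computes):.0%} of transfer time was hidden")
```

The complement of this number is the exposed transfer time, which is the part that shows up as GPU idle gaps.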
From one GPU to a production service
One process can treat host RAM as private. A production node may run multiple model replicas, networking agents, and storage daemons. A node-level allocator should assign pinned and pageable host budgets rather than letting each engine reserve optimistically.
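A node-level allocator of this kind might look like the sketch below. It is an in-process stand-in: a real node would need a daemon or shared state that all engines consult, and the class name, units, and limits here are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NodeHostBudget:
    """Tracks pinned and pageable host memory granted to each engine (bytes)."""
    pinned_limit: int
    pageable_limit: int
    pinned_used: Dict[str, int] = field(default_factory=dict)
    pageable_used: Dict[str, int] = field(default_factory=dict)

    def reserve(self, engine: str, pinned: int, pageable: int) -> bool:
        """Grant the request only if both budgets stay within their limits."""
        if sum(self.pinned_used.values()) + pinned > self.pinned_limit:
            return False
        if sum(self.pageable_used.values()) + pageable > self.pageable_limit:
            return False
        self.pinned_used[engine] = self.pinned_used.get(engine, 0) + pinned
        self.pageable_used[engine] = self.pageable_used.get(engine, 0) + pageable
        return True

    def release(self, engine: str) -> None:
        """Return everything an engine holds, for example when it exits."""
        self.pinned_used.pop(engine, None)
        self.pageable_used.pop(engine, None)


GIB = 1 << 30
budget = NodeHostBudget(pinned_limit=64 * GIB, pageable_limit=128 * GIB)
print(budget.reserve("replica-a", 40 * GIB, 50 * GIB))  # fits
print(budget.reserve("replica-b", 40 * GIB, 10 * GIB))  # refused: pinned budget exceeded
```

The point of the design is that the second replica is told "no" at reservation time instead of discovering the shortage through node instability later.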
Remote or distributed KV tiers add consistency and security requirements. Blocks need model identity, tenant salt, checksum, and expiry. A cache tier outage should cause recomputation or a controlled miss, not make inference unavailable.
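The block metadata and the "controlled miss" behavior can be sketched as follows. The key layout, the `RemoteBlock` type, and the `fetch` callable are assumptions made for illustration; SHA-256 is used here only as a convenient checksum, and a real tier would define its own format.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Optional


def block_key(model_id: str, tenant_salt: bytes, token_ids: tuple) -> str:
    """Key that changes with the model, the tenant, and the token prefix."""
    h = hashlib.sha256()
    for part in (model_id.encode(), tenant_salt, repr(token_ids).encode()):
        h.update(len(part).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        h.update(part)
    return h.hexdigest()


@dataclass(frozen=True)
class RemoteBlock:
    payload: bytes
    checksum: str
    expires_at: float


def make_block(payload: bytes, ttl_s: float, now: Optional[float] = None) -> RemoteBlock:
    now = time.time() if now is None else now
    return RemoteBlock(payload, hashlib.sha256(payload).hexdigest(), now + ttl_s)


def load_or_miss(fetch: Callable[[str], Optional[RemoteBlock]], key: str,
                 now: Optional[float] = None) -> Optional[bytes]:
    """Return the payload, or None so that the caller recomputes."""
    now = time.time() if now is None else now
    try:
        block = fetch(key)
    except Exception:
        return None  # a tier outage becomes a controlled miss
    if block is None or now >= block.expires_at:
        return None  # absent or expired
    if hashlib.sha256(block.payload).hexdigest() != block.checksum:
        return None  # corrupted or tampered payload
    return block.payload
```

Because the tenant salt is part of the key, two tenants sending the same prompt never resolve to the same block, and every failure mode collapses to the same outcome: a miss followed by recomputation.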
Autoscaling offloaded models is slower because warmup includes weight movement and cache initialization. Keep warm capacity, pre-stage artifacts, and expose readiness only after the first production-shaped path succeeds.
Design-review questions
- Who coordinates offload memory across processes?
- What traffic is allowed to depend on remote cache?
- Is reload faster than recompute for each block class?
- How long does a cold replica take to become useful?
- What happens when CPU or NVMe bandwidth is saturated?
How it connects to the rest of the series
KV caching supplies offloadable state. PagedAttention defines cache blocks. Prefill-decode disaggregation transfers KV between GPU pools, while mixed precision reduces every tier’s byte count.
From equation to implementation
For weight streaming, the necessary condition for fully hidden transfer is transfer_time(next layer) <= compute_time(current layer). If a layer has W bytes and effective link bandwidth B, transfer takes at least W/B before software overhead. Small batches shorten compute and make hiding harder.
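The inequality can be turned into a quick check of how much link bandwidth a given compute window demands. The sketch below assumes a 70B BF16 model split evenly across 80 layers (about 1.75 GB per layer) and a 25 GB/s effective link; both figures and the helper names are illustrative assumptions.

```python
def required_bandwidth(layer_bytes: float, compute_s: float, overhead_s: float = 0.0) -> float:
    """Minimum effective bytes/s so that W / B + overhead <= compute time."""
    window = compute_s - overhead_s
    if window <= 0:
        return float("inf")  # no window left to hide the transfer in
    return layer_bytes / window


def is_hidden(layer_bytes: float, link_bytes_per_s: float,
              compute_s: float, overhead_s: float = 0.0) -> bool:
    """True when the next layer's copy fits inside the current layer's compute."""
    return layer_bytes / link_bytes_per_s + overhead_s <= compute_s


LAYER_BYTES = 140e9 / 80   # assumed: 80 equally sized layers
LINK = 25e9                # assumed: 25 GB/s effective

for compute_ms in (10, 100):
    compute_s = compute_ms / 1000
    need = required_bandwidth(LAYER_BYTES, compute_s)
    print(f"compute={compute_ms:3d} ms  need={need / 1e9:.1f} GB/s  "
          f"hidden={is_hidden(LAYER_BYTES, LINK, compute_s)}")
```

With a 10 ms compute window the link would have to sustain far more than the assumed 25 GB/s, while a 100 ms window (a much larger batch) brings the requirement within reach.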
KV offload has a different decision: reload cost versus recompute cost versus cache miss penalty. A retention score can combine expected reuse probability, reload bytes, recompute FLOPs, tenant priority, and recency instead of using plain LRU.
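One possible shape for such a retention score is the expected time saved by keeping a block, discounted by recency and weighted by tenant priority. The formula, the field names, and the 60-second half-life below are assumptions for illustration, not a recommended policy.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    reuse_probability: float   # predicted chance the block is needed again (0..1)
    reload_bytes: int          # bytes to bring back from the lower tier
    recompute_flops: float     # work needed to rebuild the block from tokens
    tenant_priority: float     # weight >= 0; higher means more important
    seconds_since_use: float


def retention_score(b: BlockStats, link_bytes_per_s: float, gpu_flops_per_s: float,
                    half_life_s: float = 60.0) -> float:
    """Expected seconds saved by retaining the block instead of dropping it."""
    reload_s = b.reload_bytes / link_bytes_per_s
    recompute_s = b.recompute_flops / gpu_flops_per_s
    saved_s = max(0.0, recompute_s - reload_s)      # no value if reload is slower
    recency = 0.5 ** (b.seconds_since_use / half_life_s)
    return b.tenant_priority * b.reuse_probability * recency * saved_s


expensive_prefix = BlockStats(0.8, 1_000_000_000, 5e14, 1.0, 30.0)
cheap_tensor = BlockStats(0.8, 1_000_000_000, 1e12, 1.0, 30.0)
for name, block in (("expensive_prefix", expensive_prefix), ("cheap_tensor", cheap_tensor)):
    print(name, round(retention_score(block, 25e9, 4e14), 3))
```

The two blocks have identical size, recency, and reuse probability, so plain LRU cannot tell them apart; the score separates them because only one of them is expensive to rebuild.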
Implementation sketch
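    # Weight streaming over a ring of lookahead + 1 device buffers (pseudocode).
    num_buffers = lookahead + 1
    for i in range(min(lookahead, num_layers)):        # prime the pipeline
        launch_prefetch(i, target=buffers[i % num_buffers])
    for i in range(num_layers):
        wait(prefetched[i])                            # stalls only if the copy is late
        if i + lookahead < num_layers:                 # refill the buffer layer i - 1 released
            launch_prefetch(i + lookahead, target=buffers[(i + lookahead) % num_buffers])
        x = run_layer(i, x, buffers[i % num_buffers])
        release_or_demote(i, buffers[i % num_buffers])

    # KV eviction: keep a block in the host tier only when reloading it is worth the copy.
    for kv_block in eviction_candidates:
        if expected_reload_value(kv_block) > offload_cost(kv_block):
            copy_async(kv_block, host_cache)           # demote to host memory
        else:
            evict(kv_block)                            # drop it and recompute on a miss
Capacity planning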
Reserve host memory for the operating system, page cache, network buffers, and other workers. Pinned memory is not ordinary RAM: excessive pinning can harm the node. NVMe endurance and shared bandwidth also matter for sustained batch workloads.
Benchmarking without fooling yourself
- Measure effective bandwidth with concurrent production-like transfers.
- Compare no offload, CPU offload, and NVMe offload at equal quality.
- Sweep batch size to reveal transfer amortization.
- Record compute-transfer overlap and GPU idle gaps.
A production failure to design for
Several replicas on one node independently allocate huge pinned host caches. The node remains below nominal RAM capacity but network and storage buffers cannot allocate, causing broad instability. Coordinate offload budgets at node level.
How to read this diagram: The operating cycle moves from Inventory to Classify, then Overlap and Control. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
Deeper engineering guide
Offloading creates a hierarchy: HBM for imminent compute, host DRAM for warm state, local NVMe for colder state, and sometimes remote storage for durable artifacts. Moving a tensor is worthwhile only when transfer plus restore costs less than recomputation and completes before the request deadline.
How to read this diagram: Follow the state from Classify through Evict and Prefetch to Consume. Each box is an ownership or computation boundary. In particular, every move needs a deadline, bandwidth budget, and cancellation path. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.
Bandwidth is shared. Aggressive KV spill can compete with model loading, collectives, storage writes, and active decode. Limit concurrent transfers per link and prioritize state required by near-deadline requests. Pinned host memory improves DMA but is finite and affects the operating system.
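Limiting in-flight transfers and ordering them by deadline can be sketched with a small priority queue. `LinkScheduler` and `Transfer` are hypothetical names, and the sketch models only admission order; it does not issue real copies.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Transfer:
    deadline_s: float                      # only field used for ordering
    name: str = field(compare=False)
    nbytes: int = field(compare=False)


class LinkScheduler:
    """Caps concurrent transfers on one link and starts the most urgent first."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight: List[Transfer] = []
        self._queue: List[Transfer] = []

    def submit(self, transfer: Transfer) -> None:
        heapq.heappush(self._queue, transfer)

    def dispatch(self) -> List[Transfer]:
        """Start queued transfers, earliest deadline first, up to the cap."""
        started = []
        while self._queue and len(self.in_flight) < self.max_in_flight:
            transfer = heapq.heappop(self._queue)
            self.in_flight.append(transfer)
            started.append(transfer)
        return started

    def complete(self, name: str) -> None:
        self.in_flight = [t for t in self.in_flight if t.name != name]


sched = LinkScheduler(max_in_flight=2)
sched.submit(Transfer(0.5, "kv-spill", 512 << 20))
sched.submit(Transfer(0.1, "decode-weights", 256 << 20))
sched.submit(Transfer(0.3, "prefix-restore", 128 << 20))
print([t.name for t in sched.dispatch()])   # the two most urgent transfers
sched.complete("decode-weights")
print([t.name for t in sched.dispatch()])   # the spill runs once a slot frees up
```

The background spill is admitted last, so it cannot delay state that a near-deadline request is waiting on.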
How to read this diagram: The bars compare HBM-only with HBM + host tier on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (“Cold state stops evicting immediately useful tensors”) remains larger than the risk (“A late prefetch stalls the compute stream on transfer”) under production traffic.
Eviction value should include bytes, predicted reuse, restore latency, and recompute time. Plain LRU can retain a huge cheap-to-recompute tensor while evicting a small expensive prefix. Admission to a lower tier also needs limits or the tier becomes a slower unbounded queue.
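The contrast with LRU, and the need for an admission limit, can be shown with a toy tier. The sketch assumes each entry already carries an `expected_saved_s` estimate (predicted reuse times the gap between recompute and restore time); the names and numbers are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    name: str
    size_bytes: int
    last_used: float          # timestamp of the most recent access
    expected_saved_s: float   # assumed precomputed value estimate


def density(e: Entry) -> float:
    """Expected seconds saved per byte of tier capacity occupied."""
    return e.expected_saved_s / e.size_bytes


def lru_victim(entries: List[Entry]) -> Entry:
    return min(entries, key=lambda e: e.last_used)


def value_victim(entries: List[Entry]) -> Entry:
    return min(entries, key=density)


def admit(entries: List[Entry], candidate: Entry, capacity_bytes: int) -> Optional[List[Entry]]:
    """Return the new tier contents, or None if the candidate is refused."""
    kept = list(entries)
    used = sum(e.size_bytes for e in kept)
    while used + candidate.size_bytes > capacity_bytes:
        if not kept:
            return None                      # candidate cannot fit at all
        victim = value_victim(kept)
        if density(victim) >= density(candidate):
            return None                      # never evict something more valuable
        kept.remove(victim)
        used -= victim.size_bytes
    return kept + [candidate]


tier = [
    Entry("big_cheap_tensor", 8_000, last_used=100.0, expected_saved_s=0.01),
    Entry("small_expensive_prefix", 500, last_used=10.0, expected_saved_s=1.2),
]
print(lru_victim(tier).name)    # LRU would drop the expensive prefix
print(value_victim(tier).name)  # value density drops the cheap tensor instead
```

Refusing low-value candidates in `admit` is what keeps the lower tier bounded instead of turning it into a slow queue of everything that was ever evicted.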
How to read this diagram: State advances from Resident to Migrating, Cold, and finally Restoring. The labels below each state identify what becomes true at that boundary. The governing invariant is: Only one authoritative version is writable; transfers publish atomically. Retries and cancellation must preserve the same transition rules.
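The state progression and its invariant can be expressed as an explicit transition table. In this sketch the two cancellation edges and the rule that only the resident copy is writable are assumptions added for illustration; the four state names come from the diagram description.

```python
from enum import Enum, auto


class TierState(Enum):
    RESIDENT = auto()
    MIGRATING = auto()
    COLD = auto()
    RESTORING = auto()


ALLOWED = {
    (TierState.RESIDENT, TierState.MIGRATING),
    (TierState.MIGRATING, TierState.COLD),       # offload copy published atomically
    (TierState.MIGRATING, TierState.RESIDENT),   # assumed cancellation edge
    (TierState.COLD, TierState.RESTORING),
    (TierState.RESTORING, TierState.RESIDENT),   # restore published atomically
    (TierState.RESTORING, TierState.COLD),       # assumed cancellation edge
}


class TieredBlock:
    def __init__(self) -> None:
        self.state = TierState.RESIDENT

    @property
    def writable(self) -> bool:
        # Assumption: only the HBM-resident copy is authoritative for writes.
        return self.state is TierState.RESIDENT

    def transition(self, new_state: TierState) -> None:
        """Apply a transition, rejecting anything outside the table."""
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state


block = TieredBlock()
block.transition(TierState.MIGRATING)
block.transition(TierState.COLD)
print(block.state.name, block.writable)
```

Routing retries and cancellations through the same `transition` method is what keeps them subject to the same rules as the normal path.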
How to read this diagram: The four panels are independent review axes: Reuse, Cost, Pressure, and Correctness. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Report bytes moved, useful prefetches, late restores, and avoided recompute.
How to read this diagram: This is a causal chain, not four unrelated symptoms. “HBM fills” triggers “Prefetches wait”, which in turn leads to “Queues grow”. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
The takeaway
Offloading lets capacity exceed HBM, but bandwidth becomes the new budget. A good design knows exactly which transfer is hidden and which one sits on the token path.
