15/20 - Memory Offloading: Trading Bandwidth for Capacity

#memory-offloading #gpu-memory #cpu #nvme #llm-inference

GPU memory is fast and scarce. CPU memory is larger and slower. NVMe is larger again and slower still. Memory offloading lets a model or cache exceed HBM capacity by turning the memory hierarchy into an explicit part of inference.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A chef keeps today’s ingredients on the counter, tomorrow’s in the refrigerator, and bulk stock in the storeroom. Capacity grows, but every trip away from the counter adds delay unless the next ingredient is fetched early.

Follow the state and work from left to right.

Description: Start with GPU HBM, which holds the weights in active use. In the middle stage, the offload manager prefetches the data needed next. In the final stage, CPU memory or NVMe stores the larger state. The arrows describe dependency order, not necessarily separate services.

What actually happens

Weight offloading streams model layers from CPU RAM or NVMe into GPU memory, computes the layer, and releases or replaces it. Prefetching overlaps transfer of the next layer with compute on the current layer.

KV offloading moves cold or reusable cache blocks out of HBM while active decode blocks remain local. Reloading a prefix can still beat recomputing it, but only when transfer latency and bandwidth are favorable.
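The reload-versus-recompute comparison can be sketched as a simple time estimate. This is a minimal sketch, not a measured model: the function names, the 50 µs fixed latency, the link bandwidths, and the prefill rate are all illustrative assumptions. The 320 KiB per token figure is what an 80-layer model with 8 KV heads of dimension 128 in BF16 would store for keys plus values.

```python
def reload_seconds(kv_bytes: float, link_bytes_per_s: float, fixed_latency_s: float) -> float:
    """Time to bring a KV prefix back into HBM over the host link."""
    return fixed_latency_s + kv_bytes / link_bytes_per_s


def recompute_seconds(prefix_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the same prefix by running prefill again."""
    return prefix_tokens / prefill_tokens_per_s


def should_reload(kv_bytes, prefix_tokens, link_bytes_per_s,
                  fixed_latency_s, prefill_tokens_per_s) -> bool:
    """Reload only when it is estimated to be faster than recomputing."""
    reload_s = reload_seconds(kv_bytes, link_bytes_per_s, fixed_latency_s)
    return reload_s < recompute_seconds(prefix_tokens, prefill_tokens_per_s)


# Hypothetical workload: a 4,096-token prefix at 320 KiB of KV per token.
kv_bytes = 4096 * 320 * 1024
fast_link = should_reload(kv_bytes, 4096, 25e9, 50e-6, 10_000)  # assumed 25 GB/s link
slow_link = should_reload(kv_bytes, 4096, 2e9, 50e-6, 10_000)   # assumed 2 GB/s link
print("reload on fast link:", fast_link)
print("reload on slow link:", slow_link)
```

With these assumed numbers the same prefix is worth reloading on the faster link and worth recomputing on the slower one, which is why the decision belongs in measurement rather than in a fixed policy.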

Pinned host memory enables faster DMA but consumes a limited operating-system resource. NUMA placement, PCIe generation, Grace Hopper coherent links, NVMe queue depth, and concurrent traffic all change the result.

A worked example

A 70B BF16 model needs about 140 GB of weight storage (70 billion parameters at 2 bytes each). A single 80 GB GPU cannot hold it plus KV cache. Offloading can stream weights from 256 GB of host memory, but every generated token may revisit all layers. Without prefetch and a large enough batch to amortize transfers, time per output token (TPOT) becomes dominated by PCIe transfer time.
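The arithmetic behind this example can be written out directly. The sketch below is a lower bound under stated assumptions: the 20 GB of HBM reserved for KV cache and activations and the 25 GB/s effective link bandwidth are hypothetical values chosen for illustration, and the function name is not from any library.

```python
def weight_streaming_floor(params, bytes_per_param, hbm_bytes,
                           hbm_reserved_bytes, link_bytes_per_s):
    """Return (weight_bytes, streamed_bytes, min_transfer_seconds_per_decode_step)."""
    weight_bytes = params * bytes_per_param
    resident_bytes = hbm_bytes - hbm_reserved_bytes       # weights that can stay in HBM
    streamed_bytes = max(0.0, weight_bytes - resident_bytes)
    return weight_bytes, streamed_bytes, streamed_bytes / link_bytes_per_s


weights, streamed, step_s = weight_streaming_floor(
    params=70e9,
    bytes_per_param=2,          # BF16
    hbm_bytes=80e9,
    hbm_reserved_bytes=20e9,    # assumption: KV cache, activations, staging buffers
    link_bytes_per_s=25e9,      # assumption: effective host-to-GPU bandwidth
)
print(f"weights: {weights / 1e9:.0f} GB, streamed per step: {streamed / 1e9:.0f} GB")
for batch in (1, 8, 64):
    amortized_ms = step_s / batch * 1000
    print(f"batch {batch:>2}: {amortized_ms:.0f} ms of transfer per generated token")
```

Under these assumptions 80 GB must cross the link on every decode step, so the step cannot finish in less than 3.2 seconds. Batching spreads that cost over more generated tokens and raises throughput, but each request still waits for the whole step unless the transfer is hidden behind compute.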

The performance model

Offloading is a capacity technique first. Throughput depends on whether transfer is hidden behind compute. A roofline-style check compares bytes transferred per step with effective link bandwidth and compute time available for overlap.

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how memory offloading changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Weights and KV cache have different reuse patterns. Weights are read in a deterministic layer order every step; KV blocks are request-specific and may be cold for long periods. They need different prefetch and eviction policies.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, HBM-only serving, characterized by the lowest access latency and a strict capacity ceiling. The right panel applies tiered offloading, which trades that profile for larger effective capacity at the price of transfer and prefetch cost. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Models or contexts that otherwise cannot fit
Batch throughput that amortizes weight transfer
Reusable KV prefixes on fast CPU-GPU links

Where it disappoints

Assuming capacity gain implies speed gain
Over-allocating pinned host memory
Ignoring NUMA and PCIe topology
Reloading KV blocks slower than recomputation

Production checklist

Measure effective transfer bandwidth
Separate weight and KV policies
Pin and place host memory deliberately
Prefetch against profiled compute windows
Test eviction storms under burst load

What to measure

HBM, host, and NVMe occupancy
Offload and reload bytes per second
Transfer overlap percentage
Cache reload versus recompute latency
Page faults, pinned bytes, and queue depth
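Transfer overlap percentage can be derived from profiler timestamps. The sketch below assumes you already have (start, end) intervals for copies and for compute kernels on a common clock; the function names and the sample intervals are hypothetical.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals so no time is counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_percentage(transfers, computes):
    """Percent of total transfer time that coincides with compute."""
    transfers, computes = merge(transfers), merge(computes)
    total = sum(end - start for start, end in transfers)
    if total == 0:
        return 100.0  # no transfers, so nothing was exposed
    hidden = sum(
        max(0.0, min(t_end, c_end) - max(t_start, c_start))
        for t_start, t_end in transfers
        for c_start, c_end in computes
    )
    return 100.0 * hidden / total


# Hypothetical trace in milliseconds: two copies, one long compute window.
pct = overlap_percentage(transfers=[(0, 4), (10, 14)], computes=[(2, 12)])
print(f"{pct:.1f}% of transfer time was hidden behind compute")
```

The remainder is transfer time during which the GPU had no compute to run, which is the portion that lands on the token path.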

From one GPU to a production service

One process can treat host RAM as private. A production node may run multiple model replicas, networking agents, and storage daemons. A node-level allocator should assign pinned and pageable host budgets rather than letting each engine reserve optimistically.
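A node-level allocator of this kind can be very small. The following is a minimal sketch under assumptions: the class name, the one-grant-per-engine rule, and the GiB units are illustrative, and a real deployment would need inter-process coordination rather than an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class NodeHostBudget:
    """Tracks pinned and pageable host memory grants for all engines on one node."""
    pinned_limit: int
    pageable_limit: int
    grants: dict = field(default_factory=dict)  # engine_id -> (pinned, pageable)

    def _used(self, index: int) -> int:
        return sum(grant[index] for grant in self.grants.values())

    def request(self, engine_id: str, pinned: int, pageable: int) -> bool:
        """Grant the full request or nothing; engines must not reserve optimistically."""
        if engine_id in self.grants:
            return False  # assumption: one grant per engine, release before resizing
        if self._used(0) + pinned > self.pinned_limit:
            return False
        if self._used(1) + pageable > self.pageable_limit:
            return False
        self.grants[engine_id] = (pinned, pageable)
        return True

    def release(self, engine_id: str) -> None:
        self.grants.pop(engine_id, None)


# Hypothetical node: 64 GiB may be pinned, 128 GiB may be used as pageable cache.
budget = NodeHostBudget(pinned_limit=64, pageable_limit=128)
first = budget.request("replica-a", pinned=40, pageable=32)
second = budget.request("replica-b", pinned=40, pageable=32)   # would exceed pinned limit
third = budget.request("replica-b", pinned=24, pageable=32)
print(first, second, third)
```

The second request is refused rather than partially granted, so a replica learns at startup that it must run with a smaller cache instead of discovering the shortage under load.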

Remote or distributed KV tiers add consistency and security requirements. Blocks need model identity, tenant salt, checksum, and expiry. A cache tier outage should cause recomputation or a controlled miss, not make inference unavailable.
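One way to carry those four fields is to bind them into the block's key and metadata and to treat every validation failure as a miss. This is a sketch only: the record layout, the SHA-256 choice, and the function names are assumptions, not the format of any particular cache.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CachedBlock:
    key: str           # binds the block to a model, tenant, and token prefix
    payload: bytes     # serialized KV data
    checksum: str      # SHA-256 of payload, verified on every load
    expires_at: float  # absolute expiry time in seconds


def block_key(model_id: str, tenant_salt: bytes, token_ids: Sequence[int]) -> str:
    h = hashlib.sha256()
    for part in (model_id.encode(), tenant_salt, ",".join(map(str, token_ids)).encode()):
        h.update(len(part).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        h.update(part)
    return h.hexdigest()


def store(model_id, tenant_salt, token_ids, payload: bytes, ttl_s: float, now: float) -> CachedBlock:
    return CachedBlock(
        key=block_key(model_id, tenant_salt, token_ids),
        payload=payload,
        checksum=hashlib.sha256(payload).hexdigest(),
        expires_at=now + ttl_s,
    )


def load(block: Optional[CachedBlock], model_id, tenant_salt, token_ids, now: float) -> Optional[bytes]:
    """Return the payload, or None as a controlled miss so the caller recomputes."""
    if block is None or now >= block.expires_at:
        return None  # tier outage or expired entry
    if block.key != block_key(model_id, tenant_salt, token_ids):
        return None  # wrong model, tenant, or prefix
    if hashlib.sha256(block.payload).hexdigest() != block.checksum:
        return None  # corrupted in transit or at rest
    return block.payload


blk = store("model-a@rev1", b"tenant-1", [1, 2, 3], b"kv-bytes", ttl_s=60, now=0.0)
print(load(blk, "model-a@rev1", b"tenant-1", [1, 2, 3], now=10.0) is not None)
print(load(blk, "model-a@rev1", b"tenant-2", [1, 2, 3], now=10.0) is None)
```

Because `load` never raises on a bad or missing block, an outage in the cache tier degrades into recomputation rather than a failed request.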

Autoscaling offloaded models is slower because warmup includes weight movement and cache initialization. Keep warm capacity, pre-stage artifacts, and expose readiness only after the first production-shaped path succeeds.

Design-review questions

Who coordinates offload memory across processes?
What traffic is allowed to depend on remote cache?
Is reload faster than recompute for each block class?
How long does a cold replica take to become useful?
What happens when CPU or NVMe bandwidth is saturated?

How it connects to the rest of the series

KV caching supplies offloadable state. PagedAttention defines cache blocks. Prefill-decode disaggregation transfers KV between GPU pools, while mixed precision reduces every tier’s byte count.

From equation to implementation

For weight streaming, the necessary condition for fully hidden transfer is transfer_time(next layer) <= compute_time(current layer). If a layer has W bytes and effective link bandwidth B, transfer takes at least W/B before software overhead. Small batches shorten compute and make hiding harder.
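The condition can be checked against profiled numbers. The sketch below takes measured per-layer compute times by batch size as input, because decode compute does not necessarily scale linearly with batch; the timings, the 1.75 GB layer size (140 GB spread over 80 layers), and the 25 GB/s bandwidth are hypothetical values for illustration.

```python
def transfer_time_s(layer_bytes: float, link_bytes_per_s: float, overhead_s: float = 0.0) -> float:
    """Lower bound W/B plus any measured software overhead."""
    return layer_bytes / link_bytes_per_s + overhead_s


def is_hidden(layer_bytes, link_bytes_per_s, compute_time_s, overhead_s=0.0) -> bool:
    """True when the next layer's transfer fits inside the current layer's compute."""
    return transfer_time_s(layer_bytes, link_bytes_per_s, overhead_s) <= compute_time_s


def smallest_hiding_batch(layer_bytes, link_bytes_per_s, measured_compute_s):
    """Smallest profiled batch size whose per-layer compute hides the transfer, else None."""
    for batch in sorted(measured_compute_s):
        if is_hidden(layer_bytes, link_bytes_per_s, measured_compute_s[batch]):
            return batch
    return None


# Hypothetical profile: per-layer compute seconds at each batch size.
profile = {1: 0.002, 16: 0.004, 128: 0.03, 512: 0.09}
batch = smallest_hiding_batch(layer_bytes=1.75e9, link_bytes_per_s=25e9,
                              measured_compute_s=profile)
print("smallest batch that hides the transfer:", batch)
```

If the function returns None for every batch size you can actually serve, the transfer sits on the token path and no amount of prefetch scheduling will hide it.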

KV offload has a different decision: reload cost versus recompute cost versus cache miss penalty. A retention score can combine expected reuse probability, reload bytes, recompute FLOPs, tenant priority, and recency instead of using plain LRU.
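A retention score along these lines might look like the sketch below. The multiplicative form, the exponential recency decay, the half-life, and the hardware rates used to turn bytes and FLOPs into seconds are all assumptions chosen for illustration; the right weighting has to come from your own traces.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    reuse_probability: float   # estimated chance the block is requested again
    reload_bytes: float        # bytes to move back into HBM
    recompute_flops: float     # work needed to rebuild the block from tokens
    tenant_priority: float     # relative weight, 1.0 = default
    seconds_since_use: float


def retention_score(block: BlockStats, link_bytes_per_s: float,
                    flops_per_s: float, half_life_s: float = 60.0) -> float:
    """Expected seconds saved by keeping the block, weighted by priority and recency."""
    reload_s = block.reload_bytes / link_bytes_per_s
    recompute_s = block.recompute_flops / flops_per_s
    saved_s = max(0.0, recompute_s - reload_s)          # no value if reload is slower
    recency = 0.5 ** (block.seconds_since_use / half_life_s)
    return block.reuse_probability * saved_s * block.tenant_priority * recency


# Hypothetical blocks: a small, expensive, recently used prefix versus a large, cheap, idle one.
small_hot = BlockStats(0.8, 0.5e9, 5e14, 1.0, 10.0)
big_cold = BlockStats(0.1, 8e9, 1e14, 1.0, 300.0)
scores = {
    "small_hot": retention_score(small_hot, link_bytes_per_s=25e9, flops_per_s=2e14),
    "big_cold": retention_score(big_cold, link_bytes_per_s=25e9, flops_per_s=2e14),
}
print(sorted(scores, key=scores.get, reverse=True))
```

Unlike plain LRU, this ordering keeps the small prefix that is expensive to rebuild even when the large block occupies far more memory.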

Implementation sketch

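# Weight streaming with a double buffer (assumes lookahead = 1; helpers are placeholders).
hidden = input
for i, layer in enumerate(model_layers):
    await prefetched[i]                        # layer i's weights must be resident
    if i + lookahead < len(model_layers):      # do not prefetch past the last layer
        launch_prefetch(i + lookahead, target=inactive_buffer)
    hidden = run_layer(layer, hidden, active_buffer)   # feed each output to the next layer
    release_or_demote(layer, active_buffer)
    active_buffer, inactive_buffer = inactive_buffer, active_buffer

# KV eviction: keep a host copy only when a later reload is expected to pay off.
for kv_block in eviction_candidates:
    if expected_reload_value(kv_block) > offload_cost(kv_block):
        copy_async(kv_block, host_cache)
    else:
        evict(kv_block)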
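To see the prefetch loop run end to end, here is a toy simulation with asyncio. It is only an illustration of the control flow: `asyncio.sleep` stands in for host-to-device copies and layer compute, and the function names and timings are hypothetical.

```python
import asyncio


async def fetch_layer(i: int, transfer_s: float, log: list) -> None:
    await asyncio.sleep(transfer_s)      # stand-in for an asynchronous weight copy
    log.append(("loaded", i))


async def run_layer(i: int, compute_s: float, log: list) -> None:
    await asyncio.sleep(compute_s)       # stand-in for the layer's GPU compute
    log.append(("computed", i))


async def stream_layers(num_layers: int, transfer_s: float, compute_s: float,
                        lookahead: int = 1) -> list:
    log = []
    # Start the first `lookahead` copies before any compute begins.
    prefetched = {
        i: asyncio.create_task(fetch_layer(i, transfer_s, log))
        for i in range(min(lookahead, num_layers))
    }
    for i in range(num_layers):
        await prefetched.pop(i)          # block until layer i is resident
        nxt = i + lookahead
        if nxt < num_layers:             # overlap the next copy with this layer's compute
            prefetched[nxt] = asyncio.create_task(fetch_layer(nxt, transfer_s, log))
        await run_layer(i, compute_s, log)
    return log


log = asyncio.run(stream_layers(num_layers=4, transfer_s=0.01, compute_s=0.05))
for event in log:
    print(event)
```

When the simulated transfer is shorter than the simulated compute, the log will typically show each next layer loaded before the current one finishes computing. Swap the two durations and the loop spends most of its time waiting on `prefetched`, which is the stall the inequality above predicts.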

Capacity planning

Reserve host memory for the operating system, page cache, network buffers, and other workers. Pinned memory is not ordinary RAM: pinned pages cannot be paged out or reclaimed, so excessive pinning can starve everything else on the node. NVMe endurance and shared bandwidth also matter for sustained batch workloads.

Benchmarking without fooling yourself

Measure effective bandwidth with concurrent production-like transfers.
Compare no offload, CPU offload, and NVMe offload at equal quality.
Sweep batch size to reveal transfer amortization.
Record compute-transfer overlap and GPU idle gaps.

A production failure to design for

Several replicas on one node independently allocate huge pinned host caches. The node remains below nominal RAM capacity but network and storage buffers cannot allocate, causing broad instability. Coordinate offload budgets at node level.

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Inventory to Classify, then Overlap and Control. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Offloading creates a hierarchy: HBM for imminent compute, host DRAM for warm state, local NVMe for colder state, and sometimes remote storage for durable artifacts. Moving a tensor is worthwhile only when transfer plus restore costs less than recomputation and completes before the request deadline.

Offloading succeeds when placement anticipates future use rather than reacting after a miss.

Description: Follow the state from Classify through Evict and Prefetch to Consume. Each box is an ownership or computation boundary. In particular, every move needs a deadline, bandwidth budget, and cancellation path. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Bandwidth is shared. Aggressive KV spill can compete with model loading, collectives, storage writes, and active decode. Limit concurrent transfers per link and prioritize state required by near-deadline requests. Pinned host memory improves DMA but is finite and affects the operating system.
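The two controls named here, a cap on concurrent transfers per link and priority for near-deadline state, can be sketched with a heap per link. The class, the link name, and the transfer names are hypothetical, and the sketch only decides ordering; it does not issue real copies.

```python
import heapq
from collections import defaultdict


class TransferScheduler:
    """Earliest-deadline-first admission with a per-link in-flight limit."""

    def __init__(self, max_in_flight_per_link: int):
        self.max_in_flight = max_in_flight_per_link
        self.pending = defaultdict(list)    # link -> heap of (deadline_s, seq, name)
        self.in_flight = defaultdict(int)   # link -> transfers currently running
        self._seq = 0                       # tie-breaker keeps submission order stable

    def submit(self, link: str, name: str, deadline_s: float) -> None:
        heapq.heappush(self.pending[link], (deadline_s, self._seq, name))
        self._seq += 1

    def start_ready(self, link: str) -> list:
        """Start as many pending transfers as the limit allows, nearest deadline first."""
        started = []
        while self.pending[link] and self.in_flight[link] < self.max_in_flight:
            _, _, name = heapq.heappop(self.pending[link])
            self.in_flight[link] += 1
            started.append(name)
        return started

    def complete(self, link: str) -> None:
        if self.in_flight[link] > 0:
            self.in_flight[link] -= 1


sched = TransferScheduler(max_in_flight_per_link=2)
sched.submit("pcie0", "kv-spill", deadline_s=5.0)
sched.submit("pcie0", "decode-weights", deadline_s=0.05)
sched.submit("pcie0", "model-load", deadline_s=30.0)
first_wave = sched.start_ready("pcie0")
sched.complete("pcie0")
second_wave = sched.start_ready("pcie0")
print(first_wave, second_wave)
```

The weights needed by active decode start first even though the KV spill was submitted earlier, and the bulk model load waits until a slot frees up.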

The right tier depends on reuse distance and recomputation cost.

Description: The bars compare HBM-only with HBM + host tier on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (“Cold state stops evicting immediately useful tensors”) remains larger than the risk (“A late prefetch stalls the compute stream on transfer”) under production traffic.

Eviction value should include bytes, predicted reuse, restore latency, and recompute time. Plain LRU can retain a huge cheap-to-recompute tensor while evicting a small expensive prefix. Admission to a lower tier also needs limits or the tier becomes a slower unbounded queue.

Explicit ownership prevents stale reads during overlapping migration and compute.

Description: State advances from Resident to Migrating, Cold, and finally Restoring. The labels below each state identify what becomes true at that boundary. The governing invariant is: Only one authoritative version is writable; transfers publish atomically. Retries and cancellation must preserve the same transition rules.
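The four states can be enforced with an explicit transition table. In the sketch below, the two cancellation edges (Migrating back to Resident, Restoring back to Cold) and the rule that a block is writable only while Resident are assumptions about how the invariant would be applied, not details stated in the figure.

```python
from enum import Enum


class Tier(Enum):
    RESIDENT = "resident"     # authoritative copy lives in HBM
    MIGRATING = "migrating"   # copy to the lower tier is in flight
    COLD = "cold"             # authoritative copy lives in the lower tier
    RESTORING = "restoring"   # copy back to HBM is in flight


ALLOWED = {
    Tier.RESIDENT: {Tier.MIGRATING},
    Tier.MIGRATING: {Tier.COLD, Tier.RESIDENT},    # publish, or cancel (assumed edge)
    Tier.COLD: {Tier.RESTORING},
    Tier.RESTORING: {Tier.RESIDENT, Tier.COLD},    # publish, or cancel (assumed edge)
}


class Block:
    def __init__(self) -> None:
        self.state = Tier.RESIDENT

    def advance(self, new_state: Tier) -> None:
        """Apply one transition; retries and cancellation go through the same table."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    @property
    def writable(self) -> bool:
        # Assumption: writes are allowed only while no transfer is in flight.
        return self.state is Tier.RESIDENT


block = Block()
block.advance(Tier.MIGRATING)
block.advance(Tier.COLD)
block.advance(Tier.RESTORING)
block.advance(Tier.RESIDENT)
print(block.state.name, block.writable)
```

Routing every move through `advance` means a skipped step, such as reading a block as Resident while it is still Cold, fails loudly instead of producing a stale read.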

Tiering policy should optimize useful work, not cache occupancy.

Description: The four panels are independent review axes: Reuse, Cost, Pressure, and Correctness. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Report bytes moved, useful prefetches, late restores, and avoided recompute.

Offload must degrade into extra compute, not an availability dependency.

Description: This is a causal chain, not four unrelated symptoms. HBM filling up forces prefetches to wait, which in turn makes queues grow. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Offloading lets capacity exceed HBM, but bandwidth becomes the new budget. A good design knows exactly which transfer is hidden and which one sits on the token path.
