Memory Offloading: Trading Bandwidth for Capacity

#memory-offloading #gpu-memory #cpu #nvme #llm-inference

GPU memory is fast and scarce. CPU memory is larger and slower. NVMe is larger again and slower still. Memory offloading lets a model or cache exceed HBM capacity by turning the memory hierarchy into an explicit part of inference.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A chef keeps today’s ingredients on the counter, tomorrow’s in the refrigerator, and bulk stock in the storeroom. Capacity grows, but every trip away from the counter adds delay unless the next ingredient is fetched early.

What actually happens

Weight offloading streams model layers from CPU RAM or NVMe into GPU memory, computes the layer, and releases or replaces it. Prefetching overlaps transfer of the next layer with compute on the current layer.
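The benefit of prefetching can be estimated with a small timeline model before touching real hardware. The sketch below is illustrative only: `stream_time` is a hypothetical helper, and it assumes a single copy stream, a single compute stream, exactly two GPU buffers, and per-layer times taken from profiling.

```python
def stream_time(transfer_s, compute_s, prefetch=True):
    """Estimate wall time for one pass over streamed layers.

    transfer_s[i] and compute_s[i] are per-layer times in seconds.
    With two buffers, layer i can start loading only after the previous
    load has finished and layer i-2 has released its buffer.
    """
    if len(transfer_s) != len(compute_s):
        raise ValueError("need one transfer and one compute time per layer")
    if not prefetch:
        # Fully serialized: load a layer, compute it, then load the next.
        return sum(transfer_s) + sum(compute_s)

    load_done = []
    comp_done = []
    for i, (t, c) in enumerate(zip(transfer_s, compute_s)):
        prev_load = load_done[i - 1] if i >= 1 else 0.0
        buffer_free = comp_done[i - 2] if i >= 2 else 0.0
        load_done.append(max(prev_load, buffer_free) + t)
        prev_comp = comp_done[i - 1] if i >= 1 else 0.0
        # Compute waits for both its own weights and the previous layer.
        comp_done.append(max(load_done[i], prev_comp) + c)
    return comp_done[-1] if comp_done else 0.0


# Compute-bound case: only the first transfer is exposed.
hidden = stream_time([1.0, 1.0, 1.0], [2.0, 2.0, 2.0])
# Transfer-bound case: the link sets the pace and compute waits.
exposed = stream_time([3.0, 3.0, 3.0], [1.0, 1.0, 1.0])
```

When compute per layer is longer than transfer, only the first load is exposed. When transfer is longer, total time approaches the sum of transfers and the GPU idles between layers.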

KV offloading moves cold or reusable cache blocks out of HBM while active decode blocks remain local. Reloading a prefix can still beat recomputing it, but only when transfer latency and bandwidth are favorable.
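The reload-versus-recompute comparison can be written down directly. This is a minimal sketch under stated assumptions: `reload_beats_recompute` is a hypothetical helper, the link is modeled as a fixed latency plus bytes over bandwidth, and recompute cost is approximated from a measured prefill throughput.

```python
def reload_beats_recompute(block_bytes, link_bytes_per_s, link_latency_s,
                           prefix_tokens, prefill_tokens_per_s):
    """Compare copying a KV prefix back into HBM with rebuilding it.

    Returns (reload_is_faster, reload_seconds, recompute_seconds).
    All inputs are assumed to come from measurement, not datasheets.
    """
    reload_s = link_latency_s + block_bytes / link_bytes_per_s
    recompute_s = prefix_tokens / prefill_tokens_per_s
    return reload_s < recompute_s, reload_s, recompute_s


# Illustrative numbers only: a 1 GB prefix, 4096 tokens long.
fast_link = reload_beats_recompute(1e9, 20e9, 0.001, 4096, 10_000)
slow_link = reload_beats_recompute(1e9, 1e9, 0.001, 4096, 10_000)
```

The same prefix can be worth reloading over a fast CPU-GPU link and not worth it over a slow or contended one, so the inputs should be measured under production-like traffic.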

Pinned host memory enables faster DMA but consumes a limited operating-system resource. NUMA placement, PCIe generation, Grace Hopper coherent links, NVMe queue depth, and concurrent traffic all change the result.

A worked example

A 70B-parameter BF16 model needs about 140 GB of weight storage (2 bytes per parameter). A single 80 GB GPU cannot hold the weights, let alone the KV cache as well. Offloading can stream weights from 256 GB of host memory, but every decode step revisits every layer, so the non-resident weights cross the link once per generated token. Without prefetch and a batch large enough to amortize transfers, TPOT is dominated by PCIe.
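The arithmetic behind this example is short enough to check in code. The numbers below are assumptions for illustration, not measurements: 60 GB of HBM is set aside for resident weights (the rest is left for KV cache and activations), and the effective host-to-GPU bandwidth is taken as 25 GB/s. `step_transfer_floor` is a hypothetical helper.

```python
PARAMS = 70e9
BYTES_PER_PARAM = 2                      # BF16
weight_bytes = PARAMS * BYTES_PER_PARAM  # about 140 GB


def step_transfer_floor(weight_bytes, resident_bytes, link_bytes_per_s,
                        batch_size=1):
    """Lower bound on transfer time per decode step.

    Returns (seconds_per_step, amortized_seconds_per_generated_token).
    Ignores software overhead and assumes no transfer is hidden.
    """
    streamed = max(0.0, weight_bytes - resident_bytes)
    step_s = streamed / link_bytes_per_s
    return step_s, step_s / batch_size


# Assumed: 60 GB of weights stay resident, 25 GB/s effective link.
single = step_transfer_floor(weight_bytes, 60e9, 25e9, batch_size=1)
batched = step_transfer_floor(weight_bytes, 60e9, 25e9, batch_size=32)
```

Under these assumptions 80 GB crosses the link on every step, which is about 3.2 seconds per token for a single sequence. A batch of 32 leaves the step time unchanged but spreads it over 32 generated tokens, about 0.1 seconds each.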

The performance model

Offloading is a capacity technique first. Throughput depends on whether transfer is hidden behind compute. A roofline-style check compares bytes transferred per step with effective link bandwidth and compute time available for overlap.

Expert lens

Weights and KV cache have different reuse patterns. Weights are read in a deterministic layer order every step; KV blocks are request-specific and may be cold for long periods. They need different prefetch and eviction policies.

Offloading moves the bottleneck from HBM capacity to link bandwidth and waiting time, so each policy should be judged by what it leaves on the token path.

Where it wins

Models or contexts that otherwise cannot fit
Batch throughput that amortizes weight transfer
Reusable KV prefixes on fast CPU-GPU links

Where it disappoints

Assuming capacity gain implies speed gain
Over-allocating pinned host memory
Ignoring NUMA and PCIe topology
Reloading KV blocks slower than recomputation

Production checklist

Measure effective transfer bandwidth
Separate weight and KV policies
Pin and place host memory deliberately
Prefetch against profiled compute windows
Test eviction storms under burst load

What to measure

HBM, host, and NVMe occupancy
Offload and reload bytes per second
Transfer overlap percentage
Cache reload versus recompute latency
Page faults, pinned bytes, and queue depth
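Transfer overlap percentage is the least standardized of these metrics, so it helps to fix a definition. One possible definition, sketched below, is the share of total transfer time that coincides with GPU compute. The interval format and the helper names are assumptions; real timestamps would come from a profiler trace.

```python
def _merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_percentage(transfer_intervals, compute_intervals):
    """Percent of transfer time that is hidden behind compute."""
    transfers = _merge(transfer_intervals)
    computes = _merge(compute_intervals)
    total = sum(end - start for start, end in transfers)
    if total == 0:
        return 0.0
    hidden = 0.0
    for t_start, t_end in transfers:
        for c_start, c_end in computes:
            # Length of the intersection, or zero if they do not meet.
            hidden += max(0.0, min(t_end, c_end) - max(t_start, c_start))
    return 100.0 * hidden / total


half_hidden = overlap_percentage([(0.0, 4.0)], [(2.0, 6.0)])
```

A value near 100 means transfers are almost entirely hidden. A low value paired with GPU idle gaps points at the link as the bottleneck.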

From one GPU to a production service

One process can treat host RAM as private. A production node may run multiple model replicas, networking agents, and storage daemons. A node-level allocator should assign pinned and pageable host budgets rather than letting each engine reserve optimistically.
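A node-level budget can be as simple as a ledger that grants or refuses reservations. The sketch below is an in-process illustration only: `HostBudget` and its methods are hypothetical names, and a real node would need a cross-process coordinator (for example a local daemon) rather than a Python lock.

```python
import threading


class HostBudget:
    """Tracks pinned and pageable host memory granted to engines."""

    def __init__(self, pinned_limit_bytes, pageable_limit_bytes):
        self._limits = {"pinned": pinned_limit_bytes,
                        "pageable": pageable_limit_bytes}
        self._used = {"pinned": 0, "pageable": 0}
        self._grants = {}  # (engine_id, kind) -> bytes granted
        self._lock = threading.Lock()

    def reserve(self, engine_id, kind, nbytes):
        """Grant nbytes if the node budget allows it; otherwise refuse."""
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        with self._lock:
            if self._used[kind] + nbytes > self._limits[kind]:
                return False
            self._used[kind] += nbytes
            key = (engine_id, kind)
            self._grants[key] = self._grants.get(key, 0) + nbytes
            return True

    def release(self, engine_id, kind):
        """Return everything an engine holds of one kind to the pool."""
        with self._lock:
            freed = self._grants.pop((engine_id, kind), 0)
            self._used[kind] -= freed
            return freed

    def used(self, kind):
        with self._lock:
            return self._used[kind]
```

An engine that is refused should shrink its cache or fall back to pageable memory rather than retry with the same request.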

Remote or distributed KV tiers add consistency and security requirements. Blocks need model identity, tenant salt, checksum, and expiry. A cache tier outage should cause recomputation or a controlled miss, not make inference unavailable.
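The four block attributes can be combined into a record that is validated on every read. This is a sketch under assumptions, not a prescribed format: `BlockRecord`, `block_key`, `make_record`, and `load_or_miss` are hypothetical names, SHA-256 is one reasonable choice of hash, and any failed check is treated as a miss so that the caller recomputes.

```python
import hashlib
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRecord:
    key: str           # binds model identity, tenant salt, and tokens
    checksum: str      # SHA-256 of the payload
    expires_at: float  # absolute time in seconds
    payload: bytes


def block_key(model_id, tenant_salt, token_ids):
    """Derive a key that differs across models and across tenants."""
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8") + b"\0")
    h.update(tenant_salt + b"\0")
    h.update(",".join(str(t) for t in token_ids).encode("ascii"))
    return h.hexdigest()


def make_record(model_id, tenant_salt, token_ids, payload, ttl_s, now=None):
    now = time.time() if now is None else now
    return BlockRecord(block_key(model_id, tenant_salt, token_ids),
                       hashlib.sha256(payload).hexdigest(),
                       now + ttl_s, payload)


def load_or_miss(record, model_id, tenant_salt, token_ids, now=None):
    """Return the payload, or None to signal a controlled miss."""
    now = time.time() if now is None else now
    if record is None:                       # tier outage or absent block
        return None
    if record.key != block_key(model_id, tenant_salt, token_ids):
        return None                          # wrong model, tenant, or prefix
    if now >= record.expires_at:
        return None                          # stale
    if hashlib.sha256(record.payload).hexdigest() != record.checksum:
        return None                          # corrupted in storage or transit
    return record.payload
```

Returning `None` for every failure mode, including a missing record, keeps the cache tier off the critical path: the engine recomputes the prefix instead of failing the request.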

Autoscaling offloaded models is slower because warmup includes weight movement and cache initialization. Keep warm capacity, pre-stage artifacts, and expose readiness only after the first production-shaped path succeeds.

Design-review questions

Who coordinates offload memory across processes?
What traffic is allowed to depend on remote cache?
Is reload faster than recompute for each block class?
How long does a cold replica take to become useful?
What happens when CPU or NVMe bandwidth is saturated?

How it connects to the rest of the series

KV caching supplies offloadable state. PagedAttention defines cache blocks. Prefill-decode disaggregation transfers KV between GPU pools, while mixed precision reduces every tier’s byte count.

From equation to implementation

For weight streaming, the necessary condition for fully hidden transfer is transfer_time(next layer) <= compute_time(current layer). If a layer has W bytes and effective link bandwidth B, transfer takes at least W/B before software overhead. Small batches shorten compute and make hiding harder.
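The condition above translates into a direct check, plus a rough estimate of the batch size needed to satisfy it. The sketch makes one strong assumption that should be validated by profiling: layer compute time grows linearly with batch size. That is optimistic at small batches, where a layer is often limited by HBM reads rather than arithmetic. The function names are hypothetical.

```python
import math


def transfer_is_hidden(layer_bytes, link_bytes_per_s, compute_s,
                       overhead_s=0.0):
    """True if loading the next layer fits inside the current compute."""
    transfer_s = layer_bytes / link_bytes_per_s + overhead_s
    return transfer_s <= compute_s


def min_batch_to_hide(layer_bytes, link_bytes_per_s, compute_s_per_sequence):
    """Smallest batch whose layer compute covers the next layer's transfer.

    Assumes compute_time(batch) = batch * compute_s_per_sequence.
    """
    transfer_s = layer_bytes / link_bytes_per_s
    return max(1, math.ceil(transfer_s / compute_s_per_sequence))


# Illustrative numbers: a 2 GB layer over a 1 GB/s link takes 2 s, so
# with 0.5 s of compute per sequence the batch must be at least 4.
needed = min_batch_to_hide(2e9, 1e9, 0.5)
```

If the required batch exceeds what latency targets or KV capacity allow, the transfer cannot be fully hidden and part of it lands on the token path.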

KV offload has a different decision: reload cost versus recompute cost versus cache miss penalty. A retention score can combine expected reuse probability, reload bytes, recompute FLOPs, tenant priority, and recency instead of using plain LRU.
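One way to combine these signals is a multiplicative score in which any factor near zero makes a block a good eviction candidate. The formula, the `KVBlock` fields, and the half-life default below are illustrative assumptions rather than a recommended policy. Reload bytes and recompute FLOPs are represented by the times they translate into on the target hardware.

```python
from dataclasses import dataclass


@dataclass
class KVBlock:
    block_id: str
    reuse_probability: float  # estimated chance of reuse, 0..1
    reload_s: float           # time to copy the block back into HBM
    recompute_s: float        # time to rebuild the block via prefill
    tenant_priority: float    # 1.0 = normal, higher = more important
    idle_s: float             # seconds since last access


def retention_score(block, half_life_s=60.0):
    """Higher scores mean the block is more worth keeping offloaded."""
    # Time saved by a reload compared with recomputing; never negative.
    saved_s = max(0.0, block.recompute_s - block.reload_s)
    # Exponential decay: the score halves every half_life_s of idleness.
    recency = 0.5 ** (block.idle_s / half_life_s)
    return (block.reuse_probability * saved_s
            * block.tenant_priority * recency)


def eviction_order(blocks):
    """Blocks sorted from least to most valuable."""
    return sorted(blocks, key=retention_score)


blocks = [
    KVBlock("hot", 0.9, 0.01, 0.5, 1.0, 0.0),
    KVBlock("cold", 0.1, 0.01, 0.5, 1.0, 120.0),
    KVBlock("cheap-to-recompute", 0.9, 0.5, 0.1, 1.0, 0.0),
]
order = [b.block_id for b in eviction_order(blocks)]
```

Unlike plain LRU, this ordering evicts a recently used block first when reloading it would be slower than recomputing it.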

Implementation sketch

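# Weight streaming with a ring of (lookahead + 1) GPU buffers.
ring = lookahead + 1
for j in range(min(lookahead, num_layers)):       # prime the pipeline
    launch_prefetch(j, target=buffers[j % ring])
x = input
for i in range(num_layers):
    if i + lookahead < num_layers:                # overlaps with compute below
        launch_prefetch(i + lookahead, target=buffers[(i + lookahead) % ring])
    wait(prefetched[i])                           # layer i's weights are now resident
    x = run_layer(i, x, buffers[i % ring])        # output feeds the next layer
    release_or_demote(i, buffers[i % ring])

# KV eviction: keep a host copy only when a later reload is worth its cost.
for kv_block in eviction_candidates:
    if expected_reload_value(kv_block) > offload_cost(kv_block):
        copy_async(kv_block, host_cache)          # free HBM only after the copy completes
    else:
        evict(kv_block)                           # drop it; recompute on a later miss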

Capacity planning

Reserve host memory for the operating system, page cache, network buffers, and other workers. Pinned memory is not ordinary RAM: pinned pages cannot be swapped out or reclaimed by the kernel, so excessive pinning can starve everything else on the node. NVMe endurance and shared bandwidth also matter for sustained batch workloads.

Benchmarking without fooling yourself

Measure effective bandwidth with concurrent production-like transfers.
Compare no offload, CPU offload, and NVMe offload at equal quality.
Sweep batch size to reveal transfer amortization.
Record compute-transfer overlap and GPU idle gaps.

A production failure to design for

Several replicas on one node independently allocate huge pinned host caches. The node remains below nominal RAM capacity but network and storage buffers cannot allocate, causing broad instability. Coordinate offload budgets at node level.

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Offloading lets capacity exceed HBM, but bandwidth becomes the new budget. A good design knows exactly which transfer is hidden and which one sits on the token path.