15/20 - Memory Offloading: Trading Bandwidth for Capacity

GPU memory is fast and scarce. CPU memory is larger and slower. NVMe is larger again and slower still. Memory offloading lets a model or cache exceed HBM capacity by turning the memory hierarchy into an explicit part of inference.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A chef keeps today’s ingredients on the counter, tomorrow’s in the refrigerator, and bulk stock in the storeroom. Capacity grows, but every trip away from the counter adds delay unless the next ingredient is fetched early.

[Diagram: Mechanism flow, memory offloading request path. (01) GPU HBM: use active weights, keep hot KV blocks. (02) Offload manager: prefetch next data, evict cold data. (03) CPU or NVMe: store larger state, return data on demand. Input → Transform → Outcome.]
Follow the state and work from left to right.

How to read this diagram: Start with GPU HBM, which holds the active weights and the hot KV blocks. The middle stage, the offload manager, prefetches the data needed next and evicts cold data. The final stage, CPU or NVMe, stores the larger state and returns data on demand. The arrows describe dependency order, not necessarily separate services.

What actually happens

Weight offloading streams model layers from CPU RAM or NVMe into GPU memory, computes the layer, and releases or replaces it. Prefetching overlaps transfer of the next layer with compute on the current layer.

KV offloading moves cold or reusable cache blocks out of HBM while active decode blocks remain local. Reloading a prefix can still beat recomputing it, but only when transfer latency and bandwidth are favorable.
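The reload-versus-recompute comparison can be sketched as a small estimate. This is a minimal illustration, not a measured model: the function names, the fixed-latency term, and every number in the usage note are assumptions chosen for the example.

```python
def reload_seconds(kv_bytes: float, link_bytes_per_s: float,
                   fixed_latency_s: float = 0.0) -> float:
    """Estimated time to copy a KV prefix back into HBM."""
    return fixed_latency_s + kv_bytes / link_bytes_per_s


def recompute_seconds(prefix_tokens: int, flops_per_token: float,
                      sustained_flops: float) -> float:
    """Estimated time to rebuild the same prefix by running prefill again."""
    return prefix_tokens * flops_per_token / sustained_flops


def should_reload(kv_bytes: float, link_bytes_per_s: float, prefix_tokens: int,
                  flops_per_token: float, sustained_flops: float,
                  fixed_latency_s: float = 0.0) -> bool:
    """True when the estimated reload is faster than the estimated recompute."""
    reload_s = reload_seconds(kv_bytes, link_bytes_per_s, fixed_latency_s)
    recompute_s = recompute_seconds(prefix_tokens, flops_per_token, sustained_flops)
    return reload_s < recompute_s
```

With hypothetical inputs (a 1 GB prefix of 4096 tokens, 1.4e11 FLOPs per token, 4e14 sustained FLOP/s), a 20 GB/s link favors reloading, while a congested 0.5 GB/s path favors recomputation. Both bandwidth and compute figures should come from measurement under production-like load.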

Pinned host memory enables faster DMA but consumes a limited operating-system resource. NUMA placement, PCIe generation, Grace Hopper coherent links, NVMe queue depth, and concurrent traffic all change the result.

A worked example

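A 70B-parameter BF16 model needs about 140 GB of weight storage (70 billion parameters at 2 bytes each). A single 80 GB GPU cannot hold it, let alone the KV cache on top. Offloading can stream weights from 256 GB of host memory, but in a dense model every generated token revisits all layers, so the non-resident weights cross the link on every decode step. Without prefetch and a large enough batch to amortize transfers, TPOT becomes dominated by PCIe transfer time.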
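The arithmetic behind this example can be written out as a short script. The 140 GB figure follows from the text; the 60 GB of HBM left for resident weights and the 25 GB/s effective host-to-GPU bandwidth are assumptions picked only to make the numbers concrete.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text


def weight_bytes(params: float, bytes_per_param: int) -> float:
    return params * bytes_per_param


def streamed_bytes_per_step(total_weight_bytes: float,
                            resident_weight_bytes: float) -> float:
    """Bytes that must cross the link on every decode step."""
    return max(0.0, total_weight_bytes - resident_weight_bytes)


def transfer_floor_s(streamed_bytes: float, link_bytes_per_s: float) -> float:
    """Lower bound on decode step time imposed by the link alone."""
    return streamed_bytes / link_bytes_per_s


total = weight_bytes(70e9, 2)                       # 70B parameters in BF16
streamed = streamed_bytes_per_step(total, 60 * GB)  # assumption: 60 GB resident
step_floor = transfer_floor_s(streamed, 25 * GB)    # assumption: 25 GB/s effective

for batch in (1, 8, 64):
    # One decode step yields `batch` tokens, one per sequence in the batch.
    print(f"batch={batch:3d}  step floor={step_floor:.2f} s  "
          f"transfer per generated token={step_floor / batch:.3f} s")
```

A larger batch does not shorten the step itself; it spreads the same transfer over more generated tokens, which is why batch throughput workloads tolerate weight streaming better than latency-sensitive ones.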

The performance model

Offloading is a capacity technique first. Throughput depends on whether transfer is hidden behind compute. A roofline-style check compares bytes transferred per step with effective link bandwidth and compute time available for overlap.
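A roofline-style check of this kind can be sketched in a few lines. The model below is deliberately simple and is an assumption, not a description of any particular engine: it treats one step as a single compute interval and a single transfer interval that either overlap perfectly or not at all.

```python
def step_time_s(compute_s: float, transfer_s: float, overlapped: bool = True) -> float:
    """Step time under perfect overlap (max) or no overlap (sum)."""
    return max(compute_s, transfer_s) if overlapped else compute_s + transfer_s


def exposed_transfer_s(compute_s: float, transfer_s: float) -> float:
    """Transfer time that cannot be hidden behind compute, even with perfect overlap."""
    return max(0.0, transfer_s - compute_s)


def limiting_resource(bytes_per_step: float, link_bytes_per_s: float,
                      compute_s: float) -> str:
    """Name the resource that bounds the step: the link or the compute."""
    transfer_s = bytes_per_step / link_bytes_per_s
    return "link" if transfer_s > compute_s else "compute"
```

If `limiting_resource` reports the link, adding prefetch depth will not help; only fewer bytes per step, a faster link, or more compute per step (for example a larger batch) changes the outcome.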

[Diagram: Phase fit, where memory offloading changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; moves or retains produced prompt state. Decode: one new token per iteration; weight and KV bandwidth pressure; late restores directly stall tokens. Prove it with: useful prefetch and restore latency. Deployment decision: tier only when reload beats recompute.]
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how memory offloading changes prompt processing and TTFT; the right panel asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Weights and KV cache have different reuse patterns. Weights are read in a deterministic layer order every step; KV blocks are request-specific and may be cold for long periods. They need different prefetch and eviction policies.

[Diagram: Trade-off map for memory offloading. Baseline, HBM-only serving: lowest access latency, strict capacity ceiling, simple ownership, high GPU memory cost. Optimized, tiered offloading: larger effective capacity, transfer and prefetch cost, complex eviction policy, uses CPU or NVMe tiers. Measure both sides under the same workload.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, HBM-only serving, characterized by lowest access latency and strict capacity ceiling. The right panel applies Tiered offloading, changing the cost profile to larger effective capacity and transfer and prefetch cost. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Models or contexts that otherwise cannot fit
  • Batch throughput that amortizes weight transfer
  • Reusable KV prefixes on fast CPU-GPU links

Where it disappoints

  • Assuming capacity gain implies speed gain
  • Over-allocating pinned host memory
  • Ignoring NUMA and PCIe topology
  • Reloading KV blocks slower than recomputation

Production checklist

  • Measure effective transfer bandwidth
  • Separate weight and KV policies
  • Pin and place host memory deliberately
  • Prefetch against profiled compute windows
  • Test eviction storms under burst load

What to measure

  • HBM, host, and NVMe occupancy
  • Offload and reload bytes per second
  • Transfer overlap percentage
  • Cache reload versus recompute latency
  • Page faults, pinned bytes, and queue depth
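Transfer overlap percentage can be derived from profiler timestamps. The sketch below assumes you can export (start, end) intervals for copy activity and for compute activity on a shared clock; the function names and the convention for an empty trace are illustrative choices.

```python
def merged(intervals):
    """Merge overlapping (start, end) intervals so no time is counted twice."""
    out = []
    for start, end in sorted(intervals):
        if out and start <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out


def overlap_percentage(transfer_intervals, compute_intervals) -> float:
    """Share of transfer time that ran while compute was also running."""
    transfers = merged(transfer_intervals)
    computes = merged(compute_intervals)
    total = sum(end - start for start, end in transfers)
    if total == 0:
        return 100.0  # convention: no transfer time means nothing was exposed
    hidden = 0.0
    for t_start, t_end in transfers:
        for c_start, c_end in computes:
            hidden += max(0.0, min(t_end, c_end) - max(t_start, c_start))
    return 100.0 * hidden / total
```

A low percentage during decode means transfers are sitting on the token path, which usually shows up as GPU idle gaps in the same trace.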

From one GPU to a production service

One process can treat host RAM as private. A production node may run multiple model replicas, networking agents, and storage daemons. A node-level allocator should assign pinned and pageable host budgets rather than letting each engine reserve optimistically.
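A node-level allocator of this kind might look like the following sketch. It is an in-process illustration only: the class name `NodeHostBudget` and its methods are hypothetical, and a real deployment would need the same bookkeeping behind a daemon or shared state that all engines on the node consult.

```python
from dataclasses import dataclass, field
from typing import Dict

GiB = 1024 ** 3


@dataclass
class NodeHostBudget:
    """Tracks pinned and pageable host memory granted to each engine on a node."""
    pinned_limit: int
    pageable_limit: int
    pinned_by_engine: Dict[str, int] = field(default_factory=dict)
    pageable_by_engine: Dict[str, int] = field(default_factory=dict)

    def reserve(self, engine: str, pinned: int, pageable: int) -> bool:
        """Grant the request only if both node-wide limits still hold."""
        if sum(self.pinned_by_engine.values()) + pinned > self.pinned_limit:
            return False
        if sum(self.pageable_by_engine.values()) + pageable > self.pageable_limit:
            return False
        self.pinned_by_engine[engine] = self.pinned_by_engine.get(engine, 0) + pinned
        self.pageable_by_engine[engine] = self.pageable_by_engine.get(engine, 0) + pageable
        return True

    def release(self, engine: str) -> None:
        """Return everything an engine holds, for example when a replica exits."""
        self.pinned_by_engine.pop(engine, None)
        self.pageable_by_engine.pop(engine, None)
```

The limits passed in should already exclude what the operating system, page cache, and networking stack need; an engine whose `reserve` call fails should start with a smaller cache rather than retry optimistically.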

Remote or distributed KV tiers add consistency and security requirements. Blocks need a model identity, tenant salt, checksum, and expiry. A cache tier outage should cause recomputation or a controlled miss, not make inference unavailable.
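One way to carry model identity, tenant salt, checksum, and expiry with each block is sketched below. The record layout and helper names are hypothetical; the only library calls used are Python's standard `hashlib` and `hmac`. Every failed check returns `None`, so the caller treats it as a miss and recomputes.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StoredBlock:
    key: str
    payload: bytes
    checksum: str
    expires_at: float


def block_key(model_id: str, tenant_salt: bytes, token_ids: Sequence[int]) -> str:
    """Key a block by model and prefix, keyed with the tenant salt so tenants never collide."""
    message = model_id.encode() + b"\x00" + ",".join(map(str, token_ids)).encode()
    return hmac.new(tenant_salt, message, hashlib.sha256).hexdigest()


def make_block(key: str, payload: bytes, now: float, ttl_s: float) -> StoredBlock:
    return StoredBlock(key, payload, hashlib.sha256(payload).hexdigest(), now + ttl_s)


def load_or_miss(block: Optional[StoredBlock], expected_key: str,
                 now: float) -> Optional[bytes]:
    """Return the payload only if key, expiry, and checksum all validate."""
    if block is None:
        return None  # tier outage or absent entry: controlled miss
    if not hmac.compare_digest(block.key, expected_key):
        return None  # wrong model, tenant, or prefix
    if now >= block.expires_at:
        return None  # expired
    if hashlib.sha256(block.payload).hexdigest() != block.checksum:
        return None  # corrupted in storage or transit
    return block.payload
```

Including the model identity in the key means a weight or tokenizer update naturally invalidates old blocks instead of serving stale state.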

Autoscaling offloaded models is slower because warmup includes weight movement and cache initialization. Keep warm capacity, pre-stage artifacts, and expose readiness only after the first production-shaped path succeeds.

Design-review questions

  • Who coordinates offload memory across processes?
  • What traffic is allowed to depend on remote cache?
  • Is reload faster than recompute for each block class?
  • How long does a cold replica take to become useful?
  • What happens when CPU or NVMe bandwidth is saturated?

How it connects to the rest of the series

KV caching supplies offloadable state. PagedAttention defines cache blocks. Prefill-decode disaggregation transfers KV between GPU pools, while mixed precision reduces every tier’s byte count.

From equation to implementation

For weight streaming, the necessary condition for fully hidden transfer is transfer_time(next layer) <= compute_time(current layer). If a layer has W bytes and effective link bandwidth B, transfer takes at least W/B before software overhead. Small batches shorten compute and make hiding harder.
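The hiding condition can be turned into a quick search for the smallest batch that covers the transfer. This is a sketch under stated assumptions: `compute_s_for_batch` stands for a profiled mapping from batch size to per-layer compute time, and the linear model in the usage note is invented for illustration.

```python
from typing import Callable, Optional


def layer_transfer_s(layer_bytes: float, link_bytes_per_s: float,
                     overhead_s: float = 0.0) -> float:
    """W / B plus any fixed software overhead per copy."""
    return overhead_s + layer_bytes / link_bytes_per_s


def is_hidden(layer_bytes: float, link_bytes_per_s: float, compute_s: float,
              overhead_s: float = 0.0) -> bool:
    """Necessary condition: the next layer's transfer fits inside the current compute."""
    return layer_transfer_s(layer_bytes, link_bytes_per_s, overhead_s) <= compute_s


def smallest_hiding_batch(layer_bytes: float, link_bytes_per_s: float,
                          compute_s_for_batch: Callable[[int], float],
                          max_batch: int = 4096) -> Optional[int]:
    """Try power-of-two batch sizes; return the first whose compute covers the transfer."""
    transfer_s = layer_transfer_s(layer_bytes, link_bytes_per_s)
    batch = 1
    while batch <= max_batch:
        if compute_s_for_batch(batch) >= transfer_s:
            return batch
        batch *= 2
    return None  # transfer cannot be hidden within the allowed batch range
```

For example, with a hypothetical 1.75 GB layer, a 25 GB/s link, and a made-up compute model of `0.002 + 0.0005 * batch` seconds, the transfer takes 0.07 s and the first power-of-two batch that covers it is 256. A result of `None` signals that batching alone will not hide the transfer.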

KV offload has a different decision: reload cost versus recompute cost versus cache miss penalty. A retention score can combine expected reuse probability, reload bytes, recompute FLOPs, tenant priority, and recency instead of using plain LRU.
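A retention score along these lines could be sketched as follows. The multiplicative form, the exponential recency decay, and the 60-second half-life are assumptions made for illustration; the inputs are the ones named above.

```python
from dataclasses import dataclass


@dataclass
class KVBlock:
    reuse_probability: float  # estimated chance the block is needed again
    reload_bytes: float       # bytes to copy back into HBM on a hit
    recompute_flops: float    # FLOPs to rebuild the block with prefill
    tenant_priority: float    # 1.0 is neutral; larger means more important
    idle_s: float             # seconds since the block was last used


def retention_score(block: KVBlock, link_bytes_per_s: float,
                    sustained_flops: float, half_life_s: float = 60.0) -> float:
    """Expected seconds saved by keeping the block, weighted by priority and recency."""
    reload_s = block.reload_bytes / link_bytes_per_s
    recompute_s = block.recompute_flops / sustained_flops
    saved_s = max(0.0, recompute_s - reload_s)  # zero when recompute is cheaper
    recency = 0.5 ** (block.idle_s / half_life_s)
    return block.tenant_priority * block.reuse_probability * recency * saved_s
```

Unlike plain LRU, a block that is cheap to recompute scores zero no matter how recently it was touched, so it is never kept at the expense of an expensive prefix.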

Implementation sketch

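# Weight streaming with two buffers: the next layer is copied while the current one computes.
for i, layer in enumerate(model_layers):
    wait_for(prefetched[i])                     # stalls only if the copy is late
    if i + 1 < len(model_layers):
        launch_prefetch(i + 1, target=inactive_buffer)
    hidden = run_layer(layer, hidden, active_buffer)
    release_or_demote(layer, active_buffer)
    active_buffer, inactive_buffer = inactive_buffer, active_buffer

# KV eviction: keep a host copy only when a later reload is worth the transfer.
for kv_block in eviction_candidates:
    if expected_reload_value(kv_block) > offload_cost(kv_block):
        copy_async(kv_block, host_cache)
    else:
        evict(kv_block)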
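The weight-streaming loop can be exercised without a GPU by simulating copies and compute with sleeps. The sketch below is a toy model, not an engine: `fetch_layer` and `run_layer` are hypothetical stand-ins, and a thread pool plays the role of an asynchronous copy stream.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_layer(index: int, copy_s: float) -> str:
    time.sleep(copy_s)  # stands in for a host-to-device weight copy
    return f"weights[{index}]"


def run_layer(weights: str, activations: int, compute_s: float) -> int:
    time.sleep(compute_s)  # stands in for the layer's GPU kernels
    return activations + 1


def stream_forward(num_layers: int, copy_s: float, compute_s: float,
                   lookahead: int = 1) -> int:
    """Run all layers while prefetching `lookahead` layers ahead of compute."""
    activations = 0
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        # Prime the pipeline with the first `lookahead` copies.
        pending = {i: pool.submit(fetch_layer, i, copy_s)
                   for i in range(min(lookahead, num_layers))}
        for i in range(num_layers):
            weights = pending.pop(i).result()  # blocks only when the copy is late
            nxt = i + lookahead
            if nxt < num_layers:
                pending[nxt] = pool.submit(fetch_layer, nxt, copy_s)
            activations = run_layer(weights, activations, compute_s)
    return activations
```

Timing `stream_forward` with `copy_s` smaller than `compute_s`, and again with it larger, shows the two regimes: in the first the wall time tracks compute, in the second it tracks the copies.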

Capacity planning

Reserve host memory for the operating system, page cache, network buffers, and other workers. Pinned memory is not ordinary RAM: pinned pages cannot be swapped out or reclaimed, so excessive pinning can starve everything else on the node. NVMe endurance and shared bandwidth also matter for sustained batch workloads.

Benchmarking without fooling yourself

  • Measure effective bandwidth with concurrent production-like transfers.
  • Compare no offload, CPU offload, and NVMe offload at equal quality.
  • Sweep batch size to reveal transfer amortization.
  • Record compute-transfer overlap and GPU idle gaps.

A production failure to design for

Several replicas on one node independently allocate huge pinned host caches. The node remains below nominal RAM capacity but network and storage buffers cannot allocate, causing broad instability. Coordinate offload budgets at node level.

[Diagram: Operating loop. (1) Inventory: HBM, host, and NVMe; link topology. (2) Classify: weights versus KV; hot versus cold. (3) Overlap: prefetch and double buffer; measure idle gaps. (4) Control: node-wide budgets; eviction value. Measure → Learn → Repeat.]
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Inventory to Classify, then Overlap and Control. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Offloading creates a hierarchy: HBM for imminent compute, host DRAM for warm state, local NVMe for colder state, and sometimes remote storage for durable artifacts. Moving a tensor is worthwhile only when transfer plus restore costs less than recomputation and completes before the request deadline.

[Diagram: A tiered memory decision. Classify: estimate reuse time, measure tensor size, know recompute cost. Evict: choose colder tier, reserve destination, copy asynchronously. Prefetch: predict next use, overlap transfer, validate ownership. Consume: wait only if needed, run compute, update temperature. Every move needs a deadline, bandwidth budget, and cancellation path.]
Offloading succeeds when placement anticipates future use rather than reacting after a miss.

How to read this diagram: Follow the state from Classify through Evict and Prefetch to Consume. Each box is an ownership or computation boundary. In particular, every move needs a deadline, bandwidth budget, and cancellation path. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Bandwidth is shared. Aggressive KV spill can compete with model loading, collectives, storage writes, and active decode. Limit concurrent transfers per link and prioritize state required by near-deadline requests. Pinned host memory improves DMA but is finite and affects the operating system.
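Limiting in-flight transfers per link and serving the nearest deadline first can be sketched with a heap. The `LinkScheduler` class and its method names are hypothetical; actual copy submission and completion callbacks are left out.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(order=True)
class Transfer:
    deadline_s: float                       # only field used for ordering
    name: str = field(compare=False)
    size_bytes: int = field(compare=False)


class LinkScheduler:
    """Caps concurrent transfers on one link and starts the earliest deadline first."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight: Set[str] = set()
        self.waiting: List[Transfer] = []

    def submit(self, transfer: Transfer) -> None:
        heapq.heappush(self.waiting, transfer)

    def start_ready(self) -> List[Transfer]:
        """Start as many waiting transfers as the concurrency cap allows."""
        started = []
        while self.waiting and len(self.in_flight) < self.max_in_flight:
            transfer = heapq.heappop(self.waiting)
            self.in_flight.add(transfer.name)
            started.append(transfer)
        return started

    def complete(self, name: str) -> None:
        self.in_flight.discard(name)
```

A background KV spill submitted with a distant deadline then waits behind a restore that an active decode needs, instead of competing with it for the link.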

[Diagram: Capacity increases by spending transfer time. Bars: HBM-only (fast, scarce) versus HBM + host tier (larger capacity). Latency risk: a late prefetch stalls the compute stream on transfer. Capacity gain: cold state stops evicting immediately useful tensors.]
The right tier depends on reuse distance and recomputation cost.

How to read this diagram: The bars compare HBM-only with HBM + host tier on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (cold state stops evicting immediately useful tensors) remains larger than the risk (a late prefetch stalls the compute stream on transfer) under production traffic.

Eviction value should include bytes, predicted reuse, restore latency, and recompute time. Plain LRU can retain a huge cheap-to-recompute tensor while evicting a small expensive prefix. Admission to a lower tier also needs limits or the tier becomes a slower unbounded queue.
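Admission limits for a lower tier can be as simple as two bounds, one on stored bytes and one on copies still in flight. The `LowerTier` class below is a hypothetical sketch; a rejected block is simply dropped and recomputed on a later miss.

```python
class LowerTier:
    """Host or NVMe tier with a byte capacity and a bound on queued copies."""

    def __init__(self, capacity_bytes: int, max_queued_copies: int):
        self.capacity_bytes = capacity_bytes
        self.max_queued_copies = max_queued_copies
        self.used_bytes = 0
        self.queued_copies = 0

    def try_admit(self, size_bytes: int) -> bool:
        """Reserve space and a copy slot, or refuse so the caller evicts outright."""
        if self.queued_copies >= self.max_queued_copies:
            return False  # link backlog is already at its bound
        if self.used_bytes + size_bytes > self.capacity_bytes:
            return False  # tier is full
        self.used_bytes += size_bytes
        self.queued_copies += 1
        return True

    def copy_finished(self) -> None:
        self.queued_copies -= 1

    def drop(self, size_bytes: int) -> None:
        self.used_bytes -= size_bytes
```

Refusing admission under backlog keeps the tier from turning into the slow unbounded queue described above.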

[Diagram: Offloaded object lifecycle. Resident (HBM owns state) → Migrating (copy in flight) → Cold (lower tier valid) → Restoring (prefetch before use). Only one authoritative version is writable; transfers publish atomically.]
Explicit ownership prevents stale reads during overlapping migration and compute.

How to read this diagram: State advances from Resident to Migrating, Cold, and finally Restoring. The labels below each state identify what becomes true at that boundary. The governing invariant is: Only one authoritative version is writable; transfers publish atomically. Retries and cancellation must preserve the same transition rules.
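The lifecycle can be enforced with an explicit transition table. The four states come from the diagram; the cancellation edges (Migrating back to Resident, Restoring back to Cold) and the rule that only a Resident object is writable are assumptions added to make the sketch complete.

```python
from enum import Enum


class State(Enum):
    RESIDENT = "resident"    # HBM owns the state
    MIGRATING = "migrating"  # copy to the lower tier in flight
    COLD = "cold"            # lower tier holds the valid copy
    RESTORING = "restoring"  # prefetch back to HBM in flight


ALLOWED = {
    State.RESIDENT: {State.MIGRATING},
    State.MIGRATING: {State.COLD, State.RESIDENT},   # RESIDENT: copy cancelled
    State.COLD: {State.RESTORING},
    State.RESTORING: {State.RESIDENT, State.COLD},   # COLD: restore cancelled
}


class OffloadedObject:
    def __init__(self) -> None:
        self.state = State.RESIDENT

    def transition(self, new_state: State) -> None:
        """Move to a new state, rejecting any edge not in the table."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def writable(self) -> bool:
        """Writes are allowed only while HBM is the single authoritative owner."""
        return self.state is State.RESIDENT
```

Routing retries and cancellations through the same `transition` call is what keeps them from bypassing the ownership rules.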

[Diagram: Four inputs to placement value. Reuse: probability and distance, request deadline. Cost: bytes and link time, recompute alternative. Pressure: HBM watermark, transfer concurrency. Correctness: version and ownership, cancellation cleanup. Report bytes moved, useful prefetches, late restores, and avoided recompute.]
Tiering policy should optimize useful work, not cache occupancy.

How to read this diagram: The four panels are independent review axes: Reuse, Cost, Pressure, and Correctness. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Report bytes moved, useful prefetches, late restores, and avoided recompute.

[Diagram: An offload storm can amplify overload. HBM fills: many objects evict, transfers saturate link. Prefetches wait: decode needs state, compute streams stall. Queues grow: retries add pressure, more state is spilled. Control: use hysteresis, bound migrations. Preserve an uncached or recompute path when the lower tier is unhealthy.]
Offload must degrade into extra compute, not an availability dependency.

How to read this diagram: This is a causal chain, not four unrelated symptoms. When HBM fills, many objects are evicted and transfers saturate the link. Prefetches then wait behind those transfers and compute streams stall, so queues grow, retries add pressure, and more state is spilled. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
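Hysteresis with a bounded migration rate can be sketched as a two-watermark controller. The `SpillController` class and the threshold values in the usage note are hypothetical; candidates are assumed to arrive already ordered by eviction value.

```python
from typing import List, Sequence


class SpillController:
    """Starts spilling above `high`, stops below `low`, and bounds migrations per tick."""

    def __init__(self, high: float, low: float, max_migrations_per_tick: int):
        if not 0.0 <= low < high <= 1.0:
            raise ValueError("need 0 <= low < high <= 1")
        self.high = high
        self.low = low
        self.max_migrations_per_tick = max_migrations_per_tick
        self.spilling = False

    def plan(self, hbm_used_fraction: float, candidates: Sequence[str]) -> List[str]:
        """Return the objects to migrate this tick, possibly none."""
        if hbm_used_fraction >= self.high:
            self.spilling = True
        elif hbm_used_fraction <= self.low:
            self.spilling = False
        # Between the watermarks the previous decision is kept, which avoids flapping.
        if not self.spilling:
            return []
        return list(candidates[: self.max_migrations_per_tick])
```

With watermarks at 0.9 and 0.8, a node hovering around 0.85 keeps whatever mode it was already in, and the per-tick cap keeps a sudden fill from saturating the link in one burst.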

The takeaway

Offloading lets capacity exceed HBM, but bandwidth becomes the new budget. A good design knows exactly which transfer is hidden and which one sits on the token path.