Inference Is a Memory Problem: KV Cache, HBM, and the Real Cost of Long Context

The first time you size an LLM inference cluster, it is tempting to do the heroic spreadsheet thing: parameters, FLOPS, GPU count, tokens per second. Very crisp. Very executive. Also, often wrong.

For training, compute dominates the conversation. For inference, especially long-context and multi-turn inference, memory walks into the room, takes the chair at the head of the table, and starts asking uncomfortable questions.

The star of this awkward meeting is the KV cache.

Every generated token needs attention over previous tokens. To avoid recomputing the key and value tensors for the full prefix on every decode step, inference engines store them. That is the KV cache. It is brilliant. It is also hungry. The longer the context, the more layers, heads, concurrent sessions, and output tokens you serve, the more the cache becomes the thing you are actually operating.

In other words: the model weights get you into the game. The KV cache decides how many players can stay at the table.

The simple shape of the problem

The cache grows roughly with:

2 (keys and values) x layers x tokens x KV heads x head dimension x bytes per element x active sequences

That is not a pricing formula. That is a reality check.
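To make the reality check concrete, here is a back-of-the-envelope sizing sketch in Python. The model shape below (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption chosen to resemble a large grouped-query-attention model, not a quote from any spec sheet.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, sequences, bytes_per_elem=2):
    """Rough KV cache footprint: 2 (keys and values) x layers x tokens
    x KV heads x head dimension x bytes per element x active sequences."""
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_elem * sequences

# Assumed shape: 80 layers, 8 KV heads (grouped-query attention), head dim 128, FP16 cache.
per_session = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                             tokens=32_000, sequences=1)
print(f"one 32k-token session: {per_session / 2**30:.1f} GiB")        # ~9.8 GiB
print(f"16 concurrent sessions: {16 * per_session / 2**30:.1f} GiB")  # ~156 GiB
```

Sixteen such sessions need roughly 156 GiB of cache before a single weight is loaded, which is exactly why the formula reads like a reality check.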

If your workload is short prompts with short answers, you mostly care about getting the model weights loaded and keeping batches full. If your workload is retrieval-heavy chat, coding agents, support agents, or long document analysis, the cache becomes a second model-sized object that grows while the request is alive.

[Diagram: how memory is used during LLM inference. Model weights are a mostly fixed block, shared by requests and resized by quantization; the KV cache grows with context length, multiplies with active sequences, and can dominate long sessions. Short prompts: weights matter. Long chats: the cache starts driving.]
The part that grows while users talk is usually the part that ruins the spreadsheet.

HBM is the expensive apartment

High bandwidth memory is not just “GPU RAM.” It is the hot apartment next to the elevator. Everything wants to live there: weights, activations, temporary buffers, and KV cache. The catch is that HBM is finite, and decode is constantly reading the cache while generating one token after another.

This is why the H200 mattered: NVIDIA describes the H200 as having 141 GB of HBM3e with 4.8 TB/s of memory bandwidth, a meaningful jump over the H100 for memory-heavy inference. This is also why Blackwell conversations often include NVFP4, bigger systems, NVLink, and software like TensorRT-LLM and Dynamo in the same breath. The point is not just “more math.” The point is more useful tokens before memory becomes the wall.

There is a small trap here. More HBM does not automatically mean cheaper inference. It means you have more room to make intelligent scheduling decisions. If your scheduler fills memory with low-value cache blocks, the GPU becomes an expensive waiting room. If your router knows which prefixes are hot and your engine can retain the right blocks, the same hardware feels like it got a promotion.

Paging was the first big unlock

vLLM made the KV cache feel less like a giant contiguous reservation and more like virtual memory. PagedAttention partitions KV cache into blocks, reducing fragmentation and enabling sharing across requests. The Berkeley team reported 2-4x throughput improvements over earlier baselines on evaluated workloads, largely by making memory management less wasteful.

This is one reason vLLM became such an important reference point. It changed the mental model: the cache is not an implementation detail; it is a managed resource.
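A minimal way to picture the paged model is a per-sequence block table plus reference counts on physical blocks. The sketch below is an illustration of the idea, not vLLM's actual data structures or API; real engines add copy-on-write on divergence and eviction under pressure.

```python
BLOCK_TOKENS = 16  # tokens per KV block; illustrative size, not vLLM's default

class PagedKVCache:
    """Toy allocator: each sequence maps logical blocks to shared physical blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}        # physical block id -> sequences referencing it
        self.block_tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}         # sequence id -> tokens written so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % BLOCK_TOKENS == 0:          # last block full (or none yet): grab a fresh one
            block = self.free.pop()
            table.append(block)
            self.refcount[block] = 1
        self.lengths[seq_id] = n + 1

    def fork(self, parent_id, child_id):
        """Prefix sharing: a forked request reuses the parent's blocks until it diverges."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]
        for block in self.block_tables[child_id]:
            self.refcount[block] += 1
```

Because blocks are fixed-size and reached through a table, a sequence never needs one giant contiguous reservation, and two requests that share a prefix can point at the same physical blocks.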

TensorRT-LLM now has its own KV cache capabilities, including paged KV cache, quantized KV cache, reuse, and newer priority-based eviction APIs. The priority feature is especially interesting because it lets applications say, “This part of the context matters more. Keep it warm if you can.” For agents, that distinction becomes gold: system prompt and durable memory are worth retaining; a failed tool call from six steps ago is probably not.
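The retention API details live in TensorRT-LLM's documentation; as a neutral sketch of the underlying idea, eviction under pressure can simply drop the lowest-priority, least-recently-touched blocks first. The names and priority scale below are illustrative, not any library's interface.

```python
import heapq
import itertools

class PriorityEvictor:
    """Illustrative policy: evict lowest-priority blocks first, oldest-touched within a tier."""
    def __init__(self):
        self._heap = []                    # (priority, last_touched, block_id)
        self._clock = itertools.count()

    def touch(self, block_id, priority):
        # Higher priority = keep warm longer (system prompt, durable agent memory).
        heapq.heappush(self._heap, (priority, next(self._clock), block_id))

    def evict(self):
        # Under HBM pressure, pop the least important, coldest block.
        _, _, block_id = heapq.heappop(self._heap)
        return block_id

evictor = PriorityEvictor()
evictor.touch("system_prompt", priority=100)
evictor.touch("failed_tool_call_step_6", priority=1)
print(evictor.evict())  # -> "failed_tool_call_step_6"
```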

Offload is useful, but it is not magic

Once HBM fills up, the obvious move is to spill cache to CPU memory, local disk, or remote storage. NVIDIA Dynamo’s documentation and blogs talk about KV cache offloading across memory hierarchies, and that is the right architectural direction for large fleets.

But offload is not a coupon code. It trades capacity for latency. HBM is fastest. CPU memory is slower. NVMe and remote tiers are slower again. The question is not “Can I offload?” The question is “Can I predict which cache blocks I will need soon enough to prefetch them before the model stalls?”

[Diagram: KV cache memory hierarchy. HBM, CPU RAM, local SSD, and remote storage form tiers; hotter tiers are faster and smaller, colder tiers are larger and slower, and anything below HBM needs prefetching to stay useful.]
Offload helps when the router and runtime can predict reuse. Randomly paging cache is just latency wearing a nice jacket.
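One way to keep the prefetch question honest is to make the demotion decision explicit: a block only moves to a colder tier if it can be restored before its predicted next use. The tiers and latency numbers below are placeholders, not measurements from any system.

```python
# Placeholder restore latencies per KV block, in seconds (illustrative, not measured).
RESTORE_LATENCY = {"hbm": 0.0, "cpu": 0.002, "nvme": 0.02, "remote": 0.2}

def choose_tier(expected_reuse_in_s, safety_margin=2.0):
    """Demote a cold block only to a tier it can be prefetched back from in time.

    expected_reuse_in_s: predicted time until the block is needed again,
    e.g. from router statistics on how often a session comes back.
    """
    for tier in ("remote", "nvme", "cpu"):
        if RESTORE_LATENCY[tier] * safety_margin <= expected_reuse_in_s:
            return tier           # cheapest tier that can still hide the restore
    return "hbm"                  # reuse is imminent: keep it hot

print(choose_tier(5.0))     # idle chat session  -> "remote"
print(choose_tier(0.01))    # agent mid-loop     -> "cpu"
print(choose_tier(0.001))   # next decode step   -> "hbm"
```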

The operational lesson

The practical playbook is:

  1. Keep model weights small enough for the target hardware using appropriate precision.
  2. Treat KV cache as a first-class resource, not as leftover memory.
  3. Use prefix caching when prompts repeat (see the configuration sketch after this list).
  4. Route repeat sessions toward warm cache when it improves latency.
  5. Use priority eviction for durable context.
  6. Offload cold blocks only when prefetch can hide the latency.
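For prefix caching (item 3), vLLM exposes it as an engine option. A minimal sketch, assuming an offline vLLM setup; the model id, limits, and prompt are placeholders, and exact options vary by vLLM version.

```python
from vllm import LLM, SamplingParams

# Placeholder model id and limits. enable_prefix_caching lets repeated prompt
# prefixes (system prompts, shared document headers) reuse cached KV blocks.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
    max_model_len=32_768,             # cap context so the KV cache stays budgetable
    gpu_memory_utilization=0.90,      # fraction of HBM the engine may claim
)

shared_prefix = "You are a support agent for ExampleCo. Policy: ..."  # hypothetical prompt
outputs = llm.generate(
    [shared_prefix + "\nCustomer: my order is late."],
    SamplingParams(max_tokens=128),
)
```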

Design review questions

When this topic comes up in an architecture review, I would ask these before asking for more GPUs:

  • What is the p50/p95/p99 input context length by product surface?
  • How much HBM is weights, how much is KV cache, and how much is temporary workspace under peak concurrency? (A rough way to measure this is sketched after the list.)
  • Which prefixes repeat often enough to deserve cache priority?
  • What is the cache eviction policy when memory pressure hits?
  • Can the router distinguish durable context from one-off context?
  • What happens to TTFT (time to first token) when the cache hit rate drops by half?
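For the HBM breakdown question, a coarse first cut (assuming a PyTorch-based runtime) is to compare total device memory in use against what the weights alone account for; whatever remains is cache plus workspace. This is a sketch, not a substitute for the serving engine's own metrics.

```python
import torch

def hbm_breakdown(model, device=0):
    """Coarse split of device memory: weights vs. everything else (cache + workspace)."""
    weight_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    free, total = torch.cuda.mem_get_info(device)
    used = total - free
    return {
        "total_gib": total / 2**30,
        "weights_gib": weight_bytes / 2**30,
        "cache_and_workspace_gib": max(used - weight_bytes, 0) / 2**30,
        "free_gib": free / 2**30,
    }
```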

Those questions are uncomfortable in the useful way. They move the conversation from “the model is slow” to “this workload has a memory lifecycle we can optimize.”

The NVIDIA stack is strong here because the hardware and software are being co-designed around this exact pain: HBM, NVLink, TensorRT-LLM cache controls, NIM packaging, and Dynamo routing/offload are not separate features. They are different floors in the same building.

That does not mean every workload needs the full building. A single-model internal chatbot may run beautifully on vLLM with prefix caching and a sensible max context. But if you are building multi-tenant, long-context, agent-heavy inference, memory is not a footnote. Memory is the product surface.

The fun part is that once you see it, you cannot unsee it. Every latency spike starts looking like a cache story. Every “we need more GPUs” meeting becomes a “do we understand our KV lifecycle?” meeting. Slightly less dramatic, much cheaper.

Sources and receipts