
The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You

The easiest way to make LLM caching confusing is to use the word “cache” without saying what is being cached.

Are we caching input tokens? A provider-side prompt prefix? A retrieved context bundle? A full answer? A vector match to an earlier question? A KV cache block sitting on a GPU? Those are different systems with different failure modes.

This matters because caching is not just a cost trick. It changes the product contract. The user expects a correct answer, not a museum exhibit of what the model said last Tuesday.

So let us separate the layers.

The four caches people accidentally mix together

| Layer | What it reuses | Main win | Main risk |
| --- | --- | --- | --- |
| Exact response cache | Same normalized request | Cheapest hit | Low coverage |
| Semantic response cache | Same intent, different words | Big cost reduction | False positive, stale answer |
| Prompt/context cache | Same input prefix sent to model | Lower input-token cost and latency | Still generates output |
| KV/prefix cache | Internal attention state | Faster prefill and routing locality | Memory pressure, tenant boundaries |

Prompt caching and semantic caching sound similar because both avoid repeated work. They are not substitutes.

Prompt caching says: “I have already processed this prefix.” It can reduce cost and latency for repeated input tokens. OpenAI’s prompt caching, for example, is automatic for supported models when prompts exceed a prefix threshold, and the API exposes cached input tokens in usage details. Google’s Gemini context caching similarly lets applications store reusable context and reference it later. This is great for repeated system prompts, tool schemas, policy packs, and long static documents.
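
For reference, a minimal sketch of reading those usage details with the OpenAI Python SDK. The model name and prompt contents are placeholders, and the exact field names can vary by API version; in current versions the cached count is reported under usage.prompt_tokens_details.

```python
# Minimal sketch: how many input tokens were served from the prompt cache.
# Model name and prompt contents are placeholders.
from openai import OpenAI

LONG_STABLE_SYSTEM_PROMPT = "..."  # in practice, a long stable prefix (1K+ tokens)
user_question = "How do I reset SSO?"

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any model with prompt caching support
    messages=[
        {"role": "system", "content": LONG_STABLE_SYSTEM_PROMPT},  # stable prefix
        {"role": "user", "content": user_question},                # volatile suffix
    ],
)

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) or 0
print(f"input tokens: {usage.prompt_tokens}, served from cache: {cached}")
```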

Semantic caching says: “This new question means the same thing as one I answered before.” Redis LangCache and semantic cache patterns use embeddings and similarity search to reuse a prior answer when the match is strong enough. That can remove the model call entirely. It can also be wrong in a much more interesting way.
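
The lookup pattern is simple even if the production pieces (a vector index, an embedding model) are not. Here is a toy in-memory sketch of the pattern, not the LangCache API; `embed_fn` and the threshold are assumptions.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.93  # assumption; tune per answer class

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    """Toy in-memory semantic cache: a linear scan over (embedding, answer) pairs.
    Production systems use a vector index (e.g. Redis) instead of a scan."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # any text -> np.ndarray embedding function
        self.entries = []          # list of (embedding, answer) tuples

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed_fn(question), answer))

    def get(self, question: str):
        q = self.embed_fn(question)
        best_score, best_answer = -1.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= SIMILARITY_THRESHOLD else None
```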

[Figure: Caching is a stack, not one feature flag. A request first checks the exact and semantic response caches (hash hit, then gated meaning hit); on a hit it returns with no model call, otherwise it falls through to prompt caching and KV/prefix reuse and the model still generates.]
Response caching avoids generation. Prompt and runtime caching optimize the generation path that remains.

Prompt caching: boring, useful, and not magic

Prompt caching works best when a long prefix repeats exactly. Provider caches match on the leading tokens, so reuse stops at the first token that differs. Typical candidates:

  • system prompt
  • tool schemas
  • policy pack
  • long product documentation
  • few-shot examples
  • static coding repository context
  • fixed RAG preamble

The cost model is straightforward. If a provider discounts cached input tokens, or a runtime reuses prefix state, repeated long prefixes become cheaper and faster. But the model still generates output. If the user asks the same support question 10,000 times, prompt caching may reduce the input side; semantic response caching can avoid the generation completely.
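
To make that concrete, here is a back-of-the-envelope sketch. All prices and the cached-token discount are made-up assumptions, not any provider's real rates.

```python
# Illustrative arithmetic only; the prices and discount are assumptions, not real rates.
INPUT_PRICE_PER_1K = 0.0025    # $ per 1K uncached input tokens (assumed)
CACHED_DISCOUNT = 0.50         # assumed discount on cached input tokens
OUTPUT_PRICE_PER_1K = 0.0100   # $ per 1K output tokens (assumed)

prefix_tokens = 6_000    # stable system prompt + tool schemas + policy pack
suffix_tokens = 200      # volatile user question
output_tokens = 400
requests = 10_000

def total_cost(prefix_cached: bool) -> float:
    prefix_rate = INPUT_PRICE_PER_1K * ((1 - CACHED_DISCOUNT) if prefix_cached else 1.0)
    per_request = (
        prefix_tokens / 1000 * prefix_rate
        + suffix_tokens / 1000 * INPUT_PRICE_PER_1K
        + output_tokens / 1000 * OUTPUT_PRICE_PER_1K   # output cost is untouched
    )
    return per_request * requests

print(f"no prompt cache:   ${total_cost(False):,.2f}")
print(f"with prompt cache: ${total_cost(True):,.2f}")
```

The output line is the point: only the input side shrinks; generation still costs the same.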

Prompt caching is also sensitive to ordering. If you put highly variable user text before the stable instruction block, the cacheable prefix ends at the first variable token and almost nothing is reusable. Put stable content first. Put volatile content later.

Good prompt layout:

system instructions
tool schemas
stable policy context
retrieved documents
user question

Bad prompt layout:

user question
session noise
tool schemas
stable policy context

The second layout may be semantically equivalent, but it is worse for prefix reuse.
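
A sketch of the stable-first layout in code; the constants and the function name are placeholders.

```python
SYSTEM_INSTRUCTIONS = "You are the support assistant. Follow the policy pack."  # stable
TOOL_SCHEMAS_TEXT = "<tool schemas rendered as text>"                            # stable
POLICY_PACK = "<policy pack v7>"                                                 # stable, versioned

def build_messages(user_question: str, retrieved_docs: list[str]) -> list[dict]:
    """Stable-first layout: everything before the retrieved documents is
    byte-identical across requests, so it forms one long cacheable prefix."""
    stable_prefix = [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "system", "content": TOOL_SCHEMAS_TEXT},
        {"role": "system", "content": POLICY_PACK},
    ]
    volatile_suffix = [
        {"role": "system", "content": "\n\n".join(retrieved_docs)},  # changes per query
        {"role": "user", "content": user_question},                  # changes per query
    ]
    return stable_prefix + volatile_suffix
```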

Semantic caching: where the money is, and where the risk lives

Semantic caching is attractive because users rarely repeat text exactly. They ask:

"How do I reset SSO?"
"Where can I rotate SAML certificates?"
"We need to update Okta metadata. What is the process?"

These might map to one canonical answer. An embedding search can find that relationship even when exact text does not match.

But semantic caching must be gated. It is not enough to say “cosine similarity is 0.91.” The cache hit also needs:

  • same tenant or public scope
  • same locale
  • same product/version
  • same permission boundary
  • same source document version
  • same answer type
  • freshness still valid
  • no policy override

If any of those fail, the system should miss. A false miss costs tokens. A false hit costs credibility.
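
One way to encode those gates, as a sketch; the field names, threshold, and TTL are assumptions. Similarity is necessary but never sufficient.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheScope:
    tenant: str
    locale: str
    product_version: str
    permission_tier: str
    source_doc_hash: str
    policy_version: str

@dataclass
class CachedAnswer:
    scope: CacheScope
    answer: str
    created_at: float   # unix seconds

def is_servable(hit: CachedAnswer, request_scope: CacheScope,
                similarity: float, now: float,
                threshold: float = 0.93, ttl_seconds: float = 24 * 3600) -> bool:
    """Similarity alone never serves a hit; every gate must pass or the lookup misses."""
    return (
        similarity >= threshold
        and hit.scope == request_scope             # tenant, locale, version, permissions, source, policy
        and (now - hit.created_at) <= ttl_seconds  # freshness still valid
    )
```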

The decision table

| Workload | Use prompt cache? | Use semantic response cache? | Why |
| --- | --- | --- | --- |
| Repeated tool schemas | Yes | No | Same prefix, output changes |
| FAQ / docs answers | Yes | Yes | Stable facts, repeated intent |
| Account-specific support | Yes | Sometimes | Cache explanation, fill live data |
| Legal or regulated answers | Yes | Carefully | High precision, strict freshness |
| Coding agents | Yes | Rarely for final answer | Context repeats, output is task-specific |
| Incident status | Maybe | Usually no | Truth changes quickly |

The safest strategy is layered:

  1. Exact cache for identical requests.
  2. Semantic cache only for approved classes.
  3. Prompt cache for static prefixes on every model call.
  4. KV-aware routing for repeated prefixes inside the serving fleet.
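
Wired together, those four steps might look like this sketch. The `exact_store`, `semantic_cache`, and `call_model` arguments are placeholders (the semantic cache could be the toy class above); KV-aware routing is a property of the serving fleet, not of this function.

```python
import hashlib

def handle(request_text: str, scope_key: str,
           exact_store: dict, semantic_cache, call_model) -> str:
    """Layered lookup: exact hash, then gated semantic, then the model."""
    # 1. Exact cache: hash the normalized request together with its scope.
    key = hashlib.sha256(f"{scope_key}:{request_text.strip().lower()}".encode()).hexdigest()
    if key in exact_store:
        return exact_store[key]

    # 2. Semantic cache, only for approved answer classes and only when
    #    the scope/freshness gates pass inside semantic_cache.get().
    if (hit := semantic_cache.get(request_text)) is not None:
        return hit

    # 3. Fall through to the model; the stable prefix still benefits from
    #    provider prompt caching, and (4) prefix-aware routing happens in the fleet.
    answer = call_model(request_text)
    exact_store[key] = answer
    semantic_cache.put(request_text, answer)
    return answer
```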

Thresholds are product policy

Semantic cache threshold tuning is not just ML tuning. It is a product decision.

Lower threshold:

  • higher hit rate
  • lower cost
  • higher false-positive risk

Higher threshold:

  • lower hit rate
  • higher cost
  • safer answers

Start high. Shadow-test before serving. Track user corrections. Sample cache hits for human review. Give operators a purge button by canonical intent, source document, tenant, and policy version.
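
Shadow mode can be as simple as logging the would-be hit next to the live answer; a sketch, with `semantic_cache` and `live_answer_fn` passed in as placeholders.

```python
import logging
import time

log = logging.getLogger("semantic_cache_shadow")

def shadow_lookup(question: str, semantic_cache, live_answer_fn):
    """Shadow mode: compute the would-be semantic hit and log it for offline
    review, but always serve the live (uncached) answer."""
    would_be_hit = semantic_cache.get(question)
    live_answer = live_answer_fn(question)
    if would_be_hit is not None:
        log.info("shadow_hit ts=%s question=%r cached=%r live=%r",
                 time.time(), question, would_be_hit, live_answer)
    return live_answer
```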

[Figure: The semantic-cache threshold is a risk dial. A slider (shown at 0.93) illustrates the tradeoff: aggressive settings are cheap but risky and need review, conservative settings are safe but hit less often, and shadow testing measures the tradeoff before serving.]
Tune by answer class. A refund policy and a keyboard shortcut do not deserve the same threshold.
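
Per-class thresholds can be plain configuration. The classes and numbers below are illustrative only.

```python
# Illustrative thresholds per answer class; the classes and numbers are assumptions.
SEMANTIC_THRESHOLDS = {
    "keyboard_shortcut": 0.90,   # low blast radius if slightly wrong
    "faq_docs":          0.93,
    "refund_policy":     0.97,   # high precision, strict freshness
    "regulated_advice":  None,   # None = semantic cache disabled for this class
}

def threshold_for(answer_class: str) -> float | None:
    # Unknown classes default to disabled rather than to a permissive threshold.
    return SEMANTIC_THRESHOLDS.get(answer_class)
```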

The operational checklist

If you only take one thing from this post, take this:

  • Put stable prompt prefixes before volatile user text.
  • Track cached input tokens separately from normal input tokens.
  • Treat semantic hits as answers that require proof, not just similarity.
  • Store source hashes and policy versions with every cached response.
  • Cache answer plans separately from dynamic user facts.
  • Shadow semantic cache decisions before serving them.
  • Measure safe hit rate, false positives, stale-hit blocks, and cost avoided.
  • Keep a per-intent purge and refresh path.

The best cache is not the one with the biggest hit rate. It is the one that saves money without making users suspicious.

Sources worth reading