The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You
The easiest way to make LLM caching confusing is to use the word “cache” without saying what is being cached.
Are we caching input tokens? A provider-side prompt prefix? A retrieved context bundle? A full answer? A vector match to an earlier question? A KV cache block sitting on a GPU? Those are different systems with different failure modes.
This matters because caching is not just a cost trick. It changes the product contract. The user expects a correct answer, not a museum exhibit of what the model said last Tuesday.
So let us separate the layers.
The four caches people accidentally mix together
| Layer | What it reuses | Main win | Main risk |
|---|---|---|---|
| Exact response cache | Same normalized request | Cheapest hit | Low coverage |
| Semantic response cache | Same intent, different words | Big cost reduction | False positive, stale answer |
| Prompt/context cache | Same input prefix sent to model | Lower input-token cost and latency | Still generates output |
| KV/prefix cache | Internal attention state | Faster prefill and routing locality | Memory pressure, tenant boundaries |
Prompt caching and semantic caching sound similar because both avoid repeated work. They are not substitutes.
Prompt caching says: “I have already processed this prefix.” It can reduce cost and latency for repeated input tokens. OpenAI’s prompt caching, for example, is applied automatically for supported models once a prompt exceeds a minimum prefix length (1,024 tokens at the time of writing), and the API reports cached input tokens in the usage details. Google’s Gemini context caching similarly lets applications store reusable context and reference it in later requests. This is great for repeated system prompts, tool schemas, policy packs, and long static documents.
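A quick way to see this in practice: the usage payload tells you how much of your prompt was served from cache. A minimal sketch with the OpenAI Python SDK (the field names match the current API docs, but verify against your SDK version):

```python
# Sketch: observe prompt-cache hits via usage details in the OpenAI API.
from openai import OpenAI

client = OpenAI()

STABLE_PREFIX = "..."  # in practice: a long system prompt, tool schemas, policy pack

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": STABLE_PREFIX},          # repeated verbatim across calls
        {"role": "user", "content": "How do I reset SSO?"},    # volatile part last
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # tokens served from the prompt cache
print(f"input={usage.prompt_tokens} cached={cached} output={usage.completion_tokens}")
```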
Semantic caching says: “This new question means the same thing as one I answered before.” Redis LangCache and semantic cache patterns use embeddings and similarity search to reuse a prior answer when the match is strong enough. That can remove the model call entirely. It can also be wrong in a much more interesting way.
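To make the mechanism concrete, here is a toy semantic cache. The bag-of-words `embed()` is a stand-in for a real embedding model, and the linear scan stands in for a vector index like the one Redis LangCache uses:

```python
# Toy semantic response cache: embed the query, reuse a stored answer when
# similarity clears a threshold.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: bag-of-words hashed into a fixed-size vector.
    # A real deployment would call an embedding model here.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, answer)

    def get(self, query: str) -> str | None:
        q = embed(query)
        for vec, answer in self.entries:
            sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-9))
            if sim >= self.threshold:
                return answer  # hit: the model call is skipped entirely
        return None  # miss: fall through to the model

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))
```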
Prompt caching: boring, useful, and not magic
Prompt caching works best when a large prefix is repeated exactly or nearly exactly:
- system prompt
- tool schemas
- policy pack
- long product documentation
- few-shot examples
- static coding repository context
- fixed RAG preamble
The cost model is straightforward. If a provider discounts cached input tokens, or a runtime reuses prefix state, repeated long prefixes become cheaper and faster. But the model still generates output. If the user asks the same support question 10,000 times, prompt caching may reduce the input side; semantic response caching can avoid the generation completely.
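Back-of-the-envelope arithmetic makes the difference obvious. All prices and token counts below are made up for illustration; substitute your provider’s real rates:

```python
# Back-of-the-envelope cost comparison. Prices are illustrative only.
PRICE_IN = 2.50 / 1_000_000          # $/input token (assumed)
PRICE_IN_CACHED = 1.25 / 1_000_000   # $/cached input token (assumed 50% discount)
PRICE_OUT = 10.00 / 1_000_000        # $/output token (assumed)

prefix, question, answer = 6_000, 100, 400  # tokens per request (assumed)
requests = 10_000

no_cache = requests * ((prefix + question) * PRICE_IN + answer * PRICE_OUT)
prompt_cache = requests * (prefix * PRICE_IN_CACHED + question * PRICE_IN + answer * PRICE_OUT)
semantic_cache = (prefix + question) * PRICE_IN + answer * PRICE_OUT  # one real call, rest are hits

print(f"no cache:       ${no_cache:,.2f}")      # $192.50
print(f"prompt cache:   ${prompt_cache:,.2f}")  # $117.50: discounts the repeated prefix
print(f"semantic cache: ${semantic_cache:,.2f}")# $0.02: skips generation entirely
```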
Prompt caching is also sensitive to ordering. If you put highly variable user text before the stable instruction block, you make the reusable prefix shorter. Put stable content first. Put volatile content later.
Good prompt layout:

- system instructions
- tool schemas
- stable policy context
- retrieved documents
- user question

Bad prompt layout:

- user question
- session noise
- tool schemas
- stable policy context

The second layout may be semantically equivalent, but it is worse for prefix reuse.
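One way to enforce the good layout is to build messages in a fixed, stable-first order. A small illustrative helper (the names are mine, not a library API):

```python
# Illustrative helper: assemble messages so stable, cacheable content comes
# first and volatile content last, maximizing the reusable prefix.
def build_messages(system: str, tool_schemas: str, policy: str,
                   documents: str, question: str) -> list[dict]:
    return [
        {"role": "system", "content": system},        # stable, identical every call
        {"role": "system", "content": tool_schemas},  # stable
        {"role": "system", "content": policy},        # stable
        {"role": "user", "content": documents},       # semi-stable retrieved context
        {"role": "user", "content": question},        # volatile: always last
    ]
```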
Semantic caching: where the money is, and where the risk lives
Semantic caching is attractive because users rarely repeat text exactly. They ask:

- “How do I reset SSO?”
- “Where can I rotate SAML certificates?”
- “We need to update Okta metadata. What is the process?”

These might map to one canonical answer. An embedding search can find that relationship even when exact text does not match.
But semantic caching must be gated. It is not enough to say “cosine similarity is 0.91.” The cache hit also needs:
- same tenant or public scope
- same locale
- same product/version
- same permission boundary
- same source document version
- same answer type
- freshness still valid
- no policy override
If any of those fail, the system should miss. A false miss costs tokens. A false hit costs credibility.
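In code, that gate is just a conjunction: every check must pass, or the hit is discarded. A sketch, with field names that are illustrative rather than a real schema:

```python
# Sketch: a semantic hit is only served if every gate passes. Field names
# are illustrative; map them to your own metadata schema. `req` carries the
# same fields for the incoming request.
from dataclasses import dataclass
import time

@dataclass
class CachedAnswer:
    answer: str
    tenant: str
    locale: str
    product_version: str
    permission_scope: str
    source_doc_hash: str
    answer_type: str
    created_at: float
    ttl_seconds: float
    policy_version: str

def gate(hit: CachedAnswer, req) -> bool:
    """Return True only if the cached answer is safe to serve for this request."""
    checks = [
        hit.tenant == req.tenant,
        hit.locale == req.locale,
        hit.product_version == req.product_version,
        hit.permission_scope == req.permission_scope,
        hit.source_doc_hash == req.current_doc_hash,     # source document unchanged
        hit.answer_type == req.expected_answer_type,
        time.time() - hit.created_at < hit.ttl_seconds,  # freshness still valid
        hit.policy_version == req.policy_version,        # no policy override
    ]
    return all(checks)  # any failure is a miss, never a degraded hit
```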
The decision table
| Workload | Use prompt cache? | Use semantic response cache? | Why |
|---|---|---|---|
| Repeated tool schemas | Yes | No | Same prefix, output changes |
| FAQ / docs answers | Yes | Yes | Stable facts, repeated intent |
| Account-specific support | Yes | Sometimes | Cache explanation, fill live data |
| Legal or regulated answers | Yes | Carefully | High precision, strict freshness |
| Coding agents | Yes | Rarely for final answer | Context repeats, output is task-specific |
| Incident status | Maybe | Usually no | Truth changes quickly |
The safest strategy is layered:
- Exact cache for identical requests.
- Semantic cache only for approved classes.
- Prompt cache for static prefixes on every model call.
- KV-aware routing for repeated prefixes inside the serving fleet.
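Wired together, the layers look something like this sketch, where the caches and `call_model` are stand-ins for your own components:

```python
# Sketch of the layered strategy: exact match first, then a gated semantic
# hit for approved intent classes, then a model call whose stable prefix
# still benefits from provider-side prompt caching.
APPROVED_CLASSES = {"faq", "docs"}  # product policy, not ML tuning

def normalize(request) -> str:
    # Strip whitespace and casing so trivially identical requests collide.
    return " ".join(request.text.lower().split())

def answer(request, exact_cache, semantic_cache, call_model):
    key = normalize(request)
    if (hit := exact_cache.get(key)) is not None:
        return hit                                  # cheapest possible hit

    if request.intent_class in APPROVED_CLASSES:    # semantic cache only where allowed
        if (hit := semantic_cache.get(request)) is not None:
            return hit                              # gated semantic hit

    result = call_model(request)                    # prompt cache applies to the prefix
    exact_cache.put(key, result)
    if request.intent_class in APPROVED_CLASSES:
        semantic_cache.put(request, result)
    return result
```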
Thresholds are product policy
Semantic cache threshold tuning is not just ML tuning. It is a product decision.
Lower threshold:
- higher hit rate
- lower cost
- higher false-positive risk
Higher threshold:
- lower hit rate
- higher cost
- safer answers
Start high. Shadow-test before serving. Track user corrections. Sample cache hits for human review. Give operators a purge button by canonical intent, source document, tenant, and policy version.
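Shadow testing can be as simple as letting the semantic cache vote without serving. A sketch, where `compare()` is a stand-in for your own equivalence check (exact match, rubric, or LLM judge):

```python
# Shadow mode sketch: the semantic cache "votes" but never serves. Compare
# its would-be answer against the live model output to estimate the safe
# hit rate before turning the cache on.
def handle(request, semantic_cache, call_model, compare, records):
    shadow_hit = semantic_cache.get(request)  # what the cache would have served
    live = call_model(request)                # users always get the live answer

    if shadow_hit is not None:
        records.append({
            "intent": request.intent_class,
            "agrees": compare(shadow_hit, live),  # feed this into threshold tuning
        })
    return live
```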
The operational checklist
If you only take one thing from this post, take this:
- Put stable prompt prefixes before volatile user text.
- Track cached input tokens separately from normal input tokens.
- Treat semantic hits as answers that require proof, not just similarity.
- Store source hashes and policy versions with every cached response.
- Cache answer plans separately from dynamic user facts.
- Shadow semantic cache decisions before serving them.
- Measure safe hit rate, false positives, stale-hit blocks, and cost avoided.
- Keep a per-intent purge and refresh path.
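Those measurements do not need fancy infrastructure to start. A minimal sketch of the counters, with the caveat that “false positive” labeling depends on your review loop:

```python
# Sketch of the metrics named above, as plain counters. "Safe hit rate"
# counts only hits that passed every gate; stale-hit blocks are gated
# rejections that would otherwise have been served.
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    hits_served: int = 0      # passed all gates and were returned
    false_positives: int = 0  # served, later flagged wrong (user correction, review)
    stale_blocks: int = 0     # similarity matched but a gate failed
    misses: int = 0
    tokens_avoided: int = 0   # input + output tokens of skipped model calls

    def safe_hit_rate(self) -> float:
        total = self.hits_served + self.stale_blocks + self.misses
        return self.hits_served / total if total else 0.0
```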
The best cache is not the one with the biggest hit rate. It is the one that saves money without making users suspicious.
Sources worth reading
- OpenAI prompt caching and API usage details for cached tokens.
- Google Gemini context caching for reusable long context.
- Redis LangCache and Redis semantic caching patterns.
- LangChain caching integrations for application-level LLM caches.
- vLLM Automatic Prefix Caching for runtime prefix reuse.
