The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You
The easiest way to make LLM caching confusing is to use the word “cache” without saying what is being cached.
Are we caching input tokens? A provider-side prompt prefix? A retrieved context bundle? A full answer? A vector match to an earlier question? A KV cache block sitting on a GPU? Those are different systems with different failure modes.
This matters because caching is not just a cost trick. It changes the product contract. The user expects a correct answer, not a museum exhibit of what the model said last Tuesday.
So let us separate the layers.
The four caches people accidentally mix together
| Layer | What it reuses | Main win | Main risk |
|---|---|---|---|
| Exact response cache | Same normalized request | Cheapest hit | Low coverage |
| Semantic response cache | Same intent, different words | Big cost reduction | False positive, stale answer |
| Prompt/context cache | Same input prefix sent to model | Lower input-token cost and latency | Still generates output |
| KV/prefix cache | Internal attention state | Faster prefill and routing locality | Memory pressure, tenant boundaries |
Prompt caching and semantic caching sound similar because both avoid repeated work. They are not substitutes.
Prompt caching says: “I have already processed this prefix.” It can reduce cost and latency for repeated input tokens. OpenAI’s prompt caching, for example, is applied automatically for supported models once a prompt exceeds a minimum prefix length (1,024 tokens at the time of writing), and the API reports cached input tokens in the usage details. Google’s Gemini context caching similarly lets applications store reusable context and reference it in later requests. This is great for repeated system prompts, tool schemas, policy packs, and long static documents.
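A quick way to see this in practice: the usage payload tells you how much of your prompt was served from cache. A minimal sketch with the OpenAI Python SDK (the field names match the current API docs, but verify against your SDK version):

```python
# Sketch: observe prompt-cache hits via usage details in the OpenAI API.
from openai import OpenAI

client = OpenAI()

STABLE_PREFIX = "..."  # in practice: a long system prompt, tool schemas, policy pack

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": STABLE_PREFIX},          # repeated verbatim across calls
        {"role": "user", "content": "How do I reset SSO?"},    # volatile part last
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # tokens served from the prompt cache
print(f"input={usage.prompt_tokens} cached={cached} output={usage.completion_tokens}")
```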
Semantic caching says: “This new question means the same thing as one I answered before.” Redis LangCache and semantic cache patterns use embeddings and similarity search to reuse a prior answer when the match is strong enough. That can remove the model call entirely. It can also be wrong in a much more interesting way.
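To make the mechanism concrete, here is a toy semantic cache. The bag-of-words `embed()` is a stand-in for a real embedding model, and the linear scan stands in for a vector index like the one Redis LangCache uses:

```python
# Toy semantic response cache: embed the query, reuse a stored answer when
# similarity clears a threshold.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: bag-of-words hashed into a fixed-size vector.
    # A real deployment would call an embedding model here.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, answer)

    def get(self, query: str) -> str | None:
        q = embed(query)
        for vec, answer in self.entries:
            sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-9))
            if sim >= self.threshold:
                return answer  # hit: the model call is skipped entirely
        return None  # miss: fall through to the model

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))
```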
Prompt caching: boring, useful, and not magic
Prompt caching works best when a large prefix is repeated exactly or nearly exactly:
- system prompt
- tool schemas
- policy pack
- long product documentation
- few-shot examples
- static coding repository context
- fixed RAG preamble
The cost model is straightforward. If a provider discounts cached input tokens, or a runtime reuses prefix state, repeated long prefixes become cheaper and faster. But the model still generates output. If the user asks the same support question 10,000 times, prompt caching may reduce the input side; semantic response caching can avoid the generation completely.
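Back-of-the-envelope arithmetic makes the difference obvious. All prices and token counts below are made up for illustration; substitute your provider’s real rates:

```python
# Back-of-the-envelope cost comparison. Prices are illustrative only.
PRICE_IN = 2.50 / 1_000_000          # $/input token (assumed)
PRICE_IN_CACHED = 1.25 / 1_000_000   # $/cached input token (assumed 50% discount)
PRICE_OUT = 10.00 / 1_000_000        # $/output token (assumed)

prefix, question, answer = 6_000, 100, 400  # tokens per request (assumed)
requests = 10_000

no_cache = requests * ((prefix + question) * PRICE_IN + answer * PRICE_OUT)
prompt_cache = requests * (prefix * PRICE_IN_CACHED + question * PRICE_IN + answer * PRICE_OUT)
semantic_cache = (prefix + question) * PRICE_IN + answer * PRICE_OUT  # one real call, rest are hits

print(f"no cache:       ${no_cache:,.2f}")      # $192.50
print(f"prompt cache:   ${prompt_cache:,.2f}")  # $117.50: discounts the repeated prefix
print(f"semantic cache: ${semantic_cache:,.2f}")# $0.02: skips generation entirely
```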
Prompt caching is also sensitive to ordering. If you put highly variable user text before the stable instruction block, you make the reusable prefix shorter. Put stable content first. Put volatile content later.
Good prompt layout:

- system instructions
- tool schemas
- stable policy context
- retrieved documents
- user question

Bad prompt layout:

- user question
- session noise
- tool schemas
- stable policy context

The second layout may be semantically equivalent, but it is worse for prefix reuse.
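One way to enforce the good layout is to build messages in a fixed, stable-first order. A small illustrative helper (the names are mine, not a library API):

```python
# Illustrative helper: assemble messages so stable, cacheable content comes
# first and volatile content last, maximizing the reusable prefix.
def build_messages(system: str, tool_schemas: str, policy: str,
                   documents: str, question: str) -> list[dict]:
    return [
        {"role": "system", "content": system},        # stable, identical every call
        {"role": "system", "content": tool_schemas},  # stable
        {"role": "system", "content": policy},        # stable
        {"role": "user", "content": documents},       # semi-stable retrieved context
        {"role": "user", "content": question},        # volatile: always last
    ]
```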
Semantic caching: where the money is, and where the risk lives
Semantic caching is attractive because users rarely repeat text exactly. They ask:

- “How do I reset SSO?”
- “Where can I rotate SAML certificates?”
- “We need to update Okta metadata. What is the process?”

These might map to one canonical answer. An embedding search can find that relationship even when exact text does not match.
But semantic caching must be gated. It is not enough to say “cosine similarity is 0.91.” The cache hit also needs:
- same tenant or public scope
- same locale
- same product/version
- same permission boundary
- same source document version
- same answer type
- freshness still valid
- no policy override
If any of those fail, the system should miss. A false miss costs tokens. A false hit costs credibility.
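In code, that gate is just a conjunction: every check must pass, or the hit is discarded. A sketch, with field names that are illustrative rather than a real schema:

```python
# Sketch: a semantic hit is only served if every gate passes. Field names
# are illustrative; map them to your own metadata schema. `req` carries the
# same fields for the incoming request.
from dataclasses import dataclass
import time

@dataclass
class CachedAnswer:
    answer: str
    tenant: str
    locale: str
    product_version: str
    permission_scope: str
    source_doc_hash: str
    answer_type: str
    created_at: float
    ttl_seconds: float
    policy_version: str

def gate(hit: CachedAnswer, req) -> bool:
    """Return True only if the cached answer is safe to serve for this request."""
    checks = [
        hit.tenant == req.tenant,
        hit.locale == req.locale,
        hit.product_version == req.product_version,
        hit.permission_scope == req.permission_scope,
        hit.source_doc_hash == req.current_doc_hash,     # source document unchanged
        hit.answer_type == req.expected_answer_type,
        time.time() - hit.created_at < hit.ttl_seconds,  # freshness still valid
        hit.policy_version == req.policy_version,        # no policy override
    ]
    return all(checks)  # any failure is a miss, never a degraded hit
```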
The decision table
| Workload | Use prompt cache? | Use semantic response cache? | Why |
|---|---|---|---|
| Repeated tool schemas | Yes | No | Same prefix, output changes |
| FAQ / docs answers | Yes | Yes | Stable facts, repeated intent |
| Account-specific support | Yes | Sometimes | Cache explanation, fill live data |
| Legal or regulated answers | Yes | Carefully | High precision, strict freshness |
| Coding agents | Yes | Rarely for final answer | Context repeats, output is task-specific |
| Incident status | Maybe | Usually no | Truth changes quickly |
The safest strategy is layered:
- Exact cache for identical requests.
- Semantic cache only for approved classes.
- Prompt cache for static prefixes on every model call.
- KV-aware routing for repeated prefixes inside the serving fleet.
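Wired together, the layers look something like this sketch, where the caches and `call_model` are stand-ins for your own components:

```python
# Sketch of the layered strategy: exact match first, then a gated semantic
# hit for approved intent classes, then a model call whose stable prefix
# still benefits from provider-side prompt caching.
APPROVED_CLASSES = {"faq", "docs"}  # product policy, not ML tuning

def normalize(request) -> str:
    # Strip whitespace and casing so trivially identical requests collide.
    return " ".join(request.text.lower().split())

def answer(request, exact_cache, semantic_cache, call_model):
    key = normalize(request)
    if (hit := exact_cache.get(key)) is not None:
        return hit                                  # cheapest possible hit

    if request.intent_class in APPROVED_CLASSES:    # semantic cache only where allowed
        if (hit := semantic_cache.get(request)) is not None:
            return hit                              # gated semantic hit

    result = call_model(request)                    # prompt cache applies to the prefix
    exact_cache.put(key, result)
    if request.intent_class in APPROVED_CLASSES:
        semantic_cache.put(request, result)
    return result
```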
Thresholds are product policy
Semantic cache threshold tuning is not just ML tuning. It is a product decision.
Lower threshold:
- higher hit rate
- lower cost
- higher false-positive risk
Higher threshold:
- lower hit rate
- higher cost
- safer answers
Start high. Shadow-test before serving. Track user corrections. Sample cache hits for human review. Give operators a purge button by canonical intent, source document, tenant, and policy version.
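Shadow testing can be as simple as letting the semantic cache vote without serving. A sketch, where `compare()` is a stand-in for your own equivalence check (exact match, rubric, or LLM judge):

```python
# Shadow mode sketch: the semantic cache "votes" but never serves. Compare
# its would-be answer against the live model output to estimate the safe
# hit rate before turning the cache on.
def handle(request, semantic_cache, call_model, compare, records):
    shadow_hit = semantic_cache.get(request)  # what the cache would have served
    live = call_model(request)                # users always get the live answer

    if shadow_hit is not None:
        records.append({
            "intent": request.intent_class,
            "agrees": compare(shadow_hit, live),  # feed this into threshold tuning
        })
    return live
```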
The operational checklist
If you only take one thing from this post, take this:
- Put stable prompt prefixes before volatile user text.
- Track cached input tokens separately from normal input tokens.
- Treat semantic hits as answers that require proof, not just similarity.
- Store source hashes and policy versions with every cached response.
- Cache answer plans separately from dynamic user facts.
- Shadow semantic cache decisions before serving them.
- Measure safe hit rate, false positives, stale-hit blocks, and cost avoided.
- Keep a per-intent purge and refresh path.
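Those measurements do not need fancy infrastructure to start. A minimal sketch of the counters, with the caveat that “false positive” labeling depends on your review loop:

```python
# Sketch of the metrics named above, as plain counters. "Safe hit rate"
# counts only hits that passed every gate; stale-hit blocks are gated
# rejections that would otherwise have been served.
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    hits_served: int = 0      # passed all gates and were returned
    false_positives: int = 0  # served, later flagged wrong (user correction, review)
    stale_blocks: int = 0     # similarity matched but a gate failed
    misses: int = 0
    tokens_avoided: int = 0   # input + output tokens of skipped model calls

    def safe_hit_rate(self) -> float:
        total = self.hits_served + self.stale_blocks + self.misses
        return self.hits_served / total if total else 0.0
```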
The best cache is not the one with the biggest hit rate. It is the one that saves money without making users suspicious.
Sources worth reading
- OpenAI prompt caching and API usage details for cached tokens.
- Google Gemini context caching for reusable long context.
- Redis LangCache and Redis semantic caching patterns.
- LangChain caching integrations for application-level LLM caches.
- vLLM Automatic Prefix Caching for runtime prefix reuse.
