Reduce LLM Inference Cost by 60% Without Serving Stale Answers
Here is a very real production shape:
100,000 LLM queries / day
$0.40 average cost / query
= $40,000 / day
60,000 queries are slight variations of the same 200 questions
If every one of those requests goes to the model, the system is not being thoughtful. It is paying a very expensive intern to re-type the FAQ all day.
The tempting answer is “just cache it.” That is also how you accidentally serve yesterday’s policy, a stale price, a hallucinated product detail, or a response that was correct for one tenant but wrong for another.
The right answer is not a dumb cache. It is a freshness-aware semantic answer system:
- semantic response cache for repeated questions written in different words
- freshness contract attached to every cached answer
- deterministic invalidation for source changes, policy changes, and tenant permissions
- prompt/KV caching underneath for requests that still need model generation
- cache-aware routing so misses are still cheaper and faster than random placement
That sounds like a lot. It is. But the shape is clean once you separate the layers.
The math: 60% is the ceiling, not the promise
If 60,000 of 100,000 daily requests are variants of the same 200 questions, the absolute best response-cache result is close to 60% model-call avoidance. Not exactly 60%, because you still need to generate or refresh canonical answers.
If those 200 canonical answers refresh once per day:
Before:
100,000 model calls x $0.40 = $40,000 / day
After:
40,000 unique calls
+ 200 canonical refresh calls
= 40,200 model calls
40,200 x $0.40 = $16,080 / day
Savings:
$23,920 / day
= 59.8% before cache infrastructure cost
If the same 200 answers refresh hourly:
40,000 unique calls
+ (200 canonical answers x 24 refreshes)
= 44,800 model calls
44,800 x $0.40 = $17,920 / day
Savings:
= 55.2% before cache infrastructure cost
That is still serious money. The point is not that every workload gets 60%. The point is that repeated intent changes the economics from “generate every time” to “generate when truth changes.”
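If you want to sanity-check that ceiling against your own traffic, the arithmetic fits in a few lines. A minimal sketch in Python, with every input treated as an assumption you replace with your own numbers:

```python
# Back-of-the-envelope savings model. It only reproduces the arithmetic above;
# the traffic volumes, intent count, and cost per call are all assumptions.
def daily_savings(total_queries: int, repeated_queries: int, canonical_intents: int,
                  cost_per_call: float, refreshes_per_day: int) -> dict:
    unique_calls = total_queries - repeated_queries           # still need the model
    refresh_calls = canonical_intents * refreshes_per_day     # keep canonical answers fresh
    before = total_queries * cost_per_call
    after = (unique_calls + refresh_calls) * cost_per_call
    return {"before": before, "after": after,
            "savings": before - after,
            "savings_pct": round(100 * (before - after) / before, 1)}

print(daily_savings(100_000, 60_000, 200, 0.40, refreshes_per_day=1))   # ~59.8%
print(daily_savings(100_000, 60_000, 200, 0.40, refreshes_per_day=24))  # ~55.2%
```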
Do not confuse the four caches
People say “cache” and then mix four different ideas:
| Cache | What it saves | What it does not save |
|---|---|---|
| Exact response cache | Identical request text | Paraphrases |
| Semantic response cache | Similar user intent | Unsafe or stale matches |
| Retrieval/tool cache | Repeated context gathering | Final generation |
| Prompt/KV cache | Recomputing shared prefixes | Output-token generation |
You need all four in serious systems, but they sit at different layers.
Semantic response caching is the big lever for the problem above. Redis describes this pattern as storing and reusing previous LLM responses for repeated queries, where paraphrases like “features of Product A” and “main features of Product A” can reuse the same answer when similarity is high enough. That is response-level avoidance.
Prompt and KV caching are different. OpenAI’s prompt caching discounts and speeds up reused input-token prefixes. vLLM Automatic Prefix Caching reuses KV cache when requests share a prefix, which helps prefill. TensorRT-LLM’s KV cache supports reuse across requests, offloading, and prioritized eviction. Dynamo’s router can use KV overlap and load to route requests toward workers with useful cached blocks.
Those runtime features are excellent. They just do not eliminate the whole generation for a paraphrased question. They make the miss path cheaper. The response cache removes the model call entirely when it is safe.
The production architecture
The architecture I would build has seven decisions in the request path.
1. Normalize the request
Normalize before lookup:
- trim whitespace
- canonicalize casing where safe
- remove tracking noise
- normalize product names and aliases
- preserve locale, tenant, role, and entitlement data separately
- keep the original text for observability and user experience
Do not over-normalize. “How do I reset my password?” and “Reset the password for user Alice” are not the same operation. Normalization should reduce noise, not erase intent.
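A minimal normalization sketch, assuming a hypothetical product-alias map; tenant, locale, role, and entitlements stay as separate fields, and the original wording is kept for observability:

```python
import re

# Hypothetical alias map; in practice this comes from your product catalog.
PRODUCT_ALIASES = {"acme-pro": "acme pro", "acmepro": "acme pro"}

def normalize_question(raw: str) -> str:
    text = raw.strip().lower()                        # casing is safe to fold for FAQ-style text
    text = re.sub(r"\s+", " ", text)                  # collapse whitespace
    text = re.sub(r"[?!.]+$", "", text)               # drop trailing punctuation runs
    for alias, canonical in PRODUCT_ALIASES.items():  # map product-name aliases
        text = text.replace(alias, canonical)
    return text

# Locale, tenant, role, and entitlements travel alongside the text, never inside it,
# and the original wording is kept for logs and UX.
request = {
    "original": "  How do I reset my  password?? ",
    "normalized": normalize_question("  How do I reset my  password?? "),
    "tenant": "acme", "locale": "en-US", "scopes": ["docs:read"],
}
```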
2. Classify cache eligibility
Not every question deserves semantic caching.
Good candidates:
- product FAQ
- documentation Q&A
- internal policy explanations
- “how do I” help
- troubleshooting steps with stable source docs
- repeated onboarding questions
Bad candidates:
- user-specific account state
- legal, medical, or financial advice without review gates
- live inventory, price, or incident status
- queries containing secrets or sensitive personal data
- long reasoning tasks where the wording changes the answer
- anything requiring fresh tool execution
The cache should have a strong “no” muscle. A false miss costs money. A false hit costs trust.
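A sketch of that “no” muscle, assuming an upstream intent classifier and illustrative category names; anything unknown, tool-dependent, or carrying secrets falls through to not-cacheable:

```python
import re

# Hypothetical intent categories produced by an upstream classifier.
CACHEABLE_CATEGORIES = {"faq", "docs_qa", "policy_explainer", "how_to", "troubleshooting"}
BLOCKED_CATEGORIES = {"account_state", "live_status", "legal_advice", "long_reasoning", "tool_required"}

SECRET_PATTERN = re.compile(r"(api[_-]?key|password|ssn|credit card)", re.IGNORECASE)

def cache_eligible(category: str, question: str) -> bool:
    if category in BLOCKED_CATEGORIES:
        return False
    if SECRET_PATTERN.search(question):      # never cache questions carrying secrets or PII
        return False
    # Default to "no": unknown categories stay uncached until someone reviews them.
    return category in CACHEABLE_CATEGORIES
```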
3. Search exact cache first
Exact cache is cheap and safe. Use a key like:
sha256(
model_family
+ prompt_template_version
+ tenant_id
+ locale
+ normalized_question
+ entitlement_scope
)
If it hits and the freshness contract is still valid, return it. No vector search needed.
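A direct translation of that key into Python; the field names follow the sketch above, and the separator byte is just a collision guard:

```python
import hashlib

def exact_cache_key(model_family: str, prompt_template_version: str, tenant_id: str,
                    locale: str, normalized_question: str, entitlement_scope: str) -> str:
    # "\x1f" separates fields so "ab" + "c" and "a" + "bc" cannot collide.
    material = "\x1f".join([model_family, prompt_template_version, tenant_id,
                            locale, normalized_question, entitlement_scope])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

key = exact_cache_key("llama-3.1-70b", "support-answer-v8", "public",
                      "en-US", "what is the refund policy after cancellation", "docs:read")
```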
4. Search semantic cache next
On an exact miss:
- embed the normalized question
- search nearest cached questions
- apply a high similarity threshold
- optionally rerank near-threshold candidates
- verify tenant, locale, source version, policy version, and answer type
Redis’s semantic caching guide describes this shape: convert query to vector, run similarity search, and return a cached answer only when the similarity score exceeds the chosen threshold. It also calls out the precision-recall tradeoff: lower thresholds improve hit rate but increase wrong-answer risk.
I would start conservative:
| Question class | Starting threshold | Why |
|---|---|---|
| FAQ / docs | 0.90-0.95 | High precision, easy wins |
| Support troubleshooting | 0.88-0.93 | Similar wording often maps well |
| Policy explanation | 0.92-0.97 | Stale or wrong policy is painful |
| User-specific task | disabled | Run the tool |
| Compliance-sensitive answer | disabled or human-reviewed | Do not gamble |
This is not a universal table. It is a safe starting posture. Your telemetry should tune it.
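A minimal in-memory sketch of the semantic lookup, assuming you already have an embedding for the normalized question and a small list of cached entries; a production system would use a vector index, but the decision logic is the same:

```python
import numpy as np

# Class-specific starting thresholds, mirroring the table above; tune from telemetry.
THRESHOLDS = {"faq": 0.92, "troubleshooting": 0.90, "policy": 0.95}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(query_vec: np.ndarray, entries: list, tenant: str,
                    locale: str, question_class: str):
    threshold = THRESHOLDS.get(question_class)
    if threshold is None:                          # class not enabled for semantic caching
        return None
    best, best_score = None, 0.0
    for entry in entries:                          # entry: dict with "embedding", scopes, answer
        if entry["tenant_scope"] != tenant or entry["locale"] != locale:
            continue                               # never match across tenant or locale
        score = cosine(query_vec, entry["embedding"])
        if score > best_score:
            best, best_score = entry, score
    if best is not None and best_score >= threshold:
        return best, best_score                    # still subject to the freshness gate below
    return None
```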
5. Run the freshness gate
The cached answer is not just text. It is an object with a contract:
{
"answer_id": "faq.billing.refunds.v14",
"canonical_question": "How do refunds work?",
"answer_text": "...",
"source_ids": ["docs/billing/refunds.md"],
"source_versions": ["sha256:7f3a..."],
"prompt_template_version": "support-answer-v8",
"model_family": "llama-3.1-70b",
"tenant_scope": "public-docs",
"locale": "en-US",
"freshness_class": "policy",
"expires_at": "2026-05-05T18:30:00Z",
"last_verified_at": "2026-05-05T09:30:00Z"
}
The freshness gate checks:
- source document hash still matches
- policy version still matches
- tenant permissions still match
- locale still matches
- model/prompt version is still allowed
- answer has not expired
- no red-team or quality rule has blocked this answer family
If any check fails, it is a cache miss. Smile politely and generate a new answer.
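A sketch of the gate, with field names following the contract object above plus a hypothetical policy_version field pinned at store time; any failed check is a miss:

```python
from datetime import datetime, timezone

def freshness_gate(entry: dict, ctx: dict) -> bool:
    """Return True only when every check passes; any failure is treated as a cache miss."""
    now = datetime.now(timezone.utc)
    expires_at = datetime.fromisoformat(entry["expires_at"].replace("Z", "+00:00"))
    checks = [
        ctx["source_versions"] == entry["source_versions"],             # source doc hashes unchanged
        ctx["policy_version"] == entry.get("policy_version"),           # assumed field, pinned at store time
        ctx["tenant_scope"] == entry["tenant_scope"],
        ctx["locale"] == entry["locale"],
        entry["prompt_template_version"] in ctx["allowed_prompt_versions"],
        entry["model_family"] in ctx["allowed_model_families"],
        now < expires_at,                                               # not expired
        entry["answer_id"] not in ctx["blocked_answer_ids"],            # no red-team or quality block
    ]
    return all(checks)
```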
6. Compose dynamic slots late
This is how you avoid the “cached and stale” smell.
Do not cache:
"Your current balance is $184.22 and your renewal date is June 8."Cache:
Answer plan:
- explain how billing renewal works
- include account balance from billing API
- include renewal date from subscription API
- link to billing settings
Then fill the dynamic slots from live systems after the cache hit.
This keeps the expensive language work cached while still making the answer feel current. The model wrote the stable explanation once. Your application fills the facts every time.
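A sketch of late slot filling, where `billing_api` and `subscription_api` stand in for your internal clients; the template text is the cached part, the slot values never are:

```python
# Cached: a stable explanation with named slots. Live: slot values fetched per request.
CACHED_PLAN = {
    "template": ("Renewals are charged automatically at the start of each billing cycle. "
                 "Your current balance is {balance} and your plan renews on {renewal_date}. "
                 "You can manage this in billing settings: {settings_url}"),
    "slots": ["balance", "renewal_date", "settings_url"],
}

def compose_answer(plan: dict, user_id: str, billing_api, subscription_api) -> str:
    live = {
        "balance": billing_api.get_balance(user_id),             # fetched at serve time
        "renewal_date": subscription_api.get_renewal(user_id),   # never stored in the cache
        "settings_url": "https://app.example.com/billing",
    }
    return plan["template"].format(**{slot: live[slot] for slot in plan["slots"]})
```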
7. Store misses only after quality checks
A miss should not blindly populate the cache. Store only when:
- generation completed successfully
- answer passed policy and quality checks
- sources are known
- freshness class is assigned
- tenant scope is safe
- output is not too personalized
- the route is worth caching
If a one-off weird request misses, do not memorialize it forever. Caches should have taste.
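A small write-gate sketch; every field is assumed to be set by upstream generation, safety, and classification steps:

```python
def should_store(result: dict) -> bool:
    """Gate writes into the semantic cache; all flags come from upstream checks."""
    return (
        result["generation_ok"]
        and result["quality_status"] == "approved"
        and bool(result["source_ids"])                 # sources are known
        and result["freshness_class"] is not None
        and result["tenant_scope_safe"]
        and not result["contains_personal_data"]       # use the dynamic-slot pattern instead
        and result["route_cache_worthy"]               # skip one-off weird requests
    )
```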
The freshness contract
The trick is not “cache for 24 hours.” The trick is “cache according to the volatility of truth.”
| Freshness class | Example | Cache strategy |
|---|---|---|
| Static | “How do I enable SSO?” | Long TTL, invalidate on doc update |
| Slow policy | “What is the refund policy?” | Medium TTL, invalidate on policy version |
| Dynamic slots | “When does my plan renew?” | Cache explanation, fill account data live |
| Live operational | “Is service X down?” | Do not cache final answer; cache tool result briefly |
| Regulated | “Can I do X with customer data?” | Short TTL, strict source pinning, optional review |
This is where many systems fail. They cache text but not truth. The cached answer needs to know why it was true.
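The same table as a configuration sketch, with illustrative TTLs and invalidation triggers rather than recommendations:

```python
# Illustrative freshness-class policies; tune per workload and per tenant.
FRESHNESS_POLICY = {
    "static":    {"ttl_seconds": 7 * 86400, "invalidate_on": ["doc_update"]},
    "policy":    {"ttl_seconds": 86400,     "invalidate_on": ["policy_version_change"]},
    "dynamic":   {"ttl_seconds": 86400,     "invalidate_on": ["doc_update"], "live_slots": True},
    "live":      {"ttl_seconds": 0,         "invalidate_on": [], "cache_final_answer": False},
    "regulated": {"ttl_seconds": 3600,      "invalidate_on": ["doc_update", "policy_version_change"],
                  "requires_review": True},
}
```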
How this reaches 60% without feeling cached
The user should not feel like they got a stale canned response. They should feel like the system was fast and competent.
That requires three design choices.
Cache canonical answers, not user phrasing
The user asks:
"Can I get money back if I cancel after two weeks?"The canonical question might be:
"What is the refund policy after cancellation?"Store the answer under the canonical intent. Keep the user’s original wording for logs and analytics, but do not make every phrasing a new product truth.
Return stable explanation, fill live facts
For policy, docs, and help, a cached explanation is fine.
For account-specific answers, split the response:
cached:
explanation of how renewals work
live:
user's renewal date
current plan
current balance
eligible actions
This is the difference between “cached answer” and “cached brain with live eyes.”
Use stale-while-revalidate for popular intents
For the top 200 questions, do not wait for users to discover staleness.
Run a background job:
- refresh popular intents on schedule
- refresh immediately when source docs change
- compare new answer to old answer
- alert when the answer changed materially
- keep the old answer only if its freshness contract still passes
For slow-changing docs, this keeps latency low. For policy changes, it gives you a controlled rollout instead of surprise.
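A sketch of the background refresher; `generate`, `cache`, and `notify` are placeholders for your model call, cache store, and alerting, and the drift threshold is an assumption to tune:

```python
import difflib

def refresh_canonical(intent, generate, cache, notify, drift_threshold: float = 0.85):
    """Refresh one canonical intent in the background instead of waiting for a user to hit staleness."""
    old = cache.get(intent.answer_id)                        # previously cached entry, or None
    new_text = generate(intent.canonical_question)           # fresh model call
    if old is not None:
        ratio = difflib.SequenceMatcher(None, old["answer_text"], new_text).ratio()
        if ratio < drift_threshold:                          # answer changed materially
            notify(f"{intent.answer_id} drifted: similarity {ratio:.2f} against previous answer")
    cache.put(intent.answer_id, {"answer_text": new_text})   # new entry replaces the old one
```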
Where NVIDIA-style runtime caching fits
The response cache is above the model. It decides whether the model should be called at all.
If the request misses, the runtime layer still matters a lot:
- vLLM Automatic Prefix Caching can reuse KV cache for shared prompt prefixes and avoid recomputing the shared prefill.
- SGLang RadixAttention organizes common prefixes in a radix tree so requests with shared context can reuse KV state.
- TensorRT-LLM supports KV reuse across requests, plus tools such as offloading and prioritized eviction.
- Dynamo adds KV-aware routing and distributed cache coordination, so misses can be routed toward workers with useful cache overlap while still respecting load.
This is the hierarchy I like:
semantic cache hit:
skip model generation
semantic cache miss:
use prompt/KV caching
use KV-aware routing
use batching and good inference kernels
That is a strong systems story. The gateway saves the obvious money. The inference runtime saves the remaining money. The GPU still does the hard work, but it stops doing the same hard work for no reason.
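Most of the miss-path wins are configuration rather than new code. As one example, vLLM exposes automatic prefix caching as an engine option; the model name here is illustrative, and the flag should be checked against your vLLM version:

```python
from vllm import LLM, SamplingParams

# Requests that share a system prompt / few-shot prefix reuse the cached KV blocks
# for that prefix, so only the unique suffix pays full prefill cost.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support assistant for Acme. Policies: ...\n\n"
outputs = llm.generate(
    [shared_prefix + "User: Can I get a refund if I cancel next week?",
     shared_prefix + "User: How do I enable SSO?"],
    SamplingParams(max_tokens=256),
)
```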
A practical scoring function
Semantic cache hit decisions should be boring and auditable:
eligible =
cache_allowed
and tenant_match
and locale_match
and source_versions_current
and policy_version_current
and similarity_score >= threshold
and answer_type_safe
and quality_status == "approved"
For near-threshold matches, use a second check:
if 0.86 <= similarity_score < 0.92:
run lightweight reranker or verifier
else if similarity_score >= 0.92:
serve if freshness passes
else:
miss
Do not use one global threshold. A product FAQ and a compliance answer do not deserve the same risk setting.
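A sketch of an auditable decision function combining the gates, class-specific thresholds, and the near-threshold verifier band; the numbers mirror the starting table above and are not recommendations:

```python
def cache_decision(candidate, similarity: float, gates: dict, question_class: str, reranker) -> str:
    """Hit/miss decision that stays boring and auditable."""
    # Class-specific serve thresholds; the verify band sits just below each one.
    serve_at = {"faq": 0.92, "troubleshooting": 0.90, "policy": 0.95}.get(question_class)
    if serve_at is None or not all(gates.values()):
        return "miss"                          # class disabled, or a tenant/locale/freshness gate failed
    if similarity >= serve_at:
        return "hit"
    if similarity >= serve_at - 0.06 and reranker(candidate):
        return "hit_verified"                  # near-threshold match confirmed by a lightweight verifier
    return "miss"
```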
What the API could look like
The gateway API can keep this simple:
{
"query": "Can I get a refund if I cancel next week?",
"tenant": "public",
"locale": "en-US",
"user_scope": ["docs:read"],
"cache_policy": {
"semantic": true,
"min_similarity": 0.93,
"allow_dynamic_slots": true,
"max_staleness_seconds": 3600
}
}
Response on hit:
{
"answer": "Refund eligibility depends on your plan and cancellation window...",
"cache": {
"status": "semantic_hit",
"similarity": 0.96,
"canonical_question": "What is the refund policy after cancellation?",
"freshness": "valid",
"source_versions": ["sha256:7f3a..."]
}
}
Response on miss:
{
"answer": "Refund eligibility depends on your plan and cancellation window...",
"cache": {
"status": "miss",
"stored": true,
"freshness_class": "policy"
}
}
Expose the cache metadata internally, not necessarily to end users. Operators need to know why an answer was served.
The dashboard I would build
If this system is saving tens of thousands of dollars a day, the dashboard should not be decorative.
Track:
- request count
- exact hit rate
- semantic hit rate
- miss rate
- false-positive reports
- stale-blocked hits
- freshness-gate failures by reason
- model calls avoided
- estimated dollars avoided
- p50 / p95 latency by hit type
- answer refresh count
- cache entries by freshness class
- top canonical intents
- cache hit rate by tenant and locale
- post-hit user correction rate
The most important chart is not hit rate. It is safe hit rate:
safe hit rate =
cache hits that passed freshness and quality gates
/ total requests
A high hit rate with wrong answers is not optimization. It is a very fast apology generator.
Failure modes to design for
False semantic match
Two questions look similar but need different answers.
Mitigation:
- higher threshold
- class-specific threshold
- rerank near threshold
- exact entity matching
- user feedback loop
- do not cache high-risk classes
Stale source
The source document changed after the answer was cached.
Mitigation:
- source hash in cache record
- event-driven invalidation
- background refresh
- stale-while-revalidate only for low-risk content
Tenant leakage
A cached answer from one customer is served to another.
Mitigation:
- tenant in cache key
- entitlement scope in cache key
- no cross-tenant semantic cache unless content is explicitly public
Personalized answer cached as generic
The answer includes a user’s account state and gets reused.
Mitigation:
- PII detection before store
- dynamic slot pattern
- answer-type classifier
- deny-list for tool-derived personal data
Cache poisoning
Bad or malicious output gets cached and amplified.
Mitigation:
- store only after safety and quality checks
- approval gate for high-traffic canonical answers
- signed source versions
- admin purge by canonical intent
Rollout plan
Do not launch this by caching everything.
Week 1: measure repetition
Cluster requests offline:
- normalize questions
- embed them
- find top repeated intents
- estimate potential savings
- label risky categories
You want to know whether the “same 200 questions” claim is true in your logs. If it is, the project funds itself.
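A rough offline clustering sketch for that measurement, assuming an `embed` function and a greedy centroid pass; it is deliberately crude, because all week 1 needs is the order of magnitude:

```python
import numpy as np

def top_repeated_intents(questions, embed, similarity: float = 0.92, cost_per_call: float = 0.40):
    """Greedy clustering of normalized questions by embedding similarity.
    `embed` is your embedding function; the threshold mirrors the cache's."""
    centroids, counts = [], []
    for vec in (embed(q) for q in questions):
        vec = vec / np.linalg.norm(vec)
        if centroids:
            sims = np.array(centroids) @ vec            # cosine similarity to existing clusters
            best = int(np.argmax(sims))
            if sims[best] >= similarity:
                counts[best] += 1
                continue
        centroids.append(vec)                           # new canonical intent
        counts.append(1)
    repeated = sum(c - 1 for c in counts if c > 1)      # calls a response cache could have avoided
    return {"intents": len(centroids), "avoidable_calls": repeated,
            "daily_savings_estimate": repeated * cost_per_call}
```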
Week 2: exact cache and canonical catalog
Create the canonical question catalog:
- top intents
- approved answers
- source docs
- freshness class
- owner
- TTL
Start exact caching first. It is boring, which is perfect.
Week 3: semantic cache in shadow mode
Run semantic lookup but do not serve it yet.
For every request, record:
- nearest canonical question
- similarity score
- would-hit / would-miss
- human or automated correctness sample
Tune thresholds before users see anything.
Week 4: serve low-risk classes
Turn on semantic hits for:
- public docs
- FAQ
- onboarding help
- stable troubleshooting
Keep policy, account, and regulated content in shadow until confidence is high.
Week 5+: add dynamic slots and runtime cache-aware routing
Now layer in:
- live fact filling
- cache-aware routing
- prompt/KV cache metrics
- background refresh
- automated canonical-intent review
At this point the system should be cheaper, faster, and less chaotic.
The answer to the original problem
To reduce the cost by roughly 60%, you do not need to make the model cheaper first. You need to stop calling it for the same intent 60,000 times a day.
The solution is:
- Cluster repeated questions into canonical intents.
- Build a semantic response cache for safe, repeated intents.
- Attach source, policy, tenant, locale, prompt, and model metadata to each cached answer.
- Validate every hit through a freshness gate.
- Fill dynamic facts from live tools after the cache hit.
- Store misses only after quality checks.
- Use prompt/KV caching and cache-aware routing for the remaining misses.
That architecture can get close to the 60% theoretical savings on the repeated part of the workload while avoiding the two classic cache disasters: wrong answers and stale answers.
The best cache is not the one with the highest hit rate. It is the one that knows when to say:
Not this time. The model needs to think again.
Sources worth reading
- Redis LangCache documentation for semantic response caching and cache-hit behavior.
- Redis semantic caching guide for embedding search, similarity thresholds, and precision-recall tradeoffs.
- vLLM Automatic Prefix Caching for KV reuse on shared prompt prefixes and its limits.
- NVIDIA Dynamo KV-aware router guide for routing based on KV overlap and load.
- TensorRT-LLM KV Cache System for KV reuse, offloading, and prioritized eviction.
- OpenAI Prompt Caching announcement for provider-side input-prefix caching economics and cache lifetime behavior.
- SGLang RadixAttention docs for prefix-tree based KV reuse across shared contexts.
