Reduce LLM Inference Cost by 60% Without Serving Stale Answers
Here is a very real production shape:
100,000 LLM queries / day
$0.40 average cost / query
= $40,000 / day
60,000 queries are slight variations of the same 200 questions
If every one of those requests goes to the model, the system is not being thoughtful. It is paying a very expensive intern to re-type the FAQ all day.
The tempting answer is “just cache it.” That is also how you accidentally serve yesterday’s policy, a stale price, a hallucinated product detail, or a response that was correct for one tenant but wrong for another.
The right answer is not a dumb cache. It is a freshness-aware semantic answer system:
- semantic response cache for repeated questions written in different words
- freshness contract attached to every cached answer
- deterministic invalidation for source changes, policy changes, and tenant permissions
- prompt/KV caching underneath for requests that still need model generation
- cache-aware routing so misses are still cheaper and faster than random placement
That sounds like a lot. It is. But the shape is clean once you separate the layers.
The math: 60% is the ceiling, not the promise
If 60,000 of 100,000 daily requests are variants of the same 200 questions, the absolute best response-cache result is close to 60% model-call avoidance. Not exactly 60%, because you still need to generate or refresh canonical answers.
If those 200 canonical answers refresh once per day:
Before:
100,000 model calls x $0.40 = $40,000 / day
After:
40,000 unique calls
+ 200 canonical refresh calls
= 40,200 model calls
40,200 x $0.40 = $16,080 / day
Savings:
$23,920 / day
= 59.8% before cache infrastructure cost
If the same 200 answers refresh hourly:
40,000 unique calls
+ (200 canonical answers x 24 refreshes)
= 44,800 model calls
44,800 x $0.40 = $17,920 / day
Savings:
= 55.2% before cache infrastructure cost
That is still serious money. The point is not that every workload gets 60%. The point is that repeated intent changes the economics from “generate every time” to “generate when truth changes.”
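If you want to sanity-check that ceiling against your own traffic, the arithmetic fits in a few lines. A minimal sketch in Python, with every input treated as an assumption you replace with your own numbers:

```python
# Back-of-the-envelope savings model. It only reproduces the arithmetic above;
# the traffic volumes, intent count, and cost per call are all assumptions.
def daily_savings(total_queries: int, repeated_queries: int, canonical_intents: int,
                  cost_per_call: float, refreshes_per_day: int) -> dict:
    unique_calls = total_queries - repeated_queries           # still need the model
    refresh_calls = canonical_intents * refreshes_per_day     # keep canonical answers fresh
    before = total_queries * cost_per_call
    after = (unique_calls + refresh_calls) * cost_per_call
    return {"before": before, "after": after,
            "savings": before - after,
            "savings_pct": round(100 * (before - after) / before, 1)}

print(daily_savings(100_000, 60_000, 200, 0.40, refreshes_per_day=1))   # ~59.8%
print(daily_savings(100_000, 60_000, 200, 0.40, refreshes_per_day=24))  # ~55.2%
```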
Do not confuse the four caches
People say “cache” and then mix four different ideas:
| Cache | What it saves | What it does not save |
|---|---|---|
| Exact response cache | Identical request text | Paraphrases |
| Semantic response cache | Similar user intent | Unsafe or stale matches |
| Retrieval/tool cache | Repeated context gathering | Final generation |
| Prompt/KV cache | Recomputing shared prefixes | Output-token generation |
You need all four in serious systems, but they sit at different layers.
Semantic response caching is the big lever for the problem above. Redis describes this pattern as storing and reusing previous LLM responses for repeated queries, where paraphrases like “features of Product A” and “main features of Product A” can reuse the same answer when similarity is high enough. That is response-level avoidance.
Prompt and KV caching are different. OpenAI’s prompt caching discounts and speeds up reused input-token prefixes. vLLM Automatic Prefix Caching reuses KV cache when requests share a prefix, which helps prefill. TensorRT-LLM’s KV cache supports reuse across requests, offloading, and prioritized eviction. Dynamo’s router can use KV overlap and load to route requests toward workers with useful cached blocks.
Those runtime features are excellent. They just do not eliminate the whole generation for a paraphrased question. They make the miss path cheaper. The response cache removes the model call entirely when it is safe.
The production architecture
The architecture I would build has seven decisions in the request path.
1. Normalize the request
Normalize before lookup:
- trim whitespace
- canonicalize casing where safe
- remove tracking noise
- normalize product names and aliases
- preserve locale, tenant, role, and entitlement data separately
- keep the original text for observability and user experience
Do not over-normalize. “How do I reset my password?” and “Reset the password for user Alice” are not the same operation. Normalization should reduce noise, not erase intent.
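A minimal normalization sketch, assuming a hypothetical product-alias map; tenant, locale, role, and entitlements stay as separate fields, and the original wording is kept for observability:

```python
import re

# Hypothetical alias map; in practice this comes from your product catalog.
PRODUCT_ALIASES = {"acme-pro": "acme pro", "acmepro": "acme pro"}

def normalize_question(raw: str) -> str:
    text = raw.strip().lower()                        # casing is safe to fold for FAQ-style text
    text = re.sub(r"\s+", " ", text)                  # collapse whitespace
    text = re.sub(r"[?!.]+$", "", text)               # drop trailing punctuation runs
    for alias, canonical in PRODUCT_ALIASES.items():  # map product-name aliases
        text = text.replace(alias, canonical)
    return text

# Locale, tenant, role, and entitlements travel alongside the text, never inside it,
# and the original wording is kept for logs and UX.
request = {
    "original": "  How do I reset my  password?? ",
    "normalized": normalize_question("  How do I reset my  password?? "),
    "tenant": "acme", "locale": "en-US", "scopes": ["docs:read"],
}
```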
2. Classify cache eligibility
Not every question deserves semantic caching.
Good candidates:
- product FAQ
- documentation Q&A
- internal policy explanations
- “how do I” help
- troubleshooting steps with stable source docs
- repeated onboarding questions
Bad candidates:
- user-specific account state
- legal, medical, or financial advice without review gates
- live inventory, price, or incident status
- queries containing secrets or sensitive personal data
- long reasoning tasks where the wording changes the answer
- anything requiring fresh tool execution
The cache should have a strong “no” muscle. A false miss costs money. A false hit costs trust.
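A sketch of that “no” muscle, assuming an upstream intent classifier and illustrative category names; anything unknown, tool-dependent, or carrying secrets falls through to not-cacheable:

```python
import re

# Hypothetical intent categories produced by an upstream classifier.
CACHEABLE_CATEGORIES = {"faq", "docs_qa", "policy_explainer", "how_to", "troubleshooting"}
BLOCKED_CATEGORIES = {"account_state", "live_status", "legal_advice", "long_reasoning", "tool_required"}

SECRET_PATTERN = re.compile(r"(api[_-]?key|password|ssn|credit card)", re.IGNORECASE)

def cache_eligible(category: str, question: str) -> bool:
    if category in BLOCKED_CATEGORIES:
        return False
    if SECRET_PATTERN.search(question):      # never cache questions carrying secrets or PII
        return False
    # Default to "no": unknown categories stay uncached until someone reviews them.
    return category in CACHEABLE_CATEGORIES
```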
3. Search exact cache first
Exact cache is cheap and safe. Use a key like:
sha256(
model_family
+ prompt_template_version
+ tenant_id
+ locale
+ normalized_question
+ entitlement_scope
)
If it hits and the freshness contract is still valid, return it. No vector search needed.
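A direct translation of that key into Python; the field names follow the sketch above, and the separator byte is just a collision guard:

```python
import hashlib

def exact_cache_key(model_family: str, prompt_template_version: str, tenant_id: str,
                    locale: str, normalized_question: str, entitlement_scope: str) -> str:
    # "\x1f" separates fields so "ab" + "c" and "a" + "bc" cannot collide.
    material = "\x1f".join([model_family, prompt_template_version, tenant_id,
                            locale, normalized_question, entitlement_scope])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

key = exact_cache_key("llama-3.1-70b", "support-answer-v8", "public",
                      "en-US", "what is the refund policy after cancellation", "docs:read")
```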
4. Search semantic cache next
On an exact miss:
- embed the normalized question
- search nearest cached questions
- apply a high similarity threshold
- optionally rerank near-threshold candidates
- verify tenant, locale, source version, policy version, and answer type
Redis’s semantic caching guide describes this shape: convert query to vector, run similarity search, and return a cached answer only when the similarity score exceeds the chosen threshold. It also calls out the precision-recall tradeoff: lower thresholds improve hit rate but increase wrong-answer risk.
I would start conservative:
| Question class | Starting threshold | Why |
|---|---|---|
| FAQ / docs | 0.90-0.95 | High precision, easy wins |
| Support troubleshooting | 0.88-0.93 | Similar wording often maps well |
| Policy explanation | 0.92-0.97 | Stale or wrong policy is painful |
| User-specific task | disabled | Run the tool |
| Compliance-sensitive answer | disabled or human-reviewed | Do not gamble |
This is not a universal table. It is a safe starting posture. Your telemetry should tune it.
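A minimal in-memory sketch of the semantic lookup, assuming you already have an embedding for the normalized question and a small list of cached entries; a production system would use a vector index, but the decision logic is the same:

```python
import numpy as np

# Class-specific starting thresholds, mirroring the table above; tune from telemetry.
THRESHOLDS = {"faq": 0.92, "troubleshooting": 0.90, "policy": 0.95}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_lookup(query_vec: np.ndarray, entries: list, tenant: str,
                    locale: str, question_class: str):
    threshold = THRESHOLDS.get(question_class)
    if threshold is None:                          # class not enabled for semantic caching
        return None
    best, best_score = None, 0.0
    for entry in entries:                          # entry: dict with "embedding", scopes, answer
        if entry["tenant_scope"] != tenant or entry["locale"] != locale:
            continue                               # never match across tenant or locale
        score = cosine(query_vec, entry["embedding"])
        if score > best_score:
            best, best_score = entry, score
    if best is not None and best_score >= threshold:
        return best, best_score                    # still subject to the freshness gate below
    return None
```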
5. Run the freshness gate
The cached answer is not just text. It is an object with a contract:
{
"answer_id": "faq.billing.refunds.v14",
"canonical_question": "How do refunds work?",
"answer_text": "...",
"source_ids": ["docs/billing/refunds.md"],
"source_versions": ["sha256:7f3a..."],
"prompt_template_version": "support-answer-v8",
"model_family": "llama-3.1-70b",
"tenant_scope": "public-docs",
"locale": "en-US",
"freshness_class": "policy",
"expires_at": "2026-05-05T18:30:00Z",
"last_verified_at": "2026-05-05T09:30:00Z"
}
The freshness gate checks:
- source document hash still matches
- policy version still matches
- tenant permissions still match
- locale still matches
- model/prompt version is still allowed
- answer has not expired
- no red-team or quality rule has blocked this answer family
If any check fails, it is a cache miss. Smile politely and generate a new answer.
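A sketch of the gate, with field names following the contract object above plus a hypothetical policy_version field pinned at store time; any failed check is a miss:

```python
from datetime import datetime, timezone

def freshness_gate(entry: dict, ctx: dict) -> bool:
    """Return True only when every check passes; any failure is treated as a cache miss."""
    now = datetime.now(timezone.utc)
    expires_at = datetime.fromisoformat(entry["expires_at"].replace("Z", "+00:00"))
    checks = [
        ctx["source_versions"] == entry["source_versions"],             # source doc hashes unchanged
        ctx["policy_version"] == entry.get("policy_version"),           # assumed field, pinned at store time
        ctx["tenant_scope"] == entry["tenant_scope"],
        ctx["locale"] == entry["locale"],
        entry["prompt_template_version"] in ctx["allowed_prompt_versions"],
        entry["model_family"] in ctx["allowed_model_families"],
        now < expires_at,                                               # not expired
        entry["answer_id"] not in ctx["blocked_answer_ids"],            # no red-team or quality block
    ]
    return all(checks)
```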
6. Compose dynamic slots late
This is how you avoid the “cached and stale” smell.
Do not cache:
"Your current balance is $184.22 and your renewal date is June 8."Cache:
Answer plan:
- explain how billing renewal works
- include account balance from billing API
- include renewal date from subscription API
- link to billing settings
Then fill the dynamic slots from live systems after the cache hit.
This keeps the expensive language work cached while still making the answer feel current. The model wrote the stable explanation once. Your application fills the facts every time.
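A sketch of late slot filling, where `billing_api` and `subscription_api` stand in for your internal clients; the template text is the cached part, the slot values never are:

```python
# Cached: a stable explanation with named slots. Live: slot values fetched per request.
CACHED_PLAN = {
    "template": ("Renewals are charged automatically at the start of each billing cycle. "
                 "Your current balance is {balance} and your plan renews on {renewal_date}. "
                 "You can manage this in billing settings: {settings_url}"),
    "slots": ["balance", "renewal_date", "settings_url"],
}

def compose_answer(plan: dict, user_id: str, billing_api, subscription_api) -> str:
    live = {
        "balance": billing_api.get_balance(user_id),             # fetched at serve time
        "renewal_date": subscription_api.get_renewal(user_id),   # never stored in the cache
        "settings_url": "https://app.example.com/billing",
    }
    return plan["template"].format(**{slot: live[slot] for slot in plan["slots"]})
```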
7. Store misses only after quality checks
A miss should not blindly populate the cache. Store only when:
- generation completed successfully
- answer passed policy and quality checks
- sources are known
- freshness class is assigned
- tenant scope is safe
- output is not too personalized
- the route is worth caching
If a one-off weird request misses, do not memorialize it forever. Caches should have taste.
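A small write-gate sketch; every field is assumed to be set by upstream generation, safety, and classification steps:

```python
def should_store(result: dict) -> bool:
    """Gate writes into the semantic cache; all flags come from upstream checks."""
    return (
        result["generation_ok"]
        and result["quality_status"] == "approved"
        and bool(result["source_ids"])                 # sources are known
        and result["freshness_class"] is not None
        and result["tenant_scope_safe"]
        and not result["contains_personal_data"]       # use the dynamic-slot pattern instead
        and result["route_cache_worthy"]               # skip one-off weird requests
    )
```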
The freshness contract
The trick is not “cache for 24 hours.” The trick is “cache according to the volatility of truth.”
| Freshness class | Example | Cache strategy |
|---|---|---|
| Static | “How do I enable SSO?” | Long TTL, invalidate on doc update |
| Slow policy | “What is the refund policy?” | Medium TTL, invalidate on policy version |
| Dynamic slots | “When does my plan renew?” | Cache explanation, fill account data live |
| Live operational | “Is service X down?” | Do not cache final answer; cache tool result briefly |
| Regulated | “Can I do X with customer data?” | Short TTL, strict source pinning, optional review |
This is where many systems fail. They cache text but not truth. The cached answer needs to know why it was true.
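The same table as a configuration sketch, with illustrative TTLs and invalidation triggers rather than recommendations:

```python
# Illustrative freshness-class policies; tune per workload and per tenant.
FRESHNESS_POLICY = {
    "static":    {"ttl_seconds": 7 * 86400, "invalidate_on": ["doc_update"]},
    "policy":    {"ttl_seconds": 86400,     "invalidate_on": ["policy_version_change"]},
    "dynamic":   {"ttl_seconds": 86400,     "invalidate_on": ["doc_update"], "live_slots": True},
    "live":      {"ttl_seconds": 0,         "invalidate_on": [], "cache_final_answer": False},
    "regulated": {"ttl_seconds": 3600,      "invalidate_on": ["doc_update", "policy_version_change"],
                  "requires_review": True},
}
```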
How this reaches 60% without feeling cached
The user should not feel like they got a stale canned response. They should feel like the system was fast and competent.
That requires three design choices.
Cache canonical answers, not user phrasing
The user asks:
"Can I get money back if I cancel after two weeks?"The canonical question might be:
"What is the refund policy after cancellation?"Store the answer under the canonical intent. Keep the user’s original wording for logs and analytics, but do not make every phrasing a new product truth.
Return stable explanation, fill live facts
For policy, docs, and help, a cached explanation is fine.
For account-specific answers, split the response:
cached:
explanation of how renewals work
live:
user's renewal date
current plan
current balance
eligible actions
This is the difference between “cached answer” and “cached brain with live eyes.”
Use stale-while-revalidate for popular intents
For the top 200 questions, do not wait for users to discover staleness.
Run a background job:
- refresh popular intents on schedule
- refresh immediately when source docs change
- compare new answer to old answer
- alert when the answer changed materially
- keep the old answer only if its freshness contract still passes
For slow-changing docs, this keeps latency low. For policy changes, it gives you a controlled rollout instead of surprise.
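A sketch of the background refresher; `generate`, `cache`, and `notify` are placeholders for your model call, cache store, and alerting, and the drift threshold is an assumption to tune:

```python
import difflib

def refresh_canonical(intent, generate, cache, notify, drift_threshold: float = 0.85):
    """Refresh one canonical intent in the background instead of waiting for a user to hit staleness."""
    old = cache.get(intent.answer_id)                        # previously cached entry, or None
    new_text = generate(intent.canonical_question)           # fresh model call
    if old is not None:
        ratio = difflib.SequenceMatcher(None, old["answer_text"], new_text).ratio()
        if ratio < drift_threshold:                          # answer changed materially
            notify(f"{intent.answer_id} drifted: similarity {ratio:.2f} against previous answer")
    cache.put(intent.answer_id, {"answer_text": new_text})   # new entry replaces the old one
```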
Where NVIDIA-style runtime caching fits
The response cache is above the model. It decides whether the model should be called at all.
If the request misses, the runtime layer still matters a lot:
- vLLM Automatic Prefix Caching can reuse KV cache for shared prompt prefixes and avoid recomputing the shared prefill.
- SGLang RadixAttention organizes common prefixes in a radix tree so requests with shared context can reuse KV state.
- TensorRT-LLM supports KV reuse across requests, plus tools such as offloading and prioritized eviction.
- Dynamo adds KV-aware routing and distributed cache coordination, so misses can be routed toward workers with useful cache overlap while still respecting load.
This is the hierarchy I like:
semantic cache hit:
skip model generation
semantic cache miss:
use prompt/KV caching
use KV-aware routing
use batching and good inference kernels
That is a strong systems story. The gateway saves the obvious money. The inference runtime saves the remaining money. The GPU still does the hard work, but it stops doing the same hard work for no reason.
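Most of the miss-path wins are configuration rather than new code. As one example, vLLM exposes automatic prefix caching as an engine option; the model name here is illustrative, and the flag should be checked against your vLLM version:

```python
from vllm import LLM, SamplingParams

# Requests that share a system prompt / few-shot prefix reuse the cached KV blocks
# for that prefix, so only the unique suffix pays full prefill cost.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support assistant for Acme. Policies: ...\n\n"
outputs = llm.generate(
    [shared_prefix + "User: Can I get a refund if I cancel next week?",
     shared_prefix + "User: How do I enable SSO?"],
    SamplingParams(max_tokens=256),
)
```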
A practical scoring function
Semantic cache hit decisions should be boring and auditable:
eligible =
cache_allowed
and tenant_match
and locale_match
and source_versions_current
and policy_version_current
and similarity_score >= threshold
and answer_type_safe
and quality_status == "approved"
For near-threshold matches, use a second check:
if 0.86 <= similarity_score < 0.92:
run lightweight reranker or verifier
else if similarity_score >= 0.92:
serve if freshness passes
else:
miss
Do not use one global threshold. A product FAQ and a compliance answer do not deserve the same risk setting.
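A sketch of an auditable decision function combining the gates, class-specific thresholds, and the near-threshold verifier band; the numbers mirror the starting table above and are not recommendations:

```python
def cache_decision(candidate, similarity: float, gates: dict, question_class: str, reranker) -> str:
    """Hit/miss decision that stays boring and auditable."""
    # Class-specific serve thresholds; the verify band sits just below each one.
    serve_at = {"faq": 0.92, "troubleshooting": 0.90, "policy": 0.95}.get(question_class)
    if serve_at is None or not all(gates.values()):
        return "miss"                          # class disabled, or a tenant/locale/freshness gate failed
    if similarity >= serve_at:
        return "hit"
    if similarity >= serve_at - 0.06 and reranker(candidate):
        return "hit_verified"                  # near-threshold match confirmed by a lightweight verifier
    return "miss"
```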
What the API could look like
The gateway API can keep this simple:
{
"query": "Can I get a refund if I cancel next week?",
"tenant": "public",
"locale": "en-US",
"user_scope": ["docs:read"],
"cache_policy": {
"semantic": true,
"min_similarity": 0.93,
"allow_dynamic_slots": true,
"max_staleness_seconds": 3600
}
}
Response on hit:
{
"answer": "Refund eligibility depends on your plan and cancellation window...",
"cache": {
"status": "semantic_hit",
"similarity": 0.96,
"canonical_question": "What is the refund policy after cancellation?",
"freshness": "valid",
"source_versions": ["sha256:7f3a..."]
}
}
Response on miss:
{
"answer": "Refund eligibility depends on your plan and cancellation window...",
"cache": {
"status": "miss",
"stored": true,
"freshness_class": "policy"
}
}
Expose the cache metadata internally, not necessarily to end users. Operators need to know why an answer was served.
The dashboard I would build
If this system is saving tens of thousands of dollars a day, the dashboard should not be decorative.
Track:
- request count
- exact hit rate
- semantic hit rate
- miss rate
- false-positive reports
- stale-blocked hits
- freshness-gate failures by reason
- model calls avoided
- estimated dollars avoided
- p50 / p95 latency by hit type
- answer refresh count
- cache entries by freshness class
- top canonical intents
- cache hit rate by tenant and locale
- post-hit user correction rate
The most important chart is not hit rate. It is safe hit rate:
safe hit rate =
cache hits that passed freshness and quality gates
/ total requests
A high hit rate with wrong answers is not optimization. It is a very fast apology generator.
Failure modes to design for
False semantic match
Two questions look similar but need different answers.
Mitigation:
- higher threshold
- class-specific threshold
- rerank near threshold
- exact entity matching
- user feedback loop
- do not cache high-risk classes
Stale source
The source document changed after the answer was cached.
Mitigation:
- source hash in cache record
- event-driven invalidation
- background refresh
- stale-while-revalidate only for low-risk content
Tenant leakage
A cached answer from one customer is served to another.
Mitigation:
- tenant in cache key
- entitlement scope in cache key
- no cross-tenant semantic cache unless content is explicitly public
Personalized answer cached as generic
The answer includes a user’s account state and gets reused.
Mitigation:
- PII detection before store
- dynamic slot pattern
- answer-type classifier
- deny-list for tool-derived personal data
Cache poisoning
Bad or malicious output gets cached and amplified.
Mitigation:
- store only after safety and quality checks
- approval gate for high-traffic canonical answers
- signed source versions
- admin purge by canonical intent
Rollout plan
Do not launch this by caching everything.
Week 1: measure repetition
Cluster requests offline:
- normalize questions
- embed them
- find top repeated intents
- estimate potential savings
- label risky categories
You want to know whether the “same 200 questions” claim is true in your logs. If it is, the project funds itself.
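A rough offline clustering sketch for that measurement, assuming an `embed` function and a greedy centroid pass; it is deliberately crude, because all week 1 needs is the order of magnitude:

```python
import numpy as np

def top_repeated_intents(questions, embed, similarity: float = 0.92, cost_per_call: float = 0.40):
    """Greedy clustering of normalized questions by embedding similarity.
    `embed` is your embedding function; the threshold mirrors the cache's."""
    centroids, counts = [], []
    for vec in (embed(q) for q in questions):
        vec = vec / np.linalg.norm(vec)
        if centroids:
            sims = np.array(centroids) @ vec            # cosine similarity to existing clusters
            best = int(np.argmax(sims))
            if sims[best] >= similarity:
                counts[best] += 1
                continue
        centroids.append(vec)                           # new canonical intent
        counts.append(1)
    repeated = sum(c - 1 for c in counts if c > 1)      # calls a response cache could have avoided
    return {"intents": len(centroids), "avoidable_calls": repeated,
            "daily_savings_estimate": repeated * cost_per_call}
```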
Week 2: exact cache and canonical catalog
Create the canonical question catalog:
- top intents
- approved answers
- source docs
- freshness class
- owner
- TTL
Start exact caching first. It is boring, which is perfect.
Week 3: semantic cache in shadow mode
Run semantic lookup but do not serve it yet.
For every request, record:
- nearest canonical question
- similarity score
- would-hit / would-miss
- human or automated correctness sample
Tune thresholds before users see anything.
Week 4: serve low-risk classes
Turn on semantic hits for:
- public docs
- FAQ
- onboarding help
- stable troubleshooting
Keep policy, account, and regulated content in shadow until confidence is high.
Week 5+: add dynamic slots and runtime cache-aware routing
Now layer in:
- live fact filling
- cache-aware routing
- prompt/KV cache metrics
- background refresh
- automated canonical-intent review
At this point the system should be cheaper, faster, and less chaotic.
The answer to the original problem
To reduce the cost by roughly 60%, you do not need to make the model cheaper first. You need to stop calling it for the same intent 60,000 times a day.
The solution is:
- Cluster repeated questions into canonical intents.
- Build a semantic response cache for safe, repeated intents.
- Attach source, policy, tenant, locale, prompt, and model metadata to each cached answer.
- Validate every hit through a freshness gate.
- Fill dynamic facts from live tools after the cache hit.
- Store misses only after quality checks.
- Use prompt/KV caching and cache-aware routing for the remaining misses.
That architecture can get close to the 60% theoretical savings on the repeated part of the workload while avoiding the two classic cache disasters: wrong answers and stale answers.
The best cache is not the one with the highest hit rate. It is the one that knows when to say:
Not this time. The model needs to think again.
Sources worth reading
- Redis LangCache documentation for semantic response caching and cache-hit behavior.
- Redis semantic caching guide for embedding search, similarity thresholds, and precision-recall tradeoffs.
- vLLM Automatic Prefix Caching for KV reuse on shared prompt prefixes and its limits.
- NVIDIA Dynamo KV-aware router guide for routing based on KV overlap and load.
- TensorRT-LLM KV Cache System for KV reuse, offloading, and prioritized eviction.
- OpenAI Prompt Caching announcement for provider-side input-prefix caching economics and cache lifetime behavior.
- SGLang RadixAttention docs for prefix-tree based KV reuse across shared contexts.
