Reduce LLM Inference Cost by 60% Without Serving Stale Answers

Here is a very real production shape:

100,000 LLM queries / day
$0.40 average cost / query
= $40,000 / day

60,000 queries are slight variations of the same 200 questions

If every one of those requests goes to the model, the system is not being thoughtful. It is paying a very expensive intern to re-type the FAQ all day.

The tempting answer is “just cache it.” That is also how you accidentally serve yesterday’s policy, a stale price, a hallucinated product detail, or a response that was correct for one tenant but wrong for another.

The right answer is not a dumb cache. It is a freshness-aware semantic answer system:

  • semantic response cache for repeated questions written in different words
  • freshness contract attached to every cached answer
  • deterministic invalidation for source changes, policy changes, and tenant permissions
  • prompt/KV caching underneath for requests that still need model generation
  • cache-aware routing so misses are still cheaper and faster than random placement

That sounds like a lot. It is. But the shape is clean once you separate the layers.

The math: 60% is the ceiling, not the promise

If 60,000 of 100,000 daily requests are variants of the same 200 questions, the absolute best response-cache result is close to 60% model-call avoidance. Not exactly 60%, because you still need to generate or refresh canonical answers.

If those 200 canonical answers refresh once per day:

Before:
100,000 model calls x $0.40 = $40,000 / day

After:
40,000 unique calls
+ 200 canonical refresh calls
= 40,200 model calls

40,200 x $0.40 = $16,080 / day

Savings:
$23,920 / day
= 59.8% before cache infrastructure cost

If the same 200 answers refresh hourly:

40,000 unique calls
+ (200 canonical answers x 24 refreshes)
= 44,800 model calls

44,800 x $0.40 = $17,920 / day

Savings:
55.2% before cache infrastructure cost

That is still serious money. The point is not that every workload gets 60%. The point is that repeated intent changes the economics from “generate every time” to “generate when truth changes.”

Figure: The repeated-intent math. Before: 100K model calls/day at $0.40 per query, $40K/day. Finding repetition: 60K queries map to 200 canonical questions. After: roughly 40.2K model calls, about $16.1K/day with daily refresh. Most of the savings comes from avoiding repeated generation, not from squeezing pennies out of the same generation path.
The first-order savings come from avoiding duplicate generations. Runtime cache reuse is the second-order optimization.

Do not confuse the four caches

People say “cache” and then mix four different ideas:

Cache                   | What it saves               | What it does not save
Exact response cache    | Identical request text      | Paraphrases
Semantic response cache | Similar user intent         | Unsafe or stale matches
Retrieval/tool cache    | Repeated context gathering  | Final generation
Prompt/KV cache         | Recomputing shared prefixes | Output-token generation

You need all four in serious systems, but they sit at different layers.

Semantic response caching is the big lever for the problem above. Redis describes this pattern as storing and reusing previous LLM responses for repeated queries, where paraphrases like “features of Product A” and “main features of Product A” can reuse the same answer when similarity is high enough. That is response-level avoidance.

Prompt and KV caching are different. OpenAI’s prompt caching discounts and speeds up reused input-token prefixes. vLLM Automatic Prefix Caching reuses KV cache when requests share a prefix, which helps prefill. TensorRT-LLM’s KV cache supports reuse across requests, offloading, and prioritized eviction. Dynamo’s router can use KV overlap and load to route requests toward workers with useful cached blocks.

Those runtime features are excellent. They just do not eliminate the whole generation for a paraphrased question. They make the miss path cheaper. The response cache removes the model call entirely when it is safe.

Figure: One user request, four chances to avoid waste. Exact cache (same normalized input, hash hit), semantic cache (same intent in new words, vector hit, the largest cost lever), context cache (RAG, tools, policy), and prompt/KV cache (shared token prefixes, prefill hit). The response cache saves the most; the runtime caches make the remaining misses faster.
A production design should use response caching above the model and KV-aware routing below the gateway.

The production architecture

The architecture I would build has seven decisions in the request path.

1. Normalize the request

Normalize before lookup:

  • trim whitespace
  • canonicalize casing where safe
  • remove tracking noise
  • normalize product names and aliases
  • preserve locale, tenant, role, and entitlement data separately
  • keep the original text for observability and user experience

Do not over-normalize. “How do I reset my password?” and “Reset the password for user Alice” are not the same operation. Normalization should reduce noise, not erase intent.
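
A minimal normalization sketch in Python; the alias map and helper name are illustrative, not from any specific library:

import re

# Illustrative alias map; in practice this comes from your product catalog.
PRODUCT_ALIASES = {"product-a": "product a", "prod a": "product a"}

def normalize_question(raw: str) -> str:
    # Collapse whitespace and lowercase for matching; keep the original text elsewhere for UX and logs.
    text = re.sub(r"\s+", " ", raw).strip().lower()
    # Strip obvious tracking noise pasted in from URLs.
    text = re.sub(r"utm_[a-z]+=\S+", "", text).strip()
    # Map known product aliases to one canonical spelling.
    for alias, canonical in PRODUCT_ALIASES.items():
        text = text.replace(alias, canonical)
    return text

# Tenant, locale, role, and entitlements travel as separate fields, never folded into the text.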

2. Classify cache eligibility

Not every question deserves semantic caching.

Good candidates:

  • product FAQ
  • documentation Q&A
  • internal policy explanations
  • “how do I” help
  • troubleshooting steps with stable source docs
  • repeated onboarding questions

Bad candidates:

  • user-specific account state
  • legal, medical, or financial advice without review gates
  • live inventory, price, or incident status
  • queries containing secrets or sensitive personal data
  • long reasoning tasks where the wording changes the answer
  • anything requiring fresh tool execution

The cache should have a strong “no” muscle. A false miss costs money. A false hit costs trust.
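
A rule-based sketch of that eligibility decision, assuming an upstream intent classifier produces class labels like the ones below (the labels themselves are illustrative):

CACHEABLE_CLASSES = {"faq", "docs_qa", "policy_explainer", "how_to", "troubleshooting", "onboarding"}
NEVER_CACHE_CLASSES = {"account_state", "live_status", "regulated_advice", "needs_tool_run"}

def cache_eligible(question_class: str, contains_sensitive_data: bool) -> bool:
    # Default to "no": a false miss costs money, a false hit costs trust.
    if contains_sensitive_data:
        return False
    if question_class in NEVER_CACHE_CLASSES:
        return False
    return question_class in CACHEABLE_CLASSES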

3. Search exact cache first

Exact cache is cheap and safe. Use a key like:

sha256(
  model_family
  + prompt_template_version
  + tenant_id
  + locale
  + normalized_question
  + entitlement_scope
)

If it hits and the freshness contract is still valid, return it. No vector search needed.
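
In Python, that key could be computed like this; the field separator and hashing details are one reasonable choice, not a standard:

import hashlib

def exact_cache_key(model_family, prompt_template_version, tenant_id,
                    locale, normalized_question, entitlement_scope):
    # Join with a separator so adjacent fields cannot collide ("ab" + "c" vs "a" + "bc").
    material = "\x1f".join([
        model_family, prompt_template_version, tenant_id,
        locale, normalized_question, entitlement_scope,
    ])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()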

4. Search semantic cache next

On an exact miss:

  1. embed the normalized question
  2. search nearest cached questions
  3. apply a high similarity threshold
  4. optionally rerank near-threshold candidates
  5. verify tenant, locale, source version, policy version, and answer type

Redis’s semantic caching guide describes this shape: convert query to vector, run similarity search, and return a cached answer only when the similarity score exceeds the chosen threshold. It also calls out the precision-recall tradeoff: lower thresholds improve hit rate but increase wrong-answer risk.

I would start conservative:

Question class              | Starting threshold         | Why
FAQ / docs                  | 0.90-0.95                  | High precision, easy wins
Support troubleshooting     | 0.88-0.93                  | Similar wording often maps well
Policy explanation          | 0.92-0.97                  | Stale or wrong policy is painful
User-specific task          | disabled                   | Run the tool
Compliance-sensitive answer | disabled or human-reviewed | Do not gamble

This is not a universal table. It is a safe starting posture. Your telemetry should tune it.
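
A sketch of the semantic lookup, assuming an embed() function and a vector index with a search() method; both are stand-ins for whatever embedding model and vector store you run:

# Starting thresholds from the table above; telemetry should tune these per class.
CLASS_THRESHOLDS = {"faq": 0.93, "troubleshooting": 0.90, "policy": 0.95}

def semantic_lookup(normalized_question, question_class, scope, index, embed):
    threshold = CLASS_THRESHOLDS.get(question_class)
    if threshold is None:
        return None  # class not enabled for semantic caching
    vector = embed(normalized_question)
    for candidate in index.search(vector, top_k=5):
        if candidate.similarity < threshold:
            continue
        # Similarity is necessary but not sufficient: scope still has to match.
        if candidate.tenant_scope == scope.tenant_scope and candidate.locale == scope.locale:
            return candidate
    return None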

5. Run the freshness gate

The cached answer is not just text. It is an object with a contract:

{
  "answer_id": "faq.billing.refunds.v14",
  "canonical_question": "How do refunds work?",
  "answer_text": "...",
  "source_ids": ["docs/billing/refunds.md"],
  "source_versions": ["sha256:7f3a..."],
  "prompt_template_version": "support-answer-v8",
  "model_family": "llama-3.1-70b",
  "tenant_scope": "public-docs",
  "locale": "en-US",
  "freshness_class": "policy",
  "expires_at": "2026-05-05T18:30:00Z",
  "last_verified_at": "2026-05-05T09:30:00Z"
}

The freshness gate checks:

  • source document hash still matches
  • policy version still matches
  • tenant permissions still match
  • locale still matches
  • model/prompt version is still allowed
  • answer has not expired
  • no red-team or quality rule has blocked this answer family

If any check fails, it is a cache miss. Smile politely and generate a new answer.
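
A sketch of the freshness gate over the cached-answer object above; helpers such as current_source_hash() and the blocked-family set are hypothetical inputs you would wire up:

from datetime import datetime, timezone

def freshness_gate(entry, scope, current_source_hash, allowed_prompt_versions, blocked_families):
    # Any failed check is a cache miss; the caller falls through to generation.
    now = datetime.now(timezone.utc)
    expires = datetime.fromisoformat(entry["expires_at"].replace("Z", "+00:00"))
    if now >= expires:
        return False
    if entry["tenant_scope"] != scope.tenant_scope or entry["locale"] != scope.locale:
        return False
    if entry["prompt_template_version"] not in allowed_prompt_versions:
        return False
    if entry["answer_id"] in blocked_families:
        return False
    # The source documents the answer was grounded on must still hash to the same versions.
    return all(current_source_hash(source_id) == version
               for source_id, version in zip(entry["source_ids"], entry["source_versions"]))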

6. Compose dynamic slots late

This is how you avoid the “cached and stale” smell.

Do not cache:

"Your current balance is $184.22 and your renewal date is June 8."

Cache:

Answer plan:
- explain how billing renewal works
- include account balance from billing API
- include renewal date from subscription API
- link to billing settings

Then fill the dynamic slots from live systems after the cache hit.

This keeps the expensive language work cached while still making the answer feel current. The model wrote the stable explanation once. Your application fills the facts every time.
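
A sketch of late slot filling; the billing and subscription clients are hypothetical:

def fill_dynamic_slots(answer_plan, user_id, billing_api, subscription_api):
    # The cached part is the stable explanation; the facts come from live systems on every request.
    facts = {
        "balance": billing_api.current_balance(user_id),
        "renewal_date": subscription_api.renewal_date(user_id),
    }
    return answer_plan["template"].format(**facts)

# answer_plan["template"] might read:
# "Here is how renewal billing works... Your current balance is {balance} and your renewal date is {renewal_date}."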

7. Store misses only after quality checks

A miss should not blindly populate the cache. Store only when:

  • generation completed successfully
  • answer passed policy and quality checks
  • sources are known
  • freshness class is assigned
  • tenant scope is safe
  • output is not too personalized
  • the route is worth caching

If a one-off weird request misses, do not memorialize it forever. Caches should have taste.
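
A sketch of the store-on-miss gate; the checks dict stands in for your own policy, quality, and PII tooling:

REQUIRED_CHECKS = [
    "generation_ok", "policy_ok", "quality_ok", "sources_known",
    "freshness_class_assigned", "tenant_scope_safe", "not_personalized", "route_worth_caching",
]

def maybe_store(cache, entry, checks):
    # Every condition defaults to "do not store": one weird request should not become product truth.
    if all(checks.get(name, False) for name in REQUIRED_CHECKS):
        cache.put(entry)
        return True
    return False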

Figure: The safe cache-hit path. A request flows through normalization, eligibility, exact cache, semantic cache, the freshness gate, dynamic slot filling, and model generation on a miss, with prompt/KV caching and cache-aware routing underneath. A semantic match is only the first yes; freshness, permission, and dynamic facts still have to pass.
Similarity alone is not enough. The cache hit has to survive the freshness and permission gates.

The freshness contract

The trick is not “cache for 24 hours.” The trick is “cache according to the volatility of truth.”

Freshness class  | Example                          | Cache strategy
Static           | "How do I enable SSO?"           | Long TTL, invalidate on doc update
Slow policy      | "What is the refund policy?"     | Medium TTL, invalidate on policy version
Dynamic slots    | "When does my plan renew?"       | Cache explanation, fill account data live
Live operational | "Is service X down?"             | Do not cache final answer; cache tool result briefly
Regulated        | "Can I do X with customer data?" | Short TTL, strict source pinning, optional review

This is where many systems fail. They cache text but not truth. The cached answer needs to know why it was true.
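
The table above could live in code as a small policy map; the TTLs here are illustrative starting points, not recommendations:

FRESHNESS_POLICY = {
    "static":      {"ttl_seconds": 7 * 24 * 3600, "invalidate_on": ["doc_update"]},
    "policy":      {"ttl_seconds": 24 * 3600,     "invalidate_on": ["policy_version"]},
    "dynamic":     {"ttl_seconds": 24 * 3600,     "invalidate_on": ["doc_update"], "fill_slots_live": True},
    "operational": {"ttl_seconds": 0,             "cache_tool_result_seconds": 60},
    "regulated":   {"ttl_seconds": 3600,          "invalidate_on": ["policy_version", "source_pin"], "review_required": True},
}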

Figure: Cache the answer with its proof of freshness. The cached answer (text, citations, answer plan) carries source versions, tenant scope, policy version, prompt version, expiry, and quality status, and is safe to serve only while that contract holds.
The answer should carry enough metadata for the gateway to prove it is still safe to serve.

How this reaches 60% without feeling cached

The user should not feel like they got a stale canned response. They should feel like the system was fast and competent.

That requires three design choices.

Cache canonical answers, not user phrasing

The user asks:

"Can I get money back if I cancel after two weeks?"

The canonical question might be:

"What is the refund policy after cancellation?"

Store the answer under the canonical intent. Keep the user’s original wording for logs and analytics, but do not make every phrasing a new product truth.

Return stable explanation, fill live facts

For policy, docs, and help, a cached explanation is fine.

For account-specific answers, split the response:

cached:
  explanation of how renewals work

live:
  user's renewal date
  current plan
  current balance
  eligible actions

This is the difference between “cached answer” and “cached brain with live eyes.”

Use stale-while-revalidate for popular intents

For the top 200 questions, do not wait for users to discover staleness.

Run a background job:

  • refresh popular intents on schedule
  • refresh immediately when source docs change
  • compare new answer to old answer
  • alert when the answer changed materially
  • keep the old answer only if its freshness contract still passes

For slow-changing docs, this keeps latency low. For policy changes, it gives you a controlled rollout instead of surprise.
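
A sketch of that background refresh loop, assuming a catalog of canonical intents, a diff_materially() helper, and an alerting hook you would supply:

def refresh_canonical_intents(catalog, cache, generate, diff_materially, alert):
    for intent in catalog.top_intents():
        old = cache.get(intent.answer_id)
        new = generate(intent.canonical_question)  # runs through the normal quality and policy gates
        if old and diff_materially(old["answer_text"], new["answer_text"]):
            alert(f"Canonical answer changed materially: {intent.answer_id}")
        if new["quality_status"] == "approved":
            cache.put(new)  # controlled rollout instead of surprise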

Where NVIDIA-style runtime caching fits

The response cache is above the model. It decides whether the model should be called at all.

If the request misses, the runtime layer still matters a lot:

  • vLLM Automatic Prefix Caching can reuse KV cache for shared prompt prefixes and avoid recomputing the shared prefill.
  • SGLang RadixAttention organizes common prefixes in a radix tree so requests with shared context can reuse KV state.
  • TensorRT-LLM supports KV reuse across requests, plus tools such as offloading and prioritized eviction.
  • Dynamo adds KV-aware routing and distributed cache coordination, so misses can be routed toward workers with useful cache overlap while still respecting load.

This is the hierarchy I like:

semantic cache hit:
  skip model generation

semantic cache miss:
  use prompt/KV caching
  use KV-aware routing
  use batching and good inference kernels

That is a strong systems story. The gateway saves the obvious money. The inference runtime saves the remaining money. The GPU still does the hard work, but it stops doing the same hard work for no reason.

A practical scoring function

Semantic cache hit decisions should be boring and auditable:

eligible =
  cache_allowed
  and tenant_match
  and locale_match
  and source_versions_current
  and policy_version_current
  and similarity_score >= threshold
  and answer_type_safe
  and quality_status == "approved"

For near-threshold matches, use a second check:

if 0.86 <= similarity_score < 0.92:
    run lightweight reranker or verifier
elif similarity_score >= 0.92:
    serve if freshness passes
else:
    miss

Do not use one global threshold. A product FAQ and a compliance answer do not deserve the same risk setting.

What the API could look like

The gateway API can keep this simple:

{
  "query": "Can I get a refund if I cancel next week?",
  "tenant": "public",
  "locale": "en-US",
  "user_scope": ["docs:read"],
  "cache_policy": {
    "semantic": true,
    "min_similarity": 0.93,
    "allow_dynamic_slots": true,
    "max_staleness_seconds": 3600
  }
}

Response on hit:

{
  "answer": "Refund eligibility depends on your plan and cancellation window...",
  "cache": {
    "status": "semantic_hit",
    "similarity": 0.96,
    "canonical_question": "What is the refund policy after cancellation?",
    "freshness": "valid",
    "source_versions": ["sha256:7f3a..."]
  }
}

Response on miss:

{
  "answer": "Refund eligibility depends on your plan and cancellation window...",
  "cache": {
    "status": "miss",
    "stored": true,
    "freshness_class": "policy"
  }
}

Expose the cache metadata internally, not necessarily to end users. Operators need to know why an answer was served.
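
Tying the layers together, the gateway decision path could look like this; deps bundles the pieces sketched earlier (normalizer, classifier, caches, freshness inputs, slot filler, model client), all of them stand-ins:

def handle(request, deps):
    question = deps.normalize(request["query"])
    scope = deps.extract_scope(request)  # tenant, locale, entitlements
    if not deps.cache_eligible(question, scope):
        return deps.generate_and_maybe_store(request, scope)
    hit = deps.exact_lookup(question, scope) or deps.semantic_lookup(question, scope)
    if hit and deps.freshness_gate(hit, scope):
        return deps.fill_dynamic_slots(hit, request)
    return deps.generate_and_maybe_store(request, scope)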

The dashboard I would build

If this system is saving tens of thousands of dollars a day, the dashboard should not be decorative.

Track:

  • request count
  • exact hit rate
  • semantic hit rate
  • miss rate
  • false-positive reports
  • stale-blocked hits
  • freshness-gate failures by reason
  • model calls avoided
  • estimated dollars avoided
  • p50 / p95 latency by hit type
  • answer refresh count
  • cache entries by freshness class
  • top canonical intents
  • cache hit rate by tenant and locale
  • post-hit user correction rate

The most important chart is not hit rate. It is safe hit rate:

safe hit rate =
  cache hits that passed freshness and quality gates
  / total requests

A high hit rate with wrong answers is not optimization. It is a very fast apology generator.

Failure modes to design for

False semantic match

Two questions look similar but need different answers.

Mitigation:

  • higher threshold
  • class-specific threshold
  • rerank near threshold
  • exact entity matching
  • user feedback loop
  • do not cache high-risk classes

Stale source

The source document changed after the answer was cached.

Mitigation:

  • source hash in cache record
  • event-driven invalidation
  • background refresh
  • stale-while-revalidate only for low-risk content

Tenant leakage

A cached answer from one customer is served to another.

Mitigation:

  • tenant in cache key
  • entitlement scope in cache key
  • no cross-tenant semantic cache unless content is explicitly public

Personalized answer cached as generic

The answer includes a user’s account state and gets reused.

Mitigation:

  • PII detection before store
  • dynamic slot pattern
  • answer-type classifier
  • deny-list for tool-derived personal data

Cache poisoning

Bad or malicious output gets cached and amplified.

Mitigation:

  • store only after safety and quality checks
  • approval gate for high-traffic canonical answers
  • signed source versions
  • admin purge by canonical intent

Rollout plan

Do not launch this by caching everything.

Week 1: measure repetition

Cluster requests offline:

  • normalize questions
  • embed them
  • find top repeated intents
  • estimate potential savings
  • label risky categories

You want to know whether the “same 200 questions” claim is true in your logs. If it is, the project funds itself.
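
A sketch of that offline measurement, assuming an embed() function and using scikit-learn's DBSCAN for clustering (any clustering method works; the eps value is a guess to tune):

import numpy as np
from collections import Counter
from sklearn.cluster import DBSCAN

def measure_repetition(normalized_questions, embed, cost_per_query=0.40):
    # Dense clusters of similar questions are repeated intents; label -1 means "unique".
    vectors = np.array([embed(q) for q in normalized_questions])
    labels = DBSCAN(eps=0.08, min_samples=5, metric="cosine").fit_predict(vectors)
    cluster_sizes = Counter(label for label in labels if label != -1)
    repeated_requests = sum(cluster_sizes.values())
    # Savings ceiling: each cluster still needs one canonical generation.
    avoidable_calls = repeated_requests - len(cluster_sizes)
    return {
        "repeated_intents": len(cluster_sizes),
        "repeated_requests": repeated_requests,
        "estimated_daily_savings": avoidable_calls * cost_per_query,
    }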

Week 2: exact cache and canonical catalog

Create the canonical question catalog:

  • top intents
  • approved answers
  • source docs
  • freshness class
  • owner
  • TTL

Start exact caching first. It is boring, which is perfect.

Week 3: semantic cache in shadow mode

Run semantic lookup but do not serve it yet.

For every request, record:

  • nearest canonical question
  • similarity score
  • would-hit / would-miss
  • human or automated correctness sample

Tune thresholds before users see anything.

Week 4: serve low-risk classes

Turn on semantic hits for:

  • public docs
  • FAQ
  • onboarding help
  • stable troubleshooting

Keep policy, account, and regulated content in shadow until confidence is high.

Week 5+: add dynamic slots and runtime cache-aware routing

Now layer in:

  • live fact filling
  • cache-aware routing
  • prompt/KV cache metrics
  • background refresh
  • automated canonical-intent review

At this point the system should be cheaper, faster, and less chaotic.

The answer to the original problem

To reduce the cost by roughly 60%, you do not need to make the model cheaper first. You need to stop calling it for the same intent 60,000 times a day.

The solution is:

  1. Cluster repeated questions into canonical intents.
  2. Build a semantic response cache for safe, repeated intents.
  3. Attach source, policy, tenant, locale, prompt, and model metadata to each cached answer.
  4. Validate every hit through a freshness gate.
  5. Fill dynamic facts from live tools after the cache hit.
  6. Store misses only after quality checks.
  7. Use prompt/KV caching and cache-aware routing for the remaining misses.

That architecture can get close to the 60% theoretical savings on the repeated part of the workload while avoiding the two classic cache disasters: wrong answers and stale answers.

The best cache is not the one with the highest hit rate. It is the one that knows when to say:

Not this time. The model needs to think again.

Sources worth reading