Production LLM Systems Tutorial 9: Cost Optimization

Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

LLM cost optimization is not one trick. It is a stack.

The biggest savings usually come from removing waste before tuning inference. Duplicate retries, oversized context, no routing, missing cache keys, and unbounded output tokens can cost more than the model choice itself.

This tutorial builds a cost-control system.

[Figure: LLM cost optimization stack with attribution, request guardrails, cache, routing, context budgeting, batch processing, and quantization.]
Cost optimization compounds when every request has attribution and every savings layer is measured.

Step 1: Attribute cost

Before optimizing, tag every request:

{
  "tenant_id": "tenant_a",
  "feature": "support_answer",
  "route": "rag_mid_model",
  "prompt_version": "support_v14",
  "model": "model_route_mid",
  "input_tokens": 1840,
  "output_tokens": 322,
  "cached_tokens": 0,
  "cost_usd": 0.0142,
  "success": true
}

Dashboards should show:

  • cost by tenant
  • cost by feature
  • cost by route
  • cost by prompt version
  • cost by model
  • cost per successful task
  • retry cost
  • cache savings
  • token budget violations

If you cannot answer “which feature spent the money,” you do not have FinOps. You have a bill.
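The attribution record above can be assembled at request time from a price table. A minimal sketch, assuming hypothetical per-million-token prices (real prices come from your provider's rate card, which is why this example does not reproduce the $0.0142 figure above):

```python
# Hypothetical per-million-token prices; placeholders, not real rates.
PRICE_PER_M = {
    "model_route_mid": {"input": 3.00, "output": 12.00},
}

def attribute(tenant_id, feature, route, prompt_version, model,
              input_tokens, output_tokens, cached_tokens, success):
    """Build the per-request attribution record, computing cost_usd
    from token counts and the model's price table."""
    p = PRICE_PER_M[model]
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return {
        "tenant_id": tenant_id,
        "feature": feature,
        "route": route,
        "prompt_version": prompt_version,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cached_tokens": cached_tokens,
        "cost_usd": round(cost, 4),
        "success": success,
    }
```

Every dashboard cut listed above is then a group-by over these records.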

Step 2: Stop duplicate generation

Duplicate retries are silent cost leaks.

Use:

  • idempotency keys
  • request deduplication
  • stream resume where possible
  • retry budgets
  • circuit breakers
  • cached final result by request id

When a client times out after 15 seconds and retries, the backend should not blindly generate again. It should check whether the first request is still running or already completed.
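The "check before regenerating" behavior can be sketched with a result cache keyed by request id. A single-process sketch; production would back this with a shared store such as Redis:

```python
class Deduper:
    """Cache final results by request id so that client retries replay
    the stored answer instead of triggering a second generation."""

    def __init__(self, generate):
        self._generate = generate   # the expensive LLM call
        self._results = {}          # request_id -> final result

    def handle(self, request_id, prompt):
        # Retry of a completed request: replay, do not regenerate.
        if request_id in self._results:
            return self._results[request_id]
        result = self._generate(prompt)
        self._results[request_id] = result
        return result
```

A production version also tracks in-flight requests so a retry can attach to the running generation rather than wait and re-issue.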

Step 3: Route easy work to cheaper models

Model routing often saves more than prompt tuning.

Routes:

Workload                  Candidate route
Classification            small model or embedding classifier
Extraction                small or mid model with schema validation
Simple FAQ                cache or small model
RAG synthesis             mid model
Complex reasoning         large model
High-risk final answer    large model plus review

Measure cost per successful task, not only cost per request.
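The table above reduces to a lookup with a safe default. A minimal sketch; the workload labels and tier names are illustrative:

```python
# Workload -> model tier, mirroring the routing table above.
# Names are illustrative; a real router would also consult
# per-tenant policy and confidence from a classifier.
ROUTES = {
    "classification": "small",
    "extraction": "small",
    "faq": "small",
    "rag_synthesis": "mid",
    "complex_reasoning": "large",
    "high_risk_final": "large",
}

def route(workload: str, default: str = "mid") -> str:
    """Pick the candidate route; unknown workloads fall back to mid,
    not large, so new features do not silently run at top cost."""
    return ROUTES.get(workload, default)
```

The key design choice is the default: falling back to the mid tier keeps unrecognized traffic off the most expensive route while routing rules catch up.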

Step 4: Use cache layers

Use exact cache first:

import hashlib, json

def cache_key(tenant_id, user_scope, prompt_version,
              model_version, normalized_input, corpus_version):
    # Every field that can change the answer must be part of the key,
    # or the cache will serve stale or cross-tenant results.
    payload = json.dumps([tenant_id, user_scope, prompt_version,
                          model_version, normalized_input, corpus_version])
    return hashlib.sha256(payload.encode()).hexdigest()

Then add semantic cache only for domains where freshness is controlled. A semantic cache should store:

  • original query
  • normalized query
  • embedding model
  • answer
  • evidence ids
  • corpus version
  • expiration
  • safety policy version

Cache invalidation is a product policy. Pricing answers, legal policies, inventory, and incident status need different TTLs.
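The semantic cache entry and its invalidation rules can be sketched as follows. The TTL values are illustrative placeholders, not recommendations; the fields mirror the list above:

```python
import time

# TTL per answer domain is a product decision; numbers are placeholders.
TTL_SECONDS = {"pricing": 3600, "legal": 86400, "inventory": 60, "incident": 30}

class SemanticCacheEntry:
    """One stored answer with everything needed to decide if it is reusable."""

    def __init__(self, query, normalized_query, embedding_model, answer,
                 evidence_ids, corpus_version, safety_policy_version, domain):
        self.query = query
        self.normalized_query = normalized_query
        self.embedding_model = embedding_model
        self.answer = answer
        self.evidence_ids = evidence_ids
        self.corpus_version = corpus_version
        self.safety_policy_version = safety_policy_version
        self.expires_at = time.time() + TTL_SECONDS.get(domain, 300)

    def usable(self, corpus_version, safety_policy_version):
        # A stale corpus or safety policy invalidates the entry
        # even before the TTL expires.
        return (time.time() < self.expires_at
                and self.corpus_version == corpus_version
                and self.safety_policy_version == safety_policy_version)
```

Note that corpus and policy versions act as hard invalidators while TTL handles freshness by domain.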

Step 5: Budget context

Long prompts are often self-inflicted.

Set budgets:

system prompt: 400 tokens
developer policy: 500 tokens
conversation window: 1500 tokens
retrieved context: 3500 tokens
tool results: 1000 tokens
output max: 800 tokens

Then enforce them. Retrieval should return the best evidence that fits, not every chunk that matched. Tool results should be summarized or structured. Conversation history should be compacted.
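Enforcement for the retrieval section can be sketched as a greedy fit against the budget. This assumes chunks arrive best-first from the retriever and that token counts come from a tokenizer (raw counts stand in here):

```python
# Per-section token budgets from the text.
BUDGETS = {"system": 400, "developer": 500, "conversation": 1500,
           "retrieval": 3500, "tools": 1000}

def enforce(sections):
    """sections maps section name -> list of chunk token counts,
    ordered best-first. Keep the best chunks that fit each budget,
    not every chunk that matched."""
    kept = {}
    for name, chunks in sections.items():
        budget, out, used = BUDGETS[name], [], 0
        for tokens in chunks:
            if used + tokens <= budget:
                out.append(tokens)
                used += tokens
        kept[name] = out
    return kept
```

A skipped middle chunk can still leave room for a smaller, lower-ranked one, which is the point: the budget, not the match count, decides what ships.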

Step 6: Use batch APIs for offline jobs

Not all work is interactive.

Good batch candidates:

  • offline evals
  • document classification
  • embedding corpus jobs
  • nightly summarization
  • backfills
  • report generation

OpenAI’s Batch API, for example, is designed for asynchronous jobs with lower cost and a 24-hour completion window. That is not suitable for chat UX, but it is ideal for evals and corpus processing.
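Submitting such a job starts with building the batch input file. A sketch, assuming the JSONL line shape of OpenAI's Batch API (`custom_id`, `method`, `url`, `body`); the model name and token limit are placeholders:

```python
import json

def build_batch_file(prompts, path, model="gpt-4o-mini"):
    """Write a JSONL batch input file, one request per line.
    Each line carries a custom_id so results can be matched back."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"task-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "max_tokens": 256,  # batch jobs still need output caps
                },
            }
            f.write(json.dumps(line) + "\n")
```

The file is then uploaded and a batch created against it; results arrive asynchronously, which is exactly the trade that makes the discount possible.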

Step 7: Quantize self-hosted models

For self-hosting, quantization can reduce memory footprint and increase serving density. It can also change quality. Treat quantization as a route-specific release:

  1. Pick target model and quantization method.
  2. Run golden evals.
  3. Run safety evals.
  4. Run latency and throughput benchmarks.
  5. Canary low-risk traffic.
  6. Compare cost per successful task.

Do not approve quantization from throughput alone.
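The decision metric in step 6 of that checklist is simple arithmetic. Illustrative numbers only, to show why a cheaper request can still lose:

```python
def cost_per_successful_task(cost_per_request, success_rate):
    """The approval metric for quantization. A request that is cheaper
    but fails evals more often can cost more per successful task."""
    return cost_per_request / success_rate

# Hypothetical figures: quantized requests are 40% cheaper, but the
# eval pass rate dropped sharply, so cost per success got worse.
baseline = cost_per_successful_task(0.010, 0.94)
quantized = cost_per_successful_task(0.006, 0.55)
```

This is the same reason throughput alone is not an approval signal: it measures requests, not successes.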

Step 8: Add guardrails

Cost guardrails:

  • max input tokens per route
  • max output tokens per route
  • max tool steps
  • max retries
  • max cost per request
  • daily tenant budget
  • feature-level budget
  • alert on cost anomaly

If a user asks for a 200-page answer, the system should negotiate scope instead of obeying.
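The per-route limits can be enforced with a single check before generation. A sketch with illustrative limit values; real values belong in per-route config:

```python
# Illustrative per-route limits; placeholders, not recommendations.
LIMITS = {
    "input_tokens": 6000,
    "output_tokens": 800,
    "tool_steps": 5,
    "retries": 2,
    "cost_usd": 0.05,
}

def check_guardrails(request, limits=LIMITS):
    """Return the names of violated limits; an empty list means the
    request may proceed. Violations should be logged with attribution."""
    return [field for field, limit in limits.items()
            if request.get(field, 0) > limit]
```

A non-empty result should route to scope negotiation or rejection, never to silent truncation of a paid-for answer.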

A sample savings stack

Example starting point:

100,000 requests/day
average cost: $0.020
daily cost: $2,000

After changes:

Change                        Savings
Deduplicate retries           8%
Route easy work               45%
Exact cache                   12%
Semantic cache on FAQ route   18%
Context budget                20%
Batch offline jobs            50% on eligible work

Savings compound, but not perfectly. Measure each layer with attribution.
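Compounding means the layers multiply on remaining cost; they do not add. Naively summing the table's percentages gives 103%, which would imply negative spend. A sketch of the optimistic bound, in which every layer touches all traffic:

```python
def compound_savings(baseline_daily_cost, savings_fractions):
    """Apply each layer's savings to the cost remaining after the
    previous layers. This is an upper bound on savings: in practice
    each layer only touches part of the traffic, which is why the
    layers compound, but not perfectly."""
    remaining = baseline_daily_cost
    for fraction in savings_fractions:
        remaining *= (1.0 - fraction)
    return remaining

# Using the table's first five layers against the $2,000/day baseline:
# compound_savings(2000, [0.08, 0.45, 0.12, 0.18, 0.20])
```

Attribution (step 1) is what lets you measure each factor independently instead of guessing at the product.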

[Figure: LLM cost savings waterfall from baseline cost through deduplication, routing, caching, context budgeting, and serving efficiency.]
A savings waterfall keeps the conversation tied to measured cost per successful task.

Sources and receipts