Production LLM Systems Tutorial 2: Latency, Cost, and Quality

Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

Every production LLM system lives inside a triangle:

          quality
            /\
           /  \
          /    \
         /      \
 latency -------- cost

You can improve one corner for free in a demo. At production scale, every improvement moves pressure somewhere else. A larger model improves answer quality but raises latency and cost. A semantic cache cuts cost but can serve stale answers. Quantization improves serving capacity but may degrade hard reasoning. Prompt compression lowers token spend but can strip out the detail the model needed.

This tutorial gives you the control panel.

[Image: the latency, cost, and quality triangle with optimization levers such as routing, caching, batching, compression, quantization, and evaluation.]
Latency, cost, and quality are coupled. Every optimization needs a measurement plan.

Step 1: Measure the right latency

Do not use only “average latency.” It hides what users actually feel.

Use these metrics:

| Metric | Meaning | Why it matters |
| --- | --- | --- |
| TTFT | Time to first token | Dominates perceived responsiveness in chat |
| TPOT | Time per output token | Determines streaming smoothness |
| p50 | Median latency | Useful for baseline behavior |
| p95 / p99 | Tail latency | Shows queueing, cold starts, overload, and provider variance |
| End-to-end task time | Total time to a useful outcome | Captures retrieval, tools, model calls, and retries |
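A minimal sketch of computing these metrics from client-side timestamps. The RequestTiming fields and the percentile helper are illustrative assumptions; they presume you record send, first-token, and last-token times for every streamed request.

```python
import statistics
from dataclasses import dataclass

@dataclass
class RequestTiming:
    """Timestamps (seconds) recorded for one streamed completion."""
    sent_at: float          # request dispatched
    first_token_at: float   # first streamed token received
    last_token_at: float    # final token received
    output_tokens: int      # number of generated tokens

def ttft(t: RequestTiming) -> float:
    return t.first_token_at - t.sent_at

def tpot(t: RequestTiming) -> float:
    # Time per output token over the streaming phase (excludes TTFT).
    if t.output_tokens <= 1:
        return 0.0
    return (t.last_token_at - t.first_token_at) / (t.output_tokens - 1)

def percentile(values: list[float], q: int) -> float:
    # statistics.quantiles with n=100 returns cut points for p1..p99.
    cuts = statistics.quantiles(sorted(values), n=100)
    return cuts[q - 1]

timings: list[RequestTiming] = []  # filled in by your client instrumentation
if len(timings) >= 2:
    ttfts = [ttft(t) for t in timings]
    print("TTFT p50:", percentile(ttfts, 50))
    print("TTFT p95:", percentile(ttfts, 95))
    print("TTFT p99:", percentile(ttfts, 99))
```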

If a request calls tools, do not blame the model until you split the trace:

request total: 8.2s
  auth/rate limit: 12ms
  retrieval: 420ms
  rerank: 180ms
  model TTFT: 760ms
  model stream: 3.1s
  tool call: 3.5s
  validation: 90ms

This trace says the tool is the bottleneck, not the model.
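A small sketch of the same idea in code: sum the span durations (the values here are copied from the trace above) and rank them, so the latency conversation starts from the actual bottleneck. The span names are only illustrative.

```python
# Span durations in seconds, taken from the example trace above.
spans = {
    "auth/rate limit": 0.012,
    "retrieval": 0.420,
    "rerank": 0.180,
    "model TTFT": 0.760,
    "model stream": 3.1,
    "tool call": 3.5,
    "validation": 0.090,
}

total = sum(spans.values())
for name, seconds in sorted(spans.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>16}: {seconds:6.3f}s ({seconds / total:5.1%})")
# The tool call and model stream dominate; swapping the model would not
# fix this request's latency.
```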

Step 2: Route work by difficulty

The cheapest optimization is not using the expensive model for easy work.

Start with a router:

request
  -> classify difficulty and risk
  -> easy factual task: small model
  -> normal assistant task: mid model
  -> complex reasoning or high-risk answer: large model
  -> unsafe or unsupported: refuse or escalate

Routing can be rules, a small classifier, or a cheap model. The router should look at task type, required tools, tenant policy, maximum budget, and confidence from previous attempts.
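A minimal rule-based router sketch along these lines. The task types, tier names, and escalation rules are placeholders, not a recommended policy; a real router would set these boundaries with evals.

```python
from dataclasses import dataclass

SMALL, MID, LARGE = "small-model", "mid-model", "large-model"

@dataclass
class Request:
    task_type: str        # e.g. "extract", "classify", "chat", "reasoning"
    needs_tools: bool
    high_risk: bool       # legal, medical, money-moving, etc.
    prior_failures: int   # confidence signal from earlier attempts

def route(req: Request) -> str:
    if req.high_risk or req.prior_failures > 0:
        return LARGE      # escalate risky or previously failed work
    if req.task_type in {"extract", "classify", "format", "summarize_short"}:
        return SMALL      # cheap, well-bounded tasks
    if req.task_type in {"chat", "rag_synthesis"} and not req.needs_tools:
        return MID        # the default path for most traffic
    return LARGE          # complex reasoning, ambiguous intent, tool-heavy work

print(route(Request("classify", needs_tools=False, high_risk=False, prior_failures=0)))
# -> small-model
```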

A common pattern:

  1. Try a small model for extraction, classification, formatting, and simple summarization.
  2. Use a mid-size model for most chat and RAG synthesis.
  3. Reserve the strongest model for hard reasoning, ambiguous user intent, and high-value workflows.
  4. Escalate only when evals prove escalation improves outcomes.

Routing without evals is just cost cutting with nicer words.

Step 3: Cache only what is safe to cache

Use two cache layers:

| Cache | Key | Best for | Main risk |
| --- | --- | --- | --- |
| Exact cache | Prompt hash plus model and prompt version | Repeated deterministic requests | Misses paraphrases |
| Semantic cache | Embedding similarity plus tenant and policy scope | Similar user questions | Stale or wrong reuse |

For semantic cache, the threshold is a product decision. A threshold near 0.95 may work for FAQ-style questions. Lower thresholds need stronger validation and narrower domains.

Never cache across tenants unless the answer is explicitly public and policy-approved. Cache keys should include:

  • tenant id
  • user permission scope
  • prompt version
  • model version
  • retrieval corpus version
  • safety policy version

If any of those change, the cache entry may be invalid.
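A sketch of both layers under those constraints: the exact key hashes every field that can change the answer, so a version bump naturally invalidates old entries, and the semantic lookup only matches inside the same tenant and policy scope. The field names, helpers, and the 0.95 threshold are illustrative assumptions.

```python
import hashlib
import json

def exact_cache_key(prompt: str, scope: dict) -> str:
    # Every field that can change the answer is part of the key.
    payload = json.dumps(
        {
            "prompt": prompt,
            "tenant_id": scope["tenant_id"],
            "permission_scope": scope["permission_scope"],
            "prompt_version": scope["prompt_version"],
            "model_version": scope["model_version"],
            "corpus_version": scope["corpus_version"],
            "safety_policy_version": scope["safety_policy_version"],
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def semantic_lookup(query_emb, entries, scope, threshold=0.95):
    # Only reuse within the same tenant and policy scope, and only above a
    # similarity threshold chosen per product surface.
    best = None
    for entry in entries:
        if entry["scope"] != scope:
            continue  # never reuse across tenants or policies
        sim = cosine_similarity(query_emb, entry["embedding"])
        if sim >= threshold and (best is None or sim > best[0]):
            best = (sim, entry["answer"])
    return best[1] if best else None
```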

Step 4: Compress context carefully

Prompt compression is useful when long context dominates cost. Techniques like LLMLingua show that prompts can often be shortened substantially while preserving task performance, but compression is not free.

Use compression when:

  • context is long and redundant
  • the answer only needs a few facts
  • latency or cost is constrained
  • the compression model is cheaper than the saved generation cost

Avoid compression when:

  • legal wording matters
  • code details matter
  • the task depends on edge cases
  • retrieval already produced tight context

The safe pattern is “retrieve small, expand parent, then compress only if the packed context exceeds budget.”
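A sketch of that pattern, assuming you already have a token counter and a compressor (LLMLingua-style or otherwise) to plug in; both are passed as callables here because the specific tools vary.

```python
def pack_context(chunks: list[str], budget_tokens: int,
                 count_tokens, compress_context) -> str:
    # chunks are assumed sorted by relevance, highest first.
    packed = "\n\n".join(chunks)
    if count_tokens(packed) <= budget_tokens:
        return packed  # under budget: no compression, cheapest and safest
    compressed = compress_context(packed, target_tokens=budget_tokens)
    if count_tokens(compressed) > budget_tokens:
        # Compression did not hit budget: drop the lowest-ranked chunks
        # instead of silently overflowing the context window.
        while chunks and count_tokens("\n\n".join(chunks)) > budget_tokens:
            chunks.pop()
        return "\n\n".join(chunks)
    return compressed
```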

Step 5: Use batching for throughput, not magic

Self-hosted inference depends on batching. Static batching collects a group of requests and processes them together. Continuous batching, used by systems such as vLLM, keeps admitting and retiring requests as sequences progress.

Continuous batching helps because generation is uneven. One user may request 20 output tokens. Another may request 2,000. Static batches make fast requests wait behind slow ones.

Batching has a downside: queueing. If your system waits too long to build a large batch, TTFT suffers. That is why serving systems need a scheduler, not just a batch size.
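A toy scheduler loop that shows the core idea of continuous batching: a finished sequence frees its slot immediately and the next queued request is admitted on the following step, instead of waiting for an entire static batch to drain. Real engines also manage prefill, KV cache memory, and preemption; this sketch ignores all of that.

```python
from collections import deque

def serve(requests, max_active=4):
    queue = deque(requests)  # items are (request_id, tokens_to_generate)
    active = {}              # request_id -> remaining tokens
    steps = 0
    while queue or active:
        # Admit new requests as soon as a slot is free, which keeps TTFT
        # low even when output lengths are wildly mixed.
        while queue and len(active) < max_active:
            rid, tokens = queue.popleft()
            active[rid] = tokens
        # One decode step generates one token for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # finished: slot is reused on the next step
        steps += 1
    return steps

print(serve([("a", 20), ("b", 2000), ("c", 30), ("d", 15), ("e", 25)]))
```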

Track:

  • queue depth
  • waiting time before prefill
  • active sequences
  • KV cache pressure
  • tokens per second
  • TTFT and TPOT by route

Step 6: Quantize with evals, not hope

Quantization reduces memory and often improves serving capacity. The path usually goes from FP16/BF16 to FP8, INT8, or INT4-style weight quantization.

The trade-off is task-dependent. A quantized larger model can outperform a smaller full-precision model, but the answer depends on model family, bit width, calibration, task difficulty, and serving engine.

Run evals for:

  • instruction following
  • hallucination sensitivity
  • tool argument formatting
  • domain vocabulary
  • long-context retrieval
  • safety refusals

Do not approve quantization because benchmark throughput improved. Approve it because the eval suite says the quality loss is acceptable for that route.
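One way to encode that decision as a gate, assuming each eval suite produces a higher-is-better score for both the baseline and the quantized candidate. The suite names and allowed regressions below are illustrative, not recommended values.

```python
# Maximum acceptable score drop per eval suite for this route.
ACCEPTABLE_DROP = {
    "instruction_following": 0.01,
    "hallucination_resistance": 0.005,
    "tool_arg_formatting": 0.01,
    "domain_vocabulary": 0.02,
    "long_context_retrieval": 0.02,
    "safety_refusals": 0.0,  # no regression tolerated
}

def approve_quantized(baseline: dict, quantized: dict) -> bool:
    for suite, allowed_drop in ACCEPTABLE_DROP.items():
        if baseline[suite] - quantized[suite] > allowed_drop:
            print(f"blocked by {suite}: "
                  f"{baseline[suite]:.3f} -> {quantized[suite]:.3f}")
            return False
    return True
```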

Step 7: Try speculative decoding when decode dominates

Speculative decoding uses a cheaper draft model to propose tokens and a larger target model to verify them. It works best when decode is the bottleneck and the draft model’s predictions are frequently accepted.

It works poorly when:

  • prompts are huge and prefill dominates
  • the draft model is too slow
  • acceptance rate is low
  • the workload is tool-latency-bound
  • memory pressure from the draft model reduces batch size

Measure before and after:

baseline:
  TTFT 820ms
  TPOT 42ms
  cost/request $0.018

with speculation:
  TTFT 910ms
  TPOT 24ms
  cost/request $0.020

That may be worth it for long generations and not worth it for short answers.
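A quick way to see why: plug the measured TTFT and TPOT into a per-request latency estimate at different output lengths. The numbers below reuse the example above; the decode speedup compounds with output length, while the TTFT regression and the extra per-request cost are fixed, which is why the call is made per route.

```python
def request_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    # First token arrives after TTFT; each further token adds one TPOT.
    return ttft_s + tpot_s * max(output_tokens - 1, 0)

baseline = {"ttft": 0.820, "tpot": 0.042}
speculative = {"ttft": 0.910, "tpot": 0.024}

for tokens in (50, 200, 1000):
    base = request_latency(baseline["ttft"], baseline["tpot"], tokens)
    spec = request_latency(speculative["ttft"], speculative["tpot"], tokens)
    print(f"{tokens:>5} tokens: baseline {base:.2f}s vs speculative {spec:.2f}s")
```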

The optimization order

Use this order for most applications:

  1. Reduce unnecessary calls with product flow changes.
  2. Add exact cache for deterministic and repeated requests.
  3. Add model routing.
  4. Tighten retrieval and context packing.
  5. Add semantic cache where freshness can be controlled.
  6. Use batch processing for offline jobs.
  7. Tune serving engine, batching, and KV cache.
  8. Evaluate quantization.
  9. Evaluate speculative decoding.
  10. Revisit prompt compression for long-context routes.

Do not start with exotic serving tricks if half your traffic is duplicate retries.

[Image: the optimization order, from removing waste to caching, routing, context packing, and advanced serving techniques.]
Start with waste removal and caching before advanced serving techniques.

Sources and receipts