Production LLM Systems Tutorial 5: Monitoring and Observability
Tutorial Series
- End-to-End Application Design
- Latency, Cost, and Quality
- Scalable Inference Architecture
- RAG and Data Pipelines
- Monitoring and Observability
- Evaluation and A/B Testing
- Security and Prompt Injection
- Human-in-the-Loop Workflows
- Cost Optimization
- Versioning and Disaster Recovery
LLM observability is not a dashboard with token counts.
A production LLM system is a distributed workflow. One user request can include retrieval, reranking, prompt assembly, one or more model calls, tool execution, output validation, policy checks, streaming, and feedback capture. If you only log the final answer, you cannot debug the system.
This tutorial builds an observability model with three layers: system, quality, and drift.
Layer 1: System observability
System metrics answer: “Is the service healthy?”
Track:
| Metric | Why it matters |
|---|---|
| TTFT (time to first token) | User-perceived responsiveness |
| TPOT (time per output token) | Streaming smoothness and decode health |
| p50/p95/p99 latency | Normal and tail behavior |
| tokens/sec | Serving throughput |
| queue depth | Saturation and under-provisioning |
| error rate | Provider, tool, validation, and policy failures |
| retry count | Hidden cost and latency multiplier |
| GPU utilization | Whether accelerators are doing useful work |
| KV cache hit rate | Prefix reuse and routing quality |
| cache hit rate | Exact and semantic cache efficiency |
| cost per request | Unit economics |
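These numbers have to be computed somewhere before they reach a dashboard. Below is a minimal sketch of measuring TTFT, TPOT, throughput, and per-request cost around a streamed response; the pricing constants and the one-token-per-chunk assumption are placeholders, and in a real service you would emit the resulting values to your metrics backend rather than return them:

```python
import time

def measure_stream(stream, input_tokens: int):
    """Consume a token stream and compute TTFT, TPOT, throughput, and cost.

    `stream` is any iterator yielding output text chunks; the unit prices
    below are placeholders, not a real price sheet.
    """
    PRICE_PER_1K_IN, PRICE_PER_1K_OUT = 0.0005, 0.0015  # placeholder prices, USD per 1K tokens

    start = time.perf_counter()
    first_token_at = None
    chunks = []

    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()        # marks TTFT
        chunks.append(chunk)

    end = time.perf_counter()
    output_tokens = len(chunks)                         # assumes one token per chunk
    ttft = (first_token_at or end) - start
    decode_time = end - (first_token_at or end)
    tpot = decode_time / max(output_tokens - 1, 1)      # average time per output token
    cost = (input_tokens * PRICE_PER_1K_IN + output_tokens * PRICE_PER_1K_OUT) / 1000

    metrics = {
        "ttft_s": round(ttft, 3),
        "tpot_s": round(tpot, 4),
        "tokens_per_sec": round(output_tokens / max(end - start, 1e-6), 1),
        "cost_per_request_usd": round(cost, 6),
    }
    return "".join(chunks), metrics
```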
Split latency by span. A single “latency_ms” metric is not enough.
```
request
  retrieval: 430ms
  rerank: 170ms
  model prefill: 620ms
  first token: 790ms
  tool call: 2.4s
  final stream: 4.8s
```

That trace tells you what to fix.
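If you are not yet on a full tracing stack, a lightweight per-stage timer gets you that breakdown. The sketch below uses only the standard library; the `stage` helper and the sleep-based stand-ins for retrieval, rerank, and the model call are illustrative, not a prescribed API:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(timings: dict, name: str):
    """Record wall-clock milliseconds for one stage of a request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = round((time.perf_counter() - start) * 1000, 1)

def handle_request(query: str) -> dict:
    timings: dict[str, float] = {}
    with stage(timings, "retrieval"):
        time.sleep(0.05)   # stand-in for vector + BM25 search
    with stage(timings, "rerank"):
        time.sleep(0.02)   # stand-in for a cross-encoder rerank
    with stage(timings, "model"):
        time.sleep(0.10)   # stand-in for the model call
    return timings

print(handle_request("how do I rotate my API key?"))
```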
Layer 2: Quality observability
Quality metrics answer: “Is the system useful and safe?”
Track:
| Signal | What it checks |
|---|---|
| Groundedness | Does the answer stay within retrieved evidence? |
| Answer relevance | Does it answer the user question? |
| Refusal quality | Does it refuse only when appropriate? |
| Tool success | Did tool calls complete with valid arguments? |
| Citation accuracy | Do citations support the claims? |
| User feedback | Thumbs, edits, re-asks, escalation |
| Human review outcome | Accepted, corrected, rejected |
LLM-as-judge can help, but do not treat it as truth. Judges have bias. They can prefer longer answers, familiar wording, and their own model family’s style. Calibrate judges against human-labeled examples and use pairwise comparisons when subjective quality matters.
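One way to operationalize that advice is to swap answer positions on every judge call, discard inconsistent verdicts, and report agreement against a human-labeled set. The sketch below assumes a `call_judge` placeholder that returns "A" or "B"; it is a calibration harness outline, not a complete judging pipeline:

```python
def call_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Placeholder for an LLM-as-judge call that returns 'A' or 'B'."""
    raise NotImplementedError

def pairwise_verdict(question: str, answer_1: str, answer_2: str) -> str | None:
    """Ask the judge twice with positions swapped; keep only consistent verdicts."""
    first = call_judge(question, answer_1, answer_2)    # answer_1 shown as A
    second = call_judge(question, answer_2, answer_1)   # positions swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None                                         # position bias: discard the verdict

def judge_agreement(examples: list[dict]) -> float:
    """Fraction of human-labeled pairs where the judge agrees with the human pick."""
    verdicts = [pairwise_verdict(e["question"], e["answer_1"], e["answer_2"]) for e in examples]
    usable = [(v, e["human_pick"]) for v, e in zip(verdicts, examples) if v is not None]
    if not usable:
        return 0.0
    return sum(v == h for v, h in usable) / len(usable)
```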
Layer 3: Drift observability
Drift metrics answer: “Is the input or output distribution changing?”
Watch:
- input topic distribution
- language mix
- document corpus changes
- embedding distribution
- retrieval score distribution
- output length
- refusal rate
- tool-call frequency
- cost per task
Example: if retrieval scores drop after a document migration, the model may still answer fluently, but evidence quality has degraded. That is a RAG incident, even if uptime is perfect.
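For distribution signals like retrieval scores, a reference-versus-current comparison is often enough to catch incidents like the one above. The sketch below computes a population stability index with numpy; the bucket count, the synthetic beta-distributed scores, and the 0.2 rule of thumb are illustrative defaults, not standards:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, buckets: int = 10) -> float:
    """Population stability index between two score distributions.

    Bucket edges come from the reference window's quantiles, so each bucket
    holds roughly the same share of reference traffic.
    """
    edges = np.quantile(reference, np.linspace(0, 1, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)   # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Example: retrieval scores before and after a document migration (synthetic).
last_week = np.random.default_rng(0).beta(8, 2, 5000)   # mostly high scores
this_week = np.random.default_rng(1).beta(5, 3, 5000)   # shifted lower
print(f"retrieval score PSI: {psi(last_week, this_week):.3f}")  # > 0.2 is often treated as drift
```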
Trace schema
Use one trace per user request and child spans for each operation:
```
trace: request_id=req_123
  span: gateway.auth
  span: gateway.rate_limit
  span: orchestrator.route
  span: retrieval.query_rewrite
  span: retrieval.vector_search
  span: retrieval.bm25_search
  span: retrieval.rerank
  span: prompt.assemble
  span: model.call
  span: tool.execute
  span: output.validate
  span: stream.send
```

Each span should include enough metadata to debug behavior without storing raw secrets:
```json
{
  "tenant_id": "tenant_a",
  "feature": "support_answer",
  "prompt_version": "support_v14",
  "model": "model_route_mid",
  "input_tokens": 1840,
  "output_tokens": 322,
  "cache_hit": false,
  "retrieved_documents": 5,
  "tool_count": 1
}
```

Privacy rules
Prompt and response capture is sensitive. Before storing traces:
- redact PII
- hash user identifiers where possible
- separate raw payload storage from metrics
- apply retention limits
- restrict trace access by tenant and environment
- avoid logging secrets in tool arguments
For high-risk domains, store structured summaries and references instead of raw prompts by default.
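As a sketch of the first two rules, trace sanitization can redact obvious PII patterns and hash user identifiers before span metadata is written. The regex patterns and salt handling below are illustrative only and are not a substitute for a real PII detection service:

```python
import hashlib
import re

# Illustrative patterns only; order matters (redact card numbers before phone-like runs).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders before storage."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{name}_redacted>", text)
    return text

def hash_user_id(user_id: str, salt: str = "rotate-me") -> str:
    """Stable pseudonymous identifier; keep the salt outside the trace store."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]

span_metadata = {
    "user": hash_user_id("user_18423"),
    "input_preview": redact("My card 4111 1111 1111 1111 stopped working, email me at alice@example.com"),
}
print(span_metadata)
```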
Alerting
Useful alerts are tied to user impact:
| Alert | Likely issue |
|---|---|
| TTFT p95 doubled | Queueing, provider degradation, prefill overload |
| TPOT p95 increased | Decode saturation, KV pressure, model route regression |
| Retrieval recall eval dropped | Corpus or embedding issue |
| Refusal rate spiked | Safety policy or classifier regression |
| Cost per task rose | Retry loop, prompt bloat, routing regression |
| Tool failures rose | Downstream dependency or schema drift |
Do not alert on every odd model answer. Use sampling, eval thresholds, and aggregate trends.
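As an example of the first row in the table, a TTFT alert can be expressed as a baseline-versus-current comparison over sliding windows. The ratio, window sizes, and synthetic samples below are illustrative; real thresholds should come from your own traffic:

```python
import numpy as np

def ttft_alert(baseline_ms: np.ndarray, current_ms: np.ndarray,
               ratio: float = 2.0, min_samples: int = 200) -> bool:
    """Fire when the current window's p95 TTFT reaches `ratio` times the baseline p95.

    Requiring a minimum sample count avoids paging on a handful of slow requests.
    """
    if len(current_ms) < min_samples:
        return False
    baseline_p95 = np.percentile(baseline_ms, 95)
    current_p95 = np.percentile(current_ms, 95)
    return current_p95 >= ratio * baseline_p95

# Synthetic example: last week's TTFT samples vs. a degraded recent window (ms).
rng = np.random.default_rng(7)
baseline = rng.gamma(4, 120, 5000)
current = rng.gamma(4, 260, 400)
print("page on TTFT p95:", ttft_alert(baseline, current))
```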
Tooling
Common tools include Langfuse, LangSmith, Arize Phoenix, Helicone, Weights & Biases, and OpenTelemetry-based pipelines. The exact vendor matters less than the trace model. If your trace does not show retrieval, prompt version, model route, tool calls, token usage, and cost, you will still be debugging blind.
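With an OpenTelemetry-based pipeline, the fields from the trace schema above become span attributes. The sketch below reuses this tutorial's field names as attribute keys and exports to the console for demonstration; the OpenTelemetry GenAI semantic conventions define standard gen_ai.* keys that are worth preferring wherever they cover the same field:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for demonstration; production would export to a collector.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("llm.orchestrator")

def answer_question(query: str) -> str:
    with tracer.start_as_current_span("model.call") as span:
        # Attribute keys mirror this tutorial's trace schema; swap in the
        # standardized gen_ai.* keys where the semantic conventions define them.
        span.set_attribute("prompt_version", "support_v14")
        span.set_attribute("model", "model_route_mid")
        span.set_attribute("retrieved_documents", 5)
        answer = "stub answer"              # stand-in for the real model client
        span.set_attribute("input_tokens", 1840)
        span.set_attribute("output_tokens", 322)
        span.set_attribute("cache_hit", False)
        return answer

answer_question("how do I rotate my API key?")
```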
Sources and receipts
- OpenTelemetry, “Semantic conventions for generative AI systems”: https://opentelemetry.io/docs/specs/semconv/gen-ai/
- Langfuse, “LLM Observability and Application Tracing”: https://langfuse.com/docs/observability/overview
- Arize Phoenix documentation: https://arize.com/docs/phoenix
- Helicone documentation: https://docs.helicone.ai/getting-started/platform-overview
