Production LLM Systems Tutorial 5: Monitoring and Observability

Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

LLM observability is not a dashboard with token counts.

A production LLM system is a distributed workflow. One user request can include retrieval, reranking, prompt assembly, one or more model calls, tool execution, output validation, policy checks, streaming, and feedback capture. If you only log the final answer, you cannot debug the system.

This tutorial builds an observability model with three layers: system, quality, and drift.

[Figure: LLM observability needs system health, quality signals, drift detection, privacy controls, and trace correlation.]

Layer 1: System observability

System metrics answer: “Is the service healthy?”

Track:

Metric                        Why it matters
TTFT (time to first token)    User-perceived responsiveness
TPOT (time per output token)  Streaming smoothness and decode health
p50/p95/p99 latency           Normal and tail behavior
tokens/sec                    Serving throughput
queue depth                   Saturation and under-provisioning
error rate                    Provider, tool, validation, and policy failures
retry count                   Hidden cost and latency multiplier
GPU utilization               Whether accelerators are doing useful work
KV cache hit rate             Prefix reuse and routing quality
cache hit rate                Exact and semantic cache efficiency
cost per request              Unit economics
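
A minimal sketch of exporting two of these metrics with the prometheus_client library; the metric names, labels, and histogram buckets are illustrative assumptions, not a standard.

# Sketch using prometheus_client. Names, labels, and buckets are
# illustrative assumptions to adapt to your own serving stack.
from prometheus_client import Counter, Histogram

TTFT_SECONDS = Histogram(
    "llm_ttft_seconds",
    "Time to first token, by model route",
    ["model_route"],
    buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0),
)
REQUEST_ERRORS = Counter(
    "llm_request_errors_total",
    "Failed requests by failure class",
    ["model_route", "error_class"],
)

def record_request(model_route: str, ttft_s: float, error_class: str | None) -> None:
    """Record one request's TTFT and, if it failed, its failure class."""
    TTFT_SECONDS.labels(model_route=model_route).observe(ttft_s)
    if error_class is not None:
        REQUEST_ERRORS.labels(model_route=model_route, error_class=error_class).inc()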

Split latency by span. A single “latency_ms” metric is not enough.

request
  retrieval: 430ms
  rerank: 170ms
  model prefill: 620ms
  first token: 790ms
  tool call: 2.4s
  final stream: 4.8s

That trace tells you what to fix.
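
A minimal sketch of producing spans like these with the OpenTelemetry Python API; retrieve, rerank, and call_model are assumed placeholder helpers, and exporter setup is omitted.

# Sketch of per-span latency with the OpenTelemetry API. Span names
# mirror the trace schema used in this tutorial.
import time
from opentelemetry import trace

tracer = trace.get_tracer("llm.request")

def handle_request(question: str) -> str:
    with tracer.start_as_current_span("request"):
        with tracer.start_as_current_span("retrieval.vector_search"):
            docs = retrieve(question)              # assumed helper
        with tracer.start_as_current_span("retrieval.rerank"):
            docs = rerank(question, docs)          # assumed helper
        with tracer.start_as_current_span("model.call") as model_span:
            start = time.monotonic()
            stream = call_model(question, docs)    # assumed helper, yields text chunks
            first_chunk = next(stream)             # arrival of the first token
            model_span.set_attribute("ttft_ms", (time.monotonic() - start) * 1000.0)
            answer = first_chunk + "".join(stream)
        return answer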

Layer 2: Quality observability

Quality metrics answer: “Is the system useful and safe?”

Track:

Signal                Example measurement
Groundedness          Does the answer stay within retrieved evidence?
Answer relevance      Does it answer the user's question?
Refusal quality       Does it refuse only when appropriate?
Tool success          Did tool calls complete with valid arguments?
Citation accuracy     Do citations support the claims?
User feedback         Thumbs, edits, re-asks, escalation
Human review outcome  Accepted, corrected, rejected

LLM-as-judge can help, but do not treat it as truth. Judges have bias. They can prefer longer answers, familiar wording, and their own model family’s style. Calibrate judges against human-labeled examples and use pairwise comparisons when subjective quality matters.
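
A sketch of one calibration loop: run the judge on human-labeled pairs in both presentation orders, discard verdicts that flip with order (position bias), and measure agreement on the rest. The judge callable and the label schema here are assumptions.

# Sketch of calibrating an LLM judge against human pairwise labels.
# `judge(question, first, second)` is an assumed callable returning
# "first" or "second"; each item records the human preference as "a" or "b".

def judge_agreement(labeled: list[dict], judge) -> float:
    """Fraction of order-stable judge verdicts that match human labels."""
    agree = stable = 0
    for item in labeled:
        v_fwd = judge(item["question"], item["answer_a"], item["answer_b"])
        v_rev = judge(item["question"], item["answer_b"], item["answer_a"])
        if v_fwd == v_rev:
            # Same positional pick both ways means the judge followed
            # position, not content; discard as unreliable.
            continue
        stable += 1
        judged = "a" if v_fwd == "first" else "b"
        agree += judged == item["human_prefers"]
    return agree / stable if stable else 0.0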

Layer 3: Drift observability

Drift metrics answer: “Is the input or output distribution changing?”

Watch:

  • input topic distribution
  • language mix
  • document corpus changes
  • embedding distribution
  • retrieval score distribution
  • output length
  • refusal rate
  • tool-call frequency
  • cost per task

Example: if retrieval scores drop after a document migration, the model may still answer fluently, but evidence quality has degraded. That is a RAG incident, even if uptime is perfect.
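
A sketch of one such drift check: compare the current window of retrieval scores against a baseline window with a two-sample Kolmogorov-Smirnov test from scipy. The p-value threshold is an illustrative assumption to tune per corpus.

# Sketch of a drift check on retrieval score distributions.
from scipy.stats import ks_2samp

def retrieval_scores_drifted(baseline: list[float], current: list[float],
                             p_threshold: float = 0.01) -> bool:
    """Flag drift when current top-k retrieval scores are unlikely to
    come from the same distribution as the baseline window."""
    stat, p_value = ks_2samp(baseline, current)
    return p_value < p_threshold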

Trace schema

Use one trace per user request and child spans for each operation:

trace: request_id=req_123
  span: gateway.auth
  span: gateway.rate_limit
  span: orchestrator.route
  span: retrieval.query_rewrite
  span: retrieval.vector_search
  span: retrieval.bm25_search
  span: retrieval.rerank
  span: prompt.assemble
  span: model.call
  span: tool.execute
  span: output.validate
  span: stream.send

Each span should include enough metadata to debug behavior without storing raw secrets:

[Figure: A useful trace shows every step that contributed to the final answer.]

{
  "tenant_id": "tenant_a",
  "feature": "support_answer",
  "prompt_version": "support_v14",
  "model": "model_route_mid",
  "input_tokens": 1840,
  "output_tokens": 322,
  "cache_hit": false,
  "retrieved_documents": 5,
  "tool_count": 1
}
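
A small sketch, again using the OpenTelemetry API, of copying only an allowlist of these fields onto the span; the field set follows the example above and is not a standard.

# Sketch: copy only allowlisted metadata onto the span so raw prompts
# and tool arguments never reach the trace backend.
ALLOWED_KEYS = {"tenant_id", "feature", "prompt_version", "model",
                "input_tokens", "output_tokens", "cache_hit",
                "retrieved_documents", "tool_count"}

def annotate_span(span, metadata: dict) -> None:
    for key, value in metadata.items():
        if key in ALLOWED_KEYS:
            span.set_attribute(key, value)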

Privacy rules

Prompt and response capture is sensitive. Before storing traces:

  • redact PII
  • hash user identifiers where possible
  • separate raw payload storage from metrics
  • apply retention limits
  • restrict trace access by tenant and environment
  • avoid logging secrets in tool arguments

For high-risk domains, store structured summaries and references instead of raw prompts by default.
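
A minimal sketch of payload hygiene before traces leave the service: salted hashing for user identifiers and regex redaction for two common PII patterns. The patterns are illustrative, not an exhaustive PII detector.

# Sketch of trace-payload hygiene. Extend the patterns for your domain.
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def hash_user_id(user_id: str, salt: str) -> str:
    """One-way, salted hash so traces can be joined without exposing IDs."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)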

Alerting

Useful alerts are tied to user impact:

Alert                          Likely issue
TTFT p95 doubled               Queueing, provider degradation, prefill overload
TPOT p95 increased             Decode saturation, KV pressure, model route regression
Retrieval recall eval dropped  Corpus or embedding issue
Refusal rate spiked            Safety policy or classifier regression
Cost per task rose             Retry loop, prompt bloat, routing regression
Tool failures rose             Downstream dependency or schema drift

Do not alert on every odd model answer. Use sampling, eval thresholds, and aggregate trends.
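
A sketch of the first alert rule under these constraints: compare the current window's TTFT p95 against a baseline window rather than alerting on single samples. The 2x factor and window handling are assumptions to tune.

# Sketch of an aggregate, user-impact alert check.
from statistics import quantiles

def ttft_p95_doubled(baseline_ms: list[float], window_ms: list[float],
                     factor: float = 2.0) -> bool:
    def p95(samples: list[float]) -> float:
        # quantiles with n=20 yields cut points at 5% steps; index 18 is p95
        return quantiles(samples, n=20)[18]
    return p95(window_ms) >= factor * p95(baseline_ms)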

Tooling

Common tools include Langfuse, LangSmith, Arize Phoenix, Helicone, Weights & Biases, and OpenTelemetry-based pipelines. The exact vendor matters less than the trace model. If your trace does not show retrieval, prompt version, model route, tool calls, token usage, and cost, you will still be debugging blind.
