Production LLM Systems Tutorial 5: Monitoring and Observability
Tutorial Series
- End-to-End Application Design
- Latency, Cost, and Quality
- Scalable Inference Architecture
- RAG and Data Pipelines
- Monitoring and Observability
- Evaluation and A/B Testing
- Security and Prompt Injection
- Human-in-the-Loop Workflows
- Cost Optimization
- Versioning and Disaster Recovery
LLM observability is not a dashboard with token counts.
A production LLM system is a distributed workflow. One user request can include retrieval, reranking, prompt assembly, one or more model calls, tool execution, output validation, policy checks, streaming, and feedback capture. If you only log the final answer, you cannot debug the system.
This tutorial builds an observability model with three layers: system, quality, and drift.
Layer 1: System observability
System metrics answer: “Is the service healthy?”
Track:
| Metric | Why it matters |
|---|---|
| TTFT (time to first token) | User-perceived responsiveness |
| TPOT (time per output token) | Streaming smoothness and decode health |
| p50/p95/p99 latency | Normal and tail behavior |
| tokens/sec | Serving throughput |
| queue depth | Saturation and under-provisioning |
| error rate | Provider, tool, validation, and policy failures |
| retry count | Hidden cost and latency multiplier |
| GPU utilization | Whether accelerators are doing useful work |
| KV cache hit rate | Prefix reuse and routing quality |
| cache hit rate | Exact and semantic cache efficiency |
| cost per request | Unit economics |
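These numbers have to be computed somewhere before they reach a dashboard. Below is a minimal sketch of measuring TTFT, TPOT, throughput, and per-request cost around a streamed response; the pricing constants and the one-token-per-chunk assumption are placeholders, and in a real service you would emit the resulting values to your metrics backend rather than return them:

```python
import time

def measure_stream(stream, input_tokens: int):
    """Consume a token stream and compute TTFT, TPOT, throughput, and cost.

    `stream` is any iterator yielding output text chunks; the unit prices
    below are placeholders, not a real price sheet.
    """
    PRICE_PER_1K_IN, PRICE_PER_1K_OUT = 0.0005, 0.0015  # placeholder prices, USD per 1K tokens

    start = time.perf_counter()
    first_token_at = None
    chunks = []

    for chunk in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()        # marks TTFT
        chunks.append(chunk)

    end = time.perf_counter()
    output_tokens = len(chunks)                         # assumes one token per chunk
    ttft = (first_token_at or end) - start
    decode_time = end - (first_token_at or end)
    tpot = decode_time / max(output_tokens - 1, 1)      # average time per output token
    cost = (input_tokens * PRICE_PER_1K_IN + output_tokens * PRICE_PER_1K_OUT) / 1000

    metrics = {
        "ttft_s": round(ttft, 3),
        "tpot_s": round(tpot, 4),
        "tokens_per_sec": round(output_tokens / max(end - start, 1e-6), 1),
        "cost_per_request_usd": round(cost, 6),
    }
    return "".join(chunks), metrics
```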
Split latency by span. A single “latency_ms” metric is not enough.
```
request
  retrieval: 430ms
  rerank: 170ms
  model prefill: 620ms
  first token: 790ms
  tool call: 2.4s
  final stream: 4.8s
```

That trace tells you what to fix.
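If you are not yet on a full tracing stack, a lightweight per-stage timer gets you that breakdown. The sketch below uses only the standard library; the `stage` helper and the sleep-based stand-ins for retrieval, rerank, and the model call are illustrative, not a prescribed API:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(timings: dict, name: str):
    """Record wall-clock milliseconds for one stage of a request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = round((time.perf_counter() - start) * 1000, 1)

def handle_request(query: str) -> dict:
    timings: dict[str, float] = {}
    with stage(timings, "retrieval"):
        time.sleep(0.05)   # stand-in for vector + BM25 search
    with stage(timings, "rerank"):
        time.sleep(0.02)   # stand-in for a cross-encoder rerank
    with stage(timings, "model"):
        time.sleep(0.10)   # stand-in for the model call
    return timings

print(handle_request("how do I rotate my API key?"))
```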
Layer 2: Quality observability
Quality metrics answer: “Is the system useful and safe?”
Track:
| Signal | What it checks |
|---|---|
| Groundedness | Does the answer stay within retrieved evidence? |
| Answer relevance | Does it answer the user question? |
| Refusal quality | Does it refuse only when appropriate? |
| Tool success | Did tool calls complete with valid arguments? |
| Citation accuracy | Do citations support the claims? |
| User feedback | Thumbs, edits, re-asks, escalation |
| Human review outcome | Accepted, corrected, rejected |
LLM-as-judge can help, but do not treat it as truth. Judges have bias. They can prefer longer answers, familiar wording, and their own model family’s style. Calibrate judges against human-labeled examples and use pairwise comparisons when subjective quality matters.
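One way to operationalize that advice is to swap answer positions on every judge call, discard inconsistent verdicts, and report agreement against a human-labeled set. The sketch below assumes a `call_judge` placeholder that returns "A" or "B"; it is a calibration harness outline, not a complete judging pipeline:

```python
def call_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Placeholder for an LLM-as-judge call that returns 'A' or 'B'."""
    raise NotImplementedError

def pairwise_verdict(question: str, answer_1: str, answer_2: str) -> str | None:
    """Ask the judge twice with positions swapped; keep only consistent verdicts."""
    first = call_judge(question, answer_1, answer_2)    # answer_1 shown as A
    second = call_judge(question, answer_2, answer_1)   # positions swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return None                                         # position bias: discard the verdict

def judge_agreement(examples: list[dict]) -> float:
    """Fraction of human-labeled pairs where the judge agrees with the human pick."""
    verdicts = [pairwise_verdict(e["question"], e["answer_1"], e["answer_2"]) for e in examples]
    usable = [(v, e["human_pick"]) for v, e in zip(verdicts, examples) if v is not None]
    if not usable:
        return 0.0
    return sum(v == h for v, h in usable) / len(usable)
```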
Layer 3: Drift observability
Drift metrics answer: “Is the input or output distribution changing?”
Watch:
- input topic distribution
- language mix
- document corpus changes
- embedding distribution
- retrieval score distribution
- output length
- refusal rate
- tool-call frequency
- cost per task
Example: if retrieval scores drop after a document migration, the model may still answer fluently, but evidence quality has degraded. That is a RAG incident, even if uptime is perfect.
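For distribution signals like retrieval scores, a reference-versus-current comparison is often enough to catch incidents like the one above. The sketch below computes a population stability index with numpy; the bucket count, the synthetic beta-distributed scores, and the 0.2 rule of thumb are illustrative defaults, not standards:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, buckets: int = 10) -> float:
    """Population stability index between two score distributions.

    Bucket edges come from the reference window's quantiles, so each bucket
    holds roughly the same share of reference traffic.
    """
    edges = np.quantile(reference, np.linspace(0, 1, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)   # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Example: retrieval scores before and after a document migration (synthetic).
last_week = np.random.default_rng(0).beta(8, 2, 5000)   # mostly high scores
this_week = np.random.default_rng(1).beta(5, 3, 5000)   # shifted lower
print(f"retrieval score PSI: {psi(last_week, this_week):.3f}")  # > 0.2 is often treated as drift
```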
Trace schema
Use one trace per user request and child spans for each operation:
```
trace: request_id=req_123
  span: gateway.auth
  span: gateway.rate_limit
  span: orchestrator.route
  span: retrieval.query_rewrite
  span: retrieval.vector_search
  span: retrieval.bm25_search
  span: retrieval.rerank
  span: prompt.assemble
  span: model.call
  span: tool.execute
  span: output.validate
  span: stream.send
```

Each span should include enough metadata to debug behavior without storing raw secrets:
```json
{
  "tenant_id": "tenant_a",
  "feature": "support_answer",
  "prompt_version": "support_v14",
  "model": "model_route_mid",
  "input_tokens": 1840,
  "output_tokens": 322,
  "cache_hit": false,
  "retrieved_documents": 5,
  "tool_count": 1
}
```

Privacy rules
Prompt and response capture is sensitive. Before storing traces:
- redact PII
- hash user identifiers where possible
- separate raw payload storage from metrics
- apply retention limits
- restrict trace access by tenant and environment
- avoid logging secrets in tool arguments
For high-risk domains, store structured summaries and references instead of raw prompts by default.
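As a sketch of the first two rules, trace sanitization can redact obvious PII patterns and hash user identifiers before span metadata is written. The regex patterns and salt handling below are illustrative only and are not a substitute for a real PII detection service:

```python
import hashlib
import re

# Illustrative patterns only; order matters (redact card numbers before phone-like runs).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders before storage."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{name}_redacted>", text)
    return text

def hash_user_id(user_id: str, salt: str = "rotate-me") -> str:
    """Stable pseudonymous identifier; keep the salt outside the trace store."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]

span_metadata = {
    "user": hash_user_id("user_18423"),
    "input_preview": redact("My card 4111 1111 1111 1111 stopped working, email me at alice@example.com"),
}
print(span_metadata)
```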
Alerting
Useful alerts are tied to user impact:
| Alert | Likely issue |
|---|---|
| TTFT p95 doubled | Queueing, provider degradation, prefill overload |
| TPOT p95 increased | Decode saturation, KV pressure, model route regression |
| Retrieval recall eval dropped | Corpus or embedding issue |
| Refusal rate spiked | Safety policy or classifier regression |
| Cost per task rose | Retry loop, prompt bloat, routing regression |
| Tool failures rose | Downstream dependency or schema drift |
Do not alert on every odd model answer. Use sampling, eval thresholds, and aggregate trends.
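As an example of the first row in the table, a TTFT alert can be expressed as a baseline-versus-current comparison over sliding windows. The ratio, window sizes, and synthetic samples below are illustrative; real thresholds should come from your own traffic:

```python
import numpy as np

def ttft_alert(baseline_ms: np.ndarray, current_ms: np.ndarray,
               ratio: float = 2.0, min_samples: int = 200) -> bool:
    """Fire when the current window's p95 TTFT reaches `ratio` times the baseline p95.

    Requiring a minimum sample count avoids paging on a handful of slow requests.
    """
    if len(current_ms) < min_samples:
        return False
    baseline_p95 = np.percentile(baseline_ms, 95)
    current_p95 = np.percentile(current_ms, 95)
    return current_p95 >= ratio * baseline_p95

# Synthetic example: last week's TTFT samples vs. a degraded recent window (ms).
rng = np.random.default_rng(7)
baseline = rng.gamma(4, 120, 5000)
current = rng.gamma(4, 260, 400)
print("page on TTFT p95:", ttft_alert(baseline, current))
```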
Tooling
Common tools include Langfuse, LangSmith, Arize Phoenix, Helicone, Weights & Biases, and OpenTelemetry-based pipelines. The exact vendor matters less than the trace model. If your trace does not show retrieval, prompt version, model route, tool calls, token usage, and cost, you will still be debugging blind.
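With an OpenTelemetry-based pipeline, the fields from the trace schema above become span attributes. The sketch below reuses this tutorial's field names as attribute keys and exports to the console for demonstration; the OpenTelemetry GenAI semantic conventions define standard gen_ai.* keys that are worth preferring wherever they cover the same field:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for demonstration; production would export to a collector.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("llm.orchestrator")

def answer_question(query: str) -> str:
    with tracer.start_as_current_span("model.call") as span:
        # Attribute keys mirror this tutorial's trace schema; swap in the
        # standardized gen_ai.* keys where the semantic conventions define them.
        span.set_attribute("prompt_version", "support_v14")
        span.set_attribute("model", "model_route_mid")
        span.set_attribute("retrieved_documents", 5)
        answer = "stub answer"              # stand-in for the real model client
        span.set_attribute("input_tokens", 1840)
        span.set_attribute("output_tokens", 322)
        span.set_attribute("cache_hit", False)
        return answer

answer_question("how do I rotate my API key?")
```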
Sources and receipts
- OpenTelemetry, “Semantic conventions for generative AI systems”: https://opentelemetry.io/docs/specs/semconv/gen-ai/
- Langfuse, “LLM Observability and Application Tracing”: https://langfuse.com/docs/observability/overview
- Arize Phoenix documentation: https://arize.com/docs/phoenix
- Helicone documentation: https://docs.helicone.ai/getting-started/platform-overview
