Production LLM Systems Tutorial 6: Evaluation and A/B Testing
Tutorial Series
- End-to-End Application Design
- Latency, Cost, and Quality
- Scalable Inference Architecture
- RAG and Data Pipelines
- Monitoring and Observability
- Evaluation and A/B Testing
- Security and Prompt Injection
- Human-in-the-Loop Workflows
- Cost Optimization
- Versioning and Disaster Recovery
LLM systems need evals before they need clever prompts.
Without evals, every change is a debate. With evals, every change becomes a measured trade-off. The goal is not to create one perfect score. The goal is to catch regressions, compare options, and understand which layer failed.
This tutorial builds an evaluation program for production LLM systems.
Step 1: Split evals by layer
Do not grade only the final answer. Break the system into layers:
| Layer | Questions to ask |
|---|---|
| Retrieval | Did we find the right evidence? |
| Context packing | Did we include useful context without distraction? |
| Generation | Did the model answer correctly and use evidence? |
| Tool use | Did it call the right tool with valid arguments? |
| Policy | Did it refuse, redact, or escalate when required? |
| UX | Was the answer useful, concise, and actionable? |
If final-answer quality drops, layer-specific evals tell you where to look.
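As a minimal sketch, per-layer results can be recorded per test case so a failing final answer points at a specific layer; the `LayerResult` and `EvalRecord` names and fields below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class LayerResult:
    layer: str          # "retrieval", "context_packing", "generation", "tool_use", ...
    passed: bool
    score: float        # 0.0 to 1.0; the metric depends on the layer
    notes: str = ""

@dataclass
class EvalRecord:
    case_id: str
    layers: list[LayerResult] = field(default_factory=list)

    def first_failing_layer(self) -> str | None:
        # Walk layers in pipeline order and report the first failure,
        # so a bad final answer points at the layer to debug.
        for result in self.layers:
            if not result.passed:
                return result.layer
        return None
```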
Step 2: Build a golden set
A golden set is a curated dataset of tasks with expected behavior.
For a RAG system, each row can include:
```json
{
  "id": "refund_042",
  "user_question": "Can an enterprise customer get a refund after 45 days?",
  "required_documents": ["refund_policy_v7_section_4"],
  "expected_answer_points": [
    "standard refund window is 30 days",
    "enterprise exception requires approval",
    "support should open an escalation ticket"
  ],
  "must_not_include": [
    "guaranteed refund"
  ],
  "risk_level": "medium"
}
```

Golden sets should include normal cases, edge cases, adversarial inputs, stale-document cases, and examples where the correct behavior is refusal or escalation.
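As a sketch, one way to score a single golden-set row is plain substring matching against the expected points and forbidden phrases; a production harness would usually replace this with semantic matching or an LLM judge. The `check_row` helper is hypothetical.

```python
# A sketch of a per-row check using plain substring matching; production
# harnesses usually swap this for semantic matching or an LLM judge.

def check_row(row: dict, answer: str, retrieved_doc_ids: list[str]) -> dict:
    answer_lower = answer.lower()
    return {
        "id": row["id"],
        # Retrieval layer: did we surface every required document?
        "context_recall_ok": all(doc in retrieved_doc_ids
                                 for doc in row["required_documents"]),
        # Generation layer: which expected points are missing?
        "missing_points": [p for p in row["expected_answer_points"]
                           if p.lower() not in answer_lower],
        # Policy layer: did any forbidden claim slip in?
        "forbidden_hits": [p for p in row["must_not_include"]
                           if p.lower() in answer_lower],
    }
```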
Step 3: Use LLM judges carefully
LLM-as-judge is useful, especially for semantic quality, but it is not neutral.
Common judge biases:
- verbosity bias: longer answers look better
- position bias: first answer may win more often
- style bias: confident tone can hide errors
- model-family bias: the judge may prefer wording similar to its own outputs
Controls:
- randomize answer order in pairwise evals
- hide model identity
- ask for evidence-based grading
- require short rationales
- calibrate against human labels
- track judge drift when the judge model changes
Pairwise preference is often better than absolute scoring for subjective tasks. Instead of asking “Is this answer an 8?”, ask “Which answer better satisfies the rubric and why?”
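A minimal sketch of a pairwise judge call with the order randomization described above; `call_judge` is a placeholder for whatever LLM client you use, and the prompt text is illustrative rather than a tested rubric.

```python
import random

PAIRWISE_PROMPT = """You are grading two answers against this rubric:
{rubric}

Question: {question}
Answer A: {a}
Answer B: {b}

Reply with exactly one line: "A" or "B", then a one-sentence rationale
citing evidence from the answers."""

def pairwise_judge(question: str, rubric: str, ans_prod: str, ans_new: str,
                   call_judge) -> str:
    # Randomize position so the judge cannot systematically favor the first
    # answer, and never tell it which system produced which answer.
    swapped = random.random() < 0.5
    a, b = (ans_new, ans_prod) if swapped else (ans_prod, ans_new)
    verdict = call_judge(PAIRWISE_PROMPT.format(
        rubric=rubric, question=question, a=a, b=b))
    winner_is_a = verdict.strip().split()[0].upper().startswith("A")
    # Map the positional verdict back to the underlying system.
    return "new" if winner_is_a == swapped else "prod"
```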
Step 4: Use RAG-specific metrics
For RAG, use retrieval and grounding metrics:
| Metric | Meaning |
|---|---|
| Context recall | Did retrieval include required evidence? |
| Context precision | Was retrieved context mostly useful? |
| Faithfulness | Is the answer grounded in context? |
| Answer relevance | Does the answer address the question? |
| Citation correctness | Do citations support the claims? |
RAGAS popularized several of these metric families. Use them as smoke detectors, not final truth. A low faithfulness score should trigger investigation. A high score should not bypass human review for high-risk domains.
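Context recall and precision can be computed directly against the golden set's `required_documents`; faithfulness and answer relevance usually need an LLM judge or a library such as Ragas, so they are left out of this minimal sketch.

```python
def context_recall(required_ids: list[str], retrieved_ids: list[str]) -> float:
    # Fraction of required evidence that retrieval actually surfaced.
    if not required_ids:
        return 1.0
    found = sum(1 for doc_id in required_ids if doc_id in retrieved_ids)
    return found / len(required_ids)

def context_precision(required_ids: list[str], retrieved_ids: list[str]) -> float:
    # Fraction of retrieved chunks that were actually useful.
    if not retrieved_ids:
        return 0.0
    useful = sum(1 for doc_id in retrieved_ids if doc_id in required_ids)
    return useful / len(retrieved_ids)

print(context_recall(["refund_policy_v7_section_4"],
                     ["refund_policy_v7_section_4", "pricing_faq_2"]))    # 1.0
print(context_precision(["refund_policy_v7_section_4"],
                        ["refund_policy_v7_section_4", "pricing_faq_2"]))  # 0.5
```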
Step 5: Put evals in CI
Every prompt, retrieval, routing, and model change should run evals before deployment.
Minimum release gate:
- run unit tests
- run schema validation
- run the offline golden set
- run the safety set
- run a regression comparison against current production
- block the deploy if a critical score drops
- allow the deploy if the trade-off is approved

Store results with:
- git commit
- prompt version
- model version
- embedding version
- retrieval index version
- eval dataset version
- judge model version
If you cannot reproduce an eval, you cannot trust the release decision.
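A minimal sketch of the regression-gate step, assuming eval scores are written to JSON files and that faithfulness and safety pass rate are the critical metrics; the file paths, metric names, and thresholds are all illustrative.

```python
import json
import subprocess
import sys

CRITICAL = {"faithfulness": 0.02, "safety_pass_rate": 0.0}  # max allowed drop per metric

def gate(baseline_path: str, candidate_path: str) -> int:
    baseline = json.load(open(baseline_path))
    candidate = json.load(open(candidate_path))
    failures = []
    for metric, max_drop in CRITICAL.items():
        drop = baseline[metric] - candidate[metric]
        if drop > max_drop:
            failures.append(f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f}")
    # Record provenance so the run can be reproduced later.
    candidate["git_commit"] = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True).stdout.strip()
    json.dump(candidate, open(candidate_path, "w"), indent=2)
    if failures:
        print("Blocking deploy, critical regressions:", failures)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(gate("eval/baseline.json", "eval/candidate.json"))
```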
Step 6: Move from shadow to canary
Online rollout should be staged:
- Shadow traffic: run new route silently, do not show output.
- Internal canary: send trusted users to new route.
- Small external canary: 1 to 5 percent traffic.
- A/B test: compare metrics with enough sample size.
- Full rollout: keep rollback path.
LLM output variance is high. Small samples lie. Use larger sample sizes than you would expect for deterministic software changes, and segment by task type. A prompt may improve summarization while hurting tool calling.
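To size the A/B step, a standard two-proportion approximation gives a rough per-arm sample size for a change in task success rate; the baseline and target rates below are illustrative.

```python
import math

def sample_size_per_arm(p_baseline: float, p_new: float) -> int:
    # z-values hard-coded for alpha=0.05 (two-sided) and 80% power;
    # swap in scipy.stats.norm.ppf if you need other settings.
    z_alpha, z_beta = 1.96, 0.84
    variance = p_baseline * (1 - p_baseline) + p_new * (1 - p_new)
    effect = (p_new - p_baseline) ** 2
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect)

# Detecting an 85% -> 88% lift in task success needs roughly 2,000 requests
# per arm, and per-segment tests need that much in each segment.
print(sample_size_per_arm(0.85, 0.88))
```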
Step 7: Measure cost and quality together
A cheaper route that reduces task completion is not cheaper. Track:
- cost per request
- cost per successful task
- escalation rate
- re-ask rate
- user correction rate
- latency
- quality score
Example:
| Route | Cost/request | Success | Cost/success |
|---|---|---|---|
| Large model | $0.040 | 92% | $0.043 |
| Small model | $0.010 | 60% | $0.017 |
| Router | $0.018 | 89% | $0.020 |
The router may be the best system even though it does not have the cheapest cost per request.
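The cost-per-success column is just cost per request divided by success rate; a quick sketch with the numbers from the table above:

```python
routes = {
    "large_model": {"cost_per_request": 0.040, "success_rate": 0.92},
    "small_model": {"cost_per_request": 0.010, "success_rate": 0.60},
    "router":      {"cost_per_request": 0.018, "success_rate": 0.89},
}

for name, r in routes.items():
    # A failed request still costs money, so divide by the success rate.
    cost_per_success = r["cost_per_request"] / r["success_rate"]
    print(f"{name}: ${cost_per_success:.3f} per successful task")
```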
Step 8: Keep human review in the loop
Human review is how evals improve. Review samples where:
- judges disagree
- confidence is low
- user feedback is negative
- answers affect money, health, legal, or safety outcomes
- a new route changes behavior
Feed reviewed examples back into the golden set. Your eval set should grow from production reality, not only from synthetic examples.
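A sketch of a review-queue filter that encodes the sampling rules above; the trace fields and thresholds are illustrative, not a fixed schema.

```python
HIGH_RISK_TOPICS = {"billing", "medical", "legal", "safety"}

def needs_human_review(trace: dict) -> bool:
    return (
        trace.get("judge_disagreement", False)       # pairwise judges split
        or trace.get("judge_confidence", 1.0) < 0.6  # low judge confidence
        or trace.get("user_feedback") == "negative"  # explicit thumbs-down
        or trace.get("topic") in HIGH_RISK_TOPICS    # money, health, legal, safety
        or trace.get("route_changed", False)         # behavior changed by a new route
    )

# Example: filter a batch of traces pulled from your observability store.
traces = [{"topic": "billing", "judge_confidence": 0.9}]
review_queue = [t for t in traces if needs_human_review(t)]
```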
Sources and receipts
- Ragas metrics documentation: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- Arize Phoenix, LLM evaluations: https://arize.com/docs/phoenix/evaluation/llm-evals
- Langfuse evaluation documentation: https://langfuse.com/docs/
- OpenAI Batch API guide, useful for offline eval jobs: https://platform.openai.com/docs/guides/batch
