Production LLM Systems Tutorial 6: Evaluation and A/B Testing
Tutorial Series
- End-to-End Application Design
- Latency, Cost, and Quality
- Scalable Inference Architecture
- RAG and Data Pipelines
- Monitoring and Observability
- Evaluation and A/B Testing
- Security and Prompt Injection
- Human-in-the-Loop Workflows
- Cost Optimization
- Versioning and Disaster Recovery
LLM systems need evals before they need clever prompts.
Without evals, every change is a debate. With evals, every change becomes a measured trade-off. The goal is not to create one perfect score. The goal is to catch regressions, compare options, and understand which layer failed.
This tutorial builds an evaluation program for production LLM systems.
Step 1: Split evals by layer
Do not grade only the final answer. Break the system into layers:
| Layer | Questions to ask |
|---|---|
| Retrieval | Did we find the right evidence? |
| Context packing | Did we include useful context without distraction? |
| Generation | Did the model answer correctly and use evidence? |
| Tool use | Did it call the right tool with valid arguments? |
| Policy | Did it refuse, redact, or escalate when required? |
| UX | Was the answer useful, concise, and actionable? |
If final-answer quality drops, layer-specific evals tell you where to look.
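As a minimal sketch, per-layer results can be recorded per test case so a failing final answer points at a specific layer; the `LayerResult` and `EvalRecord` names and fields below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class LayerResult:
    layer: str          # "retrieval", "context_packing", "generation", "tool_use", ...
    passed: bool
    score: float        # 0.0 to 1.0; the metric depends on the layer
    notes: str = ""

@dataclass
class EvalRecord:
    case_id: str
    layers: list[LayerResult] = field(default_factory=list)

    def first_failing_layer(self) -> str | None:
        # Walk layers in pipeline order and report the first failure,
        # so a bad final answer points at the layer to debug.
        for result in self.layers:
            if not result.passed:
                return result.layer
        return None
```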
Step 2: Build a golden set
A golden set is a curated dataset of tasks with expected behavior.
For a RAG system, each row can include:
```json
{
  "id": "refund_042",
  "user_question": "Can an enterprise customer get a refund after 45 days?",
  "required_documents": ["refund_policy_v7_section_4"],
  "expected_answer_points": [
    "standard refund window is 30 days",
    "enterprise exception requires approval",
    "support should open an escalation ticket"
  ],
  "must_not_include": [
    "guaranteed refund"
  ],
  "risk_level": "medium"
}
```

Golden sets should include normal cases, edge cases, adversarial inputs, stale-document cases, and examples where the correct behavior is refusal or escalation.
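As a sketch, one way to score a single golden-set row is plain substring matching against the expected points and forbidden phrases; a production harness would usually replace this with semantic matching or an LLM judge. The `check_row` helper is hypothetical.

```python
# A sketch of a per-row check using plain substring matching; production
# harnesses usually swap this for semantic matching or an LLM judge.

def check_row(row: dict, answer: str, retrieved_doc_ids: list[str]) -> dict:
    answer_lower = answer.lower()
    return {
        "id": row["id"],
        # Retrieval layer: did we surface every required document?
        "context_recall_ok": all(doc in retrieved_doc_ids
                                 for doc in row["required_documents"]),
        # Generation layer: which expected points are missing?
        "missing_points": [p for p in row["expected_answer_points"]
                           if p.lower() not in answer_lower],
        # Policy layer: did any forbidden claim slip in?
        "forbidden_hits": [p for p in row["must_not_include"]
                           if p.lower() in answer_lower],
    }
```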
Step 3: Use LLM judges carefully
LLM-as-judge is useful, especially for semantic quality, but it is not neutral.
Common judge biases:
- verbosity bias: longer answers look better
- position bias: first answer may win more often
- style bias: confident tone can hide errors
- model-family bias: the judge may prefer wording similar to its own outputs
Controls:
- randomize answer order in pairwise evals
- hide model identity
- ask for evidence-based grading
- require short rationales
- calibrate against human labels
- track judge drift when the judge model changes
Pairwise preference is often better than absolute scoring for subjective tasks. Instead of asking “Is this answer an 8?”, ask “Which answer better satisfies the rubric and why?”
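A minimal sketch of a pairwise judge call with the order randomization described above; `call_judge` is a placeholder for whatever LLM client you use, and the prompt text is illustrative rather than a tested rubric.

```python
import random

PAIRWISE_PROMPT = """You are grading two answers against this rubric:
{rubric}

Question: {question}
Answer A: {a}
Answer B: {b}

Reply with exactly one line: "A" or "B", then a one-sentence rationale
citing evidence from the answers."""

def pairwise_judge(question: str, rubric: str, ans_prod: str, ans_new: str,
                   call_judge) -> str:
    # Randomize position so the judge cannot systematically favor the first
    # answer, and never tell it which system produced which answer.
    swapped = random.random() < 0.5
    a, b = (ans_new, ans_prod) if swapped else (ans_prod, ans_new)
    verdict = call_judge(PAIRWISE_PROMPT.format(
        rubric=rubric, question=question, a=a, b=b))
    winner_is_a = verdict.strip().split()[0].upper().startswith("A")
    # Map the positional verdict back to the underlying system.
    return "new" if winner_is_a == swapped else "prod"
```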
Step 4: Use RAG-specific metrics
For RAG, use retrieval and grounding metrics:
| Metric | Meaning |
|---|---|
| Context recall | Did retrieval include required evidence? |
| Context precision | Was retrieved context mostly useful? |
| Faithfulness | Is the answer grounded in context? |
| Answer relevance | Does the answer address the question? |
| Citation correctness | Do citations support the claims? |
RAGAS popularized several of these metric families. Use them as smoke detectors, not final truth. A low faithfulness score should trigger investigation. A high score should not bypass human review for high-risk domains.
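Context recall and precision can be computed directly against the golden set's `required_documents`; faithfulness and answer relevance usually need an LLM judge or a library such as Ragas, so they are left out of this minimal sketch.

```python
def context_recall(required_ids: list[str], retrieved_ids: list[str]) -> float:
    # Fraction of required evidence that retrieval actually surfaced.
    if not required_ids:
        return 1.0
    found = sum(1 for doc_id in required_ids if doc_id in retrieved_ids)
    return found / len(required_ids)

def context_precision(required_ids: list[str], retrieved_ids: list[str]) -> float:
    # Fraction of retrieved chunks that were actually useful.
    if not retrieved_ids:
        return 0.0
    useful = sum(1 for doc_id in retrieved_ids if doc_id in required_ids)
    return useful / len(retrieved_ids)

print(context_recall(["refund_policy_v7_section_4"],
                     ["refund_policy_v7_section_4", "pricing_faq_2"]))    # 1.0
print(context_precision(["refund_policy_v7_section_4"],
                        ["refund_policy_v7_section_4", "pricing_faq_2"]))  # 0.5
```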
Step 5: Put evals in CI
Every prompt, retrieval, routing, and model change should run evals before deployment.
Minimum release gate:
- run unit tests
- run schema validation
- run the offline golden set
- run the safety set
- run a regression comparison against current production
- block the deploy if a critical score drops
- allow the deploy if the trade-off is approved

Store results with:
- git commit
- prompt version
- model version
- embedding version
- retrieval index version
- eval dataset version
- judge model version
If you cannot reproduce an eval, you cannot trust the release decision.
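A minimal sketch of the regression-gate step, assuming eval scores are written to JSON files and that faithfulness and safety pass rate are the critical metrics; the file paths, metric names, and thresholds are all illustrative.

```python
import json
import subprocess
import sys

CRITICAL = {"faithfulness": 0.02, "safety_pass_rate": 0.0}  # max allowed drop per metric

def gate(baseline_path: str, candidate_path: str) -> int:
    baseline = json.load(open(baseline_path))
    candidate = json.load(open(candidate_path))
    failures = []
    for metric, max_drop in CRITICAL.items():
        drop = baseline[metric] - candidate[metric]
        if drop > max_drop:
            failures.append(f"{metric}: {baseline[metric]:.3f} -> {candidate[metric]:.3f}")
    # Record provenance so the run can be reproduced later.
    candidate["git_commit"] = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True).stdout.strip()
    json.dump(candidate, open(candidate_path, "w"), indent=2)
    if failures:
        print("Blocking deploy, critical regressions:", failures)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(gate("eval/baseline.json", "eval/candidate.json"))
```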
Step 6: Move from shadow to canary
Online rollout should be staged:
- Shadow traffic: run new route silently, do not show output.
- Internal canary: send trusted users to new route.
- Small external canary: 1 to 5 percent traffic.
- A/B test: compare metrics with enough sample size.
- Full rollout: keep rollback path.
LLM output variance is high. Small samples lie. Use larger sample sizes than you would expect for deterministic software changes, and segment by task type. A prompt may improve summarization while hurting tool calling.
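To size the A/B step, a standard two-proportion approximation gives a rough per-arm sample size for a change in task success rate; the baseline and target rates below are illustrative.

```python
import math

def sample_size_per_arm(p_baseline: float, p_new: float) -> int:
    # z-values hard-coded for alpha=0.05 (two-sided) and 80% power;
    # swap in scipy.stats.norm.ppf if you need other settings.
    z_alpha, z_beta = 1.96, 0.84
    variance = p_baseline * (1 - p_baseline) + p_new * (1 - p_new)
    effect = (p_new - p_baseline) ** 2
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect)

# Detecting an 85% -> 88% lift in task success needs roughly 2,000 requests
# per arm, and per-segment tests need that much in each segment.
print(sample_size_per_arm(0.85, 0.88))
```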
Step 7: Measure cost and quality together
A cheaper route that reduces task completion is not cheaper. Track:
- cost per request
- cost per successful task
- escalation rate
- re-ask rate
- user correction rate
- latency
- quality score
Example:
| Route | Cost/request | Success | Cost/success |
|---|---|---|---|
| Large model | $0.040 | 92% | $0.043 |
| Small model | $0.010 | 60% | $0.017 |
| Router | $0.018 | 89% | $0.020 |
The router may be the best system even though it does not have the cheapest cost per request.
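The cost-per-success column is just cost per request divided by success rate; a quick sketch with the numbers from the table above:

```python
routes = {
    "large_model": {"cost_per_request": 0.040, "success_rate": 0.92},
    "small_model": {"cost_per_request": 0.010, "success_rate": 0.60},
    "router":      {"cost_per_request": 0.018, "success_rate": 0.89},
}

for name, r in routes.items():
    # A failed request still costs money, so divide by the success rate.
    cost_per_success = r["cost_per_request"] / r["success_rate"]
    print(f"{name}: ${cost_per_success:.3f} per successful task")
```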
Step 8: Keep human review in the loop
Human review is how evals improve. Review samples where:
- judges disagree
- confidence is low
- user feedback is negative
- answers affect money, health, legal, or safety outcomes
- a new route changes behavior
Feed reviewed examples back into the golden set. Your eval set should grow from production reality, not only from synthetic examples.
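A sketch of a review-queue filter that encodes the sampling rules above; the trace fields and thresholds are illustrative, not a fixed schema.

```python
HIGH_RISK_TOPICS = {"billing", "medical", "legal", "safety"}

def needs_human_review(trace: dict) -> bool:
    return (
        trace.get("judge_disagreement", False)       # pairwise judges split
        or trace.get("judge_confidence", 1.0) < 0.6  # low judge confidence
        or trace.get("user_feedback") == "negative"  # explicit thumbs-down
        or trace.get("topic") in HIGH_RISK_TOPICS    # money, health, legal, safety
        or trace.get("route_changed", False)         # behavior changed by a new route
    )

# Example: filter a batch of traces pulled from your observability store.
traces = [{"topic": "billing", "judge_confidence": 0.9}]
review_queue = [t for t in traces if needs_human_review(t)]
```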
Sources and receipts
- Ragas metrics documentation: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- Arize Phoenix, LLM evaluations: https://arize.com/docs/phoenix/evaluation/llm-evals
- Langfuse evaluation documentation: https://langfuse.com/docs/
- OpenAI Batch API guide, useful for offline eval jobs: https://platform.openai.com/docs/guides/batch
