Production LLM Systems Tutorial 6: Evaluation and A/B Testing

Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

LLM systems need evals before they need clever prompts.

Without evals, every change is a debate. With evals, every change becomes a measured trade-off. The goal is not to create one perfect score. The goal is to catch regressions, compare options, and understand which layer failed.

This tutorial builds an evaluation program for production LLM systems.

[Figure: LLM evaluation and release loop, from change to golden set, offline evaluation, CI gate, canary, online experiment, and feedback store.]
Evaluation turns LLM changes into a measured release process.

Step 1: Split evals by layer

Do not grade only the final answer. Break the system into layers:

| Layer | Questions to ask |
| --- | --- |
| Retrieval | Did we find the right evidence? |
| Context packing | Did we include useful context without distraction? |
| Generation | Did the model answer correctly and use evidence? |
| Tool use | Did it call the right tool with valid arguments? |
| Policy | Did it refuse, redact, or escalate when required? |
| UX | Was the answer useful, concise, and actionable? |

If final-answer quality drops, layer-specific evals tell you where to look.
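The layered split can be sketched as a tiny harness in which each layer gets its own checks, so a drop in final-answer quality points at a specific layer. All names and check logic here are illustrative assumptions, not a real framework:

```python
# Illustrative layered eval harness: each layer gets its own checks, so a
# failing final answer can be traced to the layer that broke.

def run_layer(name, checks, example):
    """Run every check for one layer and return its pass rate."""
    results = [check(example) for check in checks]
    return name, sum(results) / len(results)

example = {
    "retrieved_ids": ["refund_policy_v7_section_4"],
    "required_ids": ["refund_policy_v7_section_4"],
    "answer": "The standard refund window is 30 days.",
}

layers = {
    # Retrieval layer: did we find the required evidence?
    "retrieval": [lambda ex: set(ex["required_ids"]) <= set(ex["retrieved_ids"])],
    # Generation layer: did the answer state the key fact?
    "generation": [lambda ex: "30 days" in ex["answer"]],
}

report = dict(run_layer(name, checks, example) for name, checks in layers.items())
print(report)  # {'retrieval': 1.0, 'generation': 1.0}
```

Because each layer reports separately, a regression shows up as one layer's pass rate falling while the others stay flat.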

Step 2: Build a golden set

A golden set is a curated dataset of tasks with expected behavior.

For a RAG system, each row can include:

```json
{
  "id": "refund_042",
  "user_question": "Can an enterprise customer get a refund after 45 days?",
  "required_documents": ["refund_policy_v7_section_4"],
  "expected_answer_points": [
    "standard refund window is 30 days",
    "enterprise exception requires approval",
    "support should open an escalation ticket"
  ],
  "must_not_include": [
    "guaranteed refund"
  ],
  "risk_level": "medium"
}
```

Golden sets should include normal cases, edge cases, adversarial inputs, stale-document cases, and examples where the correct behavior is refusal or escalation.
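A minimal grader for such a row might check point coverage and forbidden phrases. The field names follow the sample row above; substring matching is a deliberately weak proxy (real graders use semantic matching or a calibrated judge):

```python
# Grading one answer against a golden-set row. Substring matching is a
# simplifying assumption, not a recommended production grader.

def grade(row, answer):
    text = answer.lower()
    covered = [p for p in row["expected_answer_points"] if p.lower() in text]
    violations = [p for p in row["must_not_include"] if p.lower() in text]
    return {
        "id": row["id"],
        "coverage": len(covered) / len(row["expected_answer_points"]),
        "violations": violations,
        "passed": not violations and len(covered) == len(row["expected_answer_points"]),
    }

row = {
    "id": "refund_042",
    "expected_answer_points": [
        "standard refund window is 30 days",
        "enterprise exception requires approval",
        "support should open an escalation ticket",
    ],
    "must_not_include": ["guaranteed refund"],
}
answer = (
    "The standard refund window is 30 days. An enterprise exception requires "
    "approval, and support should open an escalation ticket."
)
result = grade(row, answer)
print(result["coverage"], result["passed"])  # 1.0 True
```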

Step 3: Use LLM judges carefully

LLM-as-judge is useful, especially for semantic quality, but it is not neutral.

Common judge biases:

  • verbosity bias: longer answers look better
  • position bias: first answer may win more often
  • style bias: confident tone can hide errors
  • model-family bias: the judge may prefer wording similar to its own outputs

Controls:

  • randomize answer order in pairwise evals
  • hide model identity
  • ask for evidence-based grading
  • require short rationales
  • calibrate against human labels
  • track judge drift when the judge model changes

Pairwise preference is often better than absolute scoring for subjective tasks. Instead of asking “Is this answer an 8?”, ask “Which answer better satisfies the rubric and why?”
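Two of the controls above, randomized answer order and hidden model identity, can be sketched as follows. `call_judge` is a hypothetical stand-in for your judge-model API:

```python
import random

# Bias controls for pairwise judging: randomize which model lands in slot A,
# keep model identity out of the prompt, then map the verdict back.

def pairwise_judge(answer_by_model, rubric, call_judge, rng=random):
    models = list(answer_by_model)
    rng.shuffle(models)  # position-bias control: random A/B assignment
    prompt = (
        f"Rubric: {rubric}\n"
        f"Answer A: {answer_by_model[models[0]]}\n"
        f"Answer B: {answer_by_model[models[1]]}\n"
        "Which answer better satisfies the rubric? Reply 'A' or 'B' "
        "with a short, evidence-based rationale."
    )
    verdict = call_judge(prompt)
    winner_slot = verdict.strip()[0]  # expects the reply to start with 'A' or 'B'
    return models[0] if winner_slot == "A" else models[1]

class _FixedRng:
    """No-op shuffle so this demo is deterministic; real runs use random."""
    def shuffle(self, seq):
        pass

answers = {"model_x": "Refunds close after 30 days.", "model_y": "Maybe 30 days?"}
winner = pairwise_judge(
    answers, "prefer grounded, specific claims",
    call_judge=lambda prompt: "A: states the policy window explicitly",
    rng=_FixedRng(),
)
print(winner)  # model_x (the model that landed in slot A)
```

Mapping the slot back to the model, rather than logging raw "A beats B" verdicts, is what makes position-bias analysis possible later.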

[Figure: LLM judge calibration process, with human labels, judge prompt, bias checks, agreement measurement, and pairwise preference.]
LLM judges are measurement instruments. Calibrate them against human labels and known bias modes.

Step 4: Use RAG-specific metrics

For RAG, use retrieval and grounding metrics:

| Metric | Meaning |
| --- | --- |
| Context recall | Did retrieval include required evidence? |
| Context precision | Was retrieved context mostly useful? |
| Faithfulness | Is the answer grounded in context? |
| Answer relevance | Does the answer address the question? |
| Citation correctness | Do citations support the claims? |

RAGAS popularized several of these metric families. Use them as smoke detectors, not final truth. A low faithfulness score should trigger investigation. A high score should not bypass human review for high-risk domains.
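A set-based sketch of context recall and precision over document IDs looks like this. Real implementations such as RAGAS score at the claim or sentence level, so treat this as a simplification:

```python
# Simplified retrieval metrics over document-ID sets, not claim-level scoring.

def context_recall(retrieved_ids, required_ids):
    """Fraction of required evidence documents that retrieval found."""
    required = set(required_ids)
    return len(required & set(retrieved_ids)) / len(required)

def context_precision(retrieved_ids, useful_ids):
    """Fraction of retrieved documents that were actually useful."""
    retrieved = set(retrieved_ids)
    return len(retrieved & set(useful_ids)) / len(retrieved)

retrieved = [
    "refund_policy_v7_section_4",
    "refund_policy_v7_section_1",
    "pricing_faq",
    "onboarding_guide",
]
recall = context_recall(retrieved, ["refund_policy_v7_section_4"])
precision = context_precision(
    retrieved, ["refund_policy_v7_section_4", "refund_policy_v7_section_1"]
)
print(recall, precision)  # 1.0 0.5
```

Note how the two metrics diverge: retrieval found everything required (recall 1.0) while half the packed context was distraction (precision 0.5), which is exactly the failure the context-packing layer should catch.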

Step 5: Put evals in CI

Every prompt, retrieval, routing, and model change should run evals before deployment.

Minimum release gate:

  1. Run unit tests.
  2. Run schema validation.
  3. Run the offline golden set.
  4. Run the safety set.
  5. Run a regression comparison against current production.
  6. Block the deploy if a critical score drops.
  7. Allow the deploy if the trade-off is explicitly approved.

Store results with:

  • git commit
  • prompt version
  • model version
  • embedding version
  • retrieval index version
  • eval dataset version
  • judge model version

If you cannot reproduce an eval, you cannot trust the release decision.
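A minimal sketch of the regression gate, comparing a candidate's scores to the production baseline; the metric names and the two-point tolerance here are assumptions:

```python
# Assumed gate: block if any critical metric drops more than `max_drop`
# below the production baseline.

def release_gate(baseline, candidate, critical_metrics, max_drop=0.02):
    regressions = []
    for metric in critical_metrics:
        if candidate[metric] < baseline[metric] - max_drop:
            regressions.append((metric, baseline[metric], candidate[metric]))
    return {"deploy": not regressions, "regressions": regressions}

baseline = {"faithfulness": 0.91, "safety_pass_rate": 0.99}
candidate = {"faithfulness": 0.92, "safety_pass_rate": 0.95}

decision = release_gate(baseline, candidate, ["faithfulness", "safety_pass_rate"])
print(decision)  # safety_pass_rate dropped past tolerance, so deploy is blocked
```

In practice, store each decision alongside the version metadata listed above so the eval run, and therefore the release decision, is reproducible.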

Step 6: Move from shadow to canary

Online rollout should be staged:

  1. Shadow traffic: run new route silently, do not show output.
  2. Internal canary: send trusted users to new route.
  3. Small external canary: 1 to 5 percent traffic.
  4. A/B test: compare metrics with enough sample size.
  5. Full rollout: keep rollback path.

LLM output variance is high. Small samples lie. Use larger sample sizes than you would expect for deterministic software changes, and segment by task type. A prompt may improve summarization while hurting tool calling.
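A two-proportion z-test is one way to check whether an observed success-rate difference clears a significance bar. The 1.96 cutoff (roughly 95% confidence, two-sided) and the counts below are illustrative assumptions:

```python
import math

# z-statistic for the difference between two observed success proportions.

def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Same observed rates (89% vs 92%), very different conclusions by sample size:
z_small = two_proportion_z(178, 200, 184, 200)      # below the 1.96 bar
z_large = two_proportion_z(1780, 2000, 1840, 2000)  # above the 1.96 bar
print(round(z_small, 2), round(z_large, 2))
```

This is the "small samples lie" point made concrete: the same three-point gap is noise at 200 requests per arm and a real signal at 2,000.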

Step 7: Measure cost and quality together

A cheaper route that reduces task completion is not cheaper. Track:

  • cost per request
  • cost per successful task
  • escalation rate
  • re-ask rate
  • user correction rate
  • latency
  • quality score

Example:

| Route | Cost/request | Success | Cost/success |
| --- | --- | --- | --- |
| Large model | $0.040 | 92% | $0.043 |
| Small model | $0.010 | 60% | $0.017 |
| Router | $0.018 | 89% | $0.020 |

The router may be the best system even though it has neither the cheapest cost per request nor the highest success rate.
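Extending the table's arithmetic with an assumed $0.50 human-escalation cost per failed task (a made-up number for illustration) shows why the router can win as a system:

```python
# Route economics with an assumed escalation cost. The per-request costs and
# success rates come from the table above; the $0.50 figure is invented.

routes = {
    "large_model": {"cost_per_request": 0.040, "success": 0.92},
    "small_model": {"cost_per_request": 0.010, "success": 0.60},
    "router": {"cost_per_request": 0.018, "success": 0.89},
}
ESCALATION_COST = 0.50  # assumed cost of a human handling each failed task

def system_cost_per_success(route):
    """Model cost plus expected escalation cost, per successful task."""
    failure_rate = 1 - route["success"]
    return (route["cost_per_request"] + failure_rate * ESCALATION_COST) / route["success"]

ranked = sorted(routes, key=lambda name: system_cost_per_success(routes[name]))
print(ranked[0])  # the router wins once failure cost is priced in
```

The small model's $0.017 cost per success looks best in isolation, but its 40% failure rate dominates once failed tasks carry a human cost.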

Step 8: Keep human review in the loop

Human review is how evals improve. Review samples where:

  • judges disagree
  • confidence is low
  • user feedback is negative
  • answers affect money, health, legal, or safety outcomes
  • a new route changes behavior

Feed reviewed examples back into the golden set. Your eval set should grow from production reality, not only from synthetic examples.
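The triggers above can be sketched as a review-queue filter; the field names and the confidence threshold are assumptions:

```python
# Route production samples into a human review queue using the listed
# triggers. Thresholds and field names are illustrative.

HIGH_RISK_DOMAINS = {"money", "health", "legal", "safety"}

def needs_review(sample):
    return (
        sample.get("judges_disagree", False)
        or sample.get("confidence", 1.0) < 0.6
        or sample.get("user_feedback") == "negative"
        or bool(HIGH_RISK_DOMAINS & set(sample.get("domains", [])))
    )

samples = [
    {"id": "a", "confidence": 0.95, "domains": ["billing"]},
    {"id": "b", "confidence": 0.40, "domains": ["billing"]},  # low confidence
    {"id": "c", "confidence": 0.90, "domains": ["legal"]},    # high-risk domain
]
queue = [s["id"] for s in samples if needs_review(s)]
print(queue)  # ['b', 'c']
```

Reviewed items from this queue are exactly the production-grounded examples worth promoting into the golden set.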

Sources and receipts