Your RAG Demo Passed. Your RAG System Needs a Judge: RAGAS, Humans, and Evidence
RAG demos are forgiving. Production RAG is not.
A demo needs one good answer on stage. A production RAG system needs thousands of correct answers across stale documents, ambiguous questions, bad retrieval, missing permissions, partial context, and users who ask “same thing as yesterday” while changing the noun that matters.
RAG evaluation is hard because failures hide in different layers:
- query understanding
- retrieval recall
- ranking precision
- context assembly
- permission filtering
- answer faithfulness
- citation quality
- final usefulness
If you only grade the final answer, you do not know what broke. If you only grade retrieval, you do not know whether the model used the evidence. You need both automated evals and human judgment.
RAGAS is useful, not a replacement for judgment
RAGAS popularized a practical set of RAG metrics such as faithfulness, answer relevancy, context precision, and context recall. The value is not that any single metric is perfect. The value is that it forces the team to separate “did we retrieve the right evidence?” from “did the model answer using that evidence?”
Typical metric questions:
| Metric area | Question it answers |
|---|---|
| Context recall | Did retrieval find the evidence needed? |
| Context precision | Was the retrieved context mostly useful? |
| Faithfulness | Is the answer grounded in supplied context? |
| Answer relevancy | Did the answer address the user’s question? |
These are excellent smoke detectors. They are not the fire department.
LLM-as-judge metrics can be noisy. They can inherit model bias. They can miss product nuance. They can reward answer style over truth. Use them to find regressions and prioritize review, not to declare final truth without calibration.
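To make the faithfulness idea concrete, here is a deliberately naive lexical grounding check. This is NOT how RAGAS computes faithfulness (RAGAS decomposes the answer into claims and verifies each one with an LLM judge); it only illustrates the shape of the metric: fraction of answer sentences supported by the retrieved context.

```python
# Illustrative only: a naive lexical grounding check, not the RAGAS
# implementation. Thresholds and word filtering are arbitrary choices.

def _content_words(text: str) -> set[str]:
    """Lowercased words longer than 3 characters, punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split()
            if len(w.strip(".,!?")) > 3}

def naive_faithfulness(answer_sentences: list[str], context: str) -> float:
    """Fraction of answer sentences mostly covered by context vocabulary."""
    vocab = _content_words(context)
    supported = 0
    for sentence in answer_sentences:
        words = _content_words(sentence)
        if words and len(words & vocab) / len(words) >= 0.5:
            supported += 1
    return supported / len(answer_sentences) if answer_sentences else 0.0
```

A lexical check like this will miss paraphrase and negation, which is exactly why production frameworks use an LLM judge, and why that judge in turn needs the calibration described below.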
Build a golden dataset the boring way
Your eval set should include:
- common user questions
- rare but important questions
- adversarial questions
- permission-boundary questions
- outdated-document traps
- “answer not found” cases
- ambiguous questions requiring clarification
- multi-hop questions
- citation-required questions
For each item, store:
- question
- expected answer or rubric
- gold supporting documents
- forbidden documents
- tenant / role scope
- freshness expectation
- known failure mode
- human severity label

The “answer not found” cases are critical. A RAG system that always answers is not helpful; it is confident in the way a broken smoke alarm is confident.
Automated metrics: fast but calibrated
Use RAGAS or a similar framework to run frequent automated evals:
- every prompt change
- every retrieval index change
- every chunking change
- every embedding model change
- every reranker change
- every generation model change
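Each of those runs is only useful if it is compared against a baseline. A minimal regression check might look like this; the metric names and tolerance are placeholders, not a standard.

```python
# Compare current eval scores against a stored baseline and flag metrics
# that dropped by more than a tolerance. Names and tolerance are assumptions.
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return {metric: drop} for metrics that fell by more than `tolerance`."""
    return {
        m: baseline[m] - current[m]
        for m in baseline
        if m in current and baseline[m] - current[m] > tolerance
    }
```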
But calibrate the judge.
Take a sample of automated results and have humans label them. Compare:
- where the judge agrees
- where the judge is too generous
- where the judge is too harsh
- which domains are noisy
- which metrics correlate with user outcomes
Do not hide this calibration. Put it in the dashboard.
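The calibration itself can be very simple: both the judge and the humans give a pass/fail verdict per sampled item, and you report agreement plus the two disagreement directions. The boolean label format is an assumption; graded labels work the same way with a threshold.

```python
# Calibrate an LLM judge against human pass/fail labels on the same items.
# "Too generous" = judge passed what humans failed; "too harsh" = the reverse.
def calibrate(judge: list[bool], human: list[bool]) -> dict[str, float]:
    n = len(human)
    agree = sum(j == h for j, h in zip(judge, human))
    too_generous = sum(j and not h for j, h in zip(judge, human))
    too_harsh = sum(h and not j for j, h in zip(judge, human))
    return {
        "agreement": agree / n,
        "judge_too_generous": too_generous / n,
        "judge_too_harsh": too_harsh / n,
    }
```

Slice these numbers by domain and by metric: a judge that is 95% reliable on FAQ questions and 60% reliable on policy questions should not gate releases on policy questions alone.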
Human evals: expensive, indispensable
Humans are best at:
- policy nuance
- domain truth
- subtle missing context
- citation trustworthiness
- answer usefulness
- hallucination severity
- whether the system should have refused or clarified
The rubric matters more than the rating scale.
Bad rubric:
Rate this answer 1-5.

Better rubric:
Is the answer supported by the cited context?
Did it answer the user's actual question?
Did it omit a critical caveat?
Would this answer be safe for a customer to act on?
If wrong, is the severity low, medium, or high?

Production dashboard
Track:
- retrieval recall on gold docs
- context precision
- answer faithfulness
- answer relevancy
- citation coverage
- unsupported claim rate
- “no answer” correctness
- human severity-weighted pass rate
- user correction rate
- source freshness failures
- permission-filter misses
- latency and cost by query class
The most important metric is not a single score. It is the breakdown:
- bad answer because retrieval failed
- bad answer because context was noisy
- bad answer because model ignored evidence
- bad answer because policy was stale
- bad answer because the question was ambiguous

That breakdown tells you what to fix.
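Aggregating that breakdown is trivial once failures carry a root-cause label (assigned by humans or by targeted layer checks). A sketch, with cause names taken from the list above:

```python
from collections import Counter

# Root-cause labels; the labeling itself happens upstream, this aggregates.
CAUSES = ("retrieval_failed", "noisy_context", "ignored_evidence",
          "stale_policy", "ambiguous_question")

def failure_breakdown(labeled_failures: list[str]) -> dict[str, int]:
    """Count bad answers by root cause so you know which layer to fix."""
    counts = Counter(labeled_failures)
    unknown = set(counts) - set(CAUSES)
    if unknown:
        raise ValueError(f"unrecognized causes: {unknown}")
    return {c: counts.get(c, 0) for c in CAUSES}
```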
Release gate
For production RAG, I like a gate like this:
- no high-severity regressions on golden set
- faithfulness above threshold
- context recall above threshold for answerable questions
- “not found” behavior passes
- human eval sample accepted
- no permission-boundary failures
- latency and cost inside budget
If the system fails, do not just tweak the prompt. Inspect the layer that failed.
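The gate above is just a checklist, and it helps to encode it as one so a failed release names which check failed rather than printing a single score. Every threshold below is a placeholder you must set from your own calibration data.

```python
# A sketch of the release gate as an explicit checklist.
# All thresholds are placeholders, not recommendations.
def release_gate(metrics: dict[str, float],
                 high_severity_regressions: int,
                 permission_failures: int,
                 human_sample_accepted: bool) -> tuple[bool, list[str]]:
    """Return (ship?, names of failed checks)."""
    checks = {
        "no high-severity regressions": high_severity_regressions == 0,
        "faithfulness >= 0.90": metrics.get("faithfulness", 0) >= 0.90,
        "context recall >= 0.85": metrics.get("context_recall", 0) >= 0.85,
        "'not found' accuracy >= 0.95": metrics.get("not_found_accuracy", 0) >= 0.95,
        "human eval sample accepted": human_sample_accepted,
        "no permission-boundary failures": permission_failures == 0,
        "p95 latency within budget": metrics.get("p95_latency_s", 1e9) <= 3.0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return (len(failed) == 0), failed
```

The returned list of failed checks is the point: it tells you which layer to inspect instead of inviting another round of prompt tweaks.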
Sources worth reading
- RAGAS paper and RAGAS documentation for RAG-specific metrics.
- OpenAI Evals for evaluation harness patterns.
- LangSmith evaluation docs for dataset and evaluator workflows.
- LlamaIndex evaluation docs for retrieval and response evaluation patterns.
