Your RAG Demo Passed. Your RAG System Needs a Judge: RAGAS, Humans, and Evidence

RAG demos are forgiving. Production RAG is not.

A demo needs one good answer on stage. A production RAG system needs thousands of correct answers across stale documents, ambiguous questions, bad retrieval, missing permissions, partial context, and users who ask “same thing as yesterday” while changing the noun that matters.

RAG evaluation is hard because failures hide in different layers:

  • query understanding
  • retrieval recall
  • ranking precision
  • context assembly
  • permission filtering
  • answer faithfulness
  • citation quality
  • final usefulness

If you only grade the final answer, you do not know what broke. If you only grade retrieval, you do not know whether the model used the evidence. You need both automated evals and human judgment.

RAGAS is useful, not a replacement for judgment

RAGAS popularized a practical set of RAG metrics such as faithfulness, answer relevancy, context precision, and context recall. The value is not that any single metric is perfect. The value is that it forces the team to separate “did we retrieve the right evidence?” from “did the model answer using that evidence?”

Typical metric questions:

Metric area        Question it answers
Context recall     Did retrieval find the evidence needed?
Context precision  Was the retrieved context mostly useful?
Faithfulness       Is the answer grounded in supplied context?
Answer relevancy   Did the answer address the user's question?

These are excellent smoke detectors. They are not the fire department.
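For concreteness, here is a minimal sketch of running these four metrics with ragas, assuming its classic Dataset-based evaluate API. Column names such as "contexts" and "ground_truth" have shifted between ragas versions, so treat this as a sketch and check the release you pin:

    # Sketch: evaluate a small golden set with ragas' classic interface.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    eval_data = Dataset.from_dict({
        "question": ["What is our refund window?"],
        "answer": ["Refunds are accepted within 30 days of purchase."],
        "contexts": [["Policy v4: refunds accepted within 30 days of purchase."]],
        "ground_truth": ["30 days from purchase."],
    })

    result = evaluate(
        eval_data,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    # Per-metric aggregate scores; inspect per-row results before trusting them.
    print(result)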

LLM-as-judge metrics can be noisy. They can inherit model bias. They can miss product nuance. They can reward answer style over truth. Use them to find regressions and prioritize review, not to declare final truth without calibration.

[Figure: RAG evaluation layers. A RAG evaluation pipeline with retrieval, context, generation, citation, and human review layers. Evaluate the pipeline, not just the answer: Query (intent + scope) → Retrieval (recall, precision) → Context (chunk quality) → Answer (faithful, useful), checked by RAGAS metrics (fast regression signals), human evals (truth, policy, nuance), and production signals (corrections, retries, clicks).]
RAG evals are most useful when they localize failure, not just produce one score.

Build a golden dataset the boring way

Your eval set should include:

  • common user questions
  • rare but important questions
  • adversarial questions
  • permission-boundary questions
  • outdated-document traps
  • “answer not found” cases
  • ambiguous questions requiring clarification
  • multi-hop questions
  • citation-required questions

For each item, store (a sketch of such a record appears below):

  • question
  • expected answer or rubric
  • gold supporting documents
  • forbidden documents
  • tenant / role scope
  • freshness expectation
  • known failure mode
  • human severity label

The “answer not found” cases are critical. A RAG system that always answers is not helpful; it is confident in the way a broken smoke alarm is confident.
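Here is one way a golden record could look. The field names are hypothetical; adapt them to your own pipeline. Note the answerable flag, which marks the "answer not found" traps:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GoldenCase:
        # Hypothetical schema; field names are illustrative, not a standard.
        question: str
        expected_answer: Optional[str]      # None when a rubric is used instead
        rubric: Optional[str]
        gold_doc_ids: list[str]             # documents retrieval must surface
        forbidden_doc_ids: list[str]        # documents that must never appear
        tenant: str                         # tenant / role scope for permissions
        role: str
        max_source_age_days: Optional[int]  # freshness expectation
        known_failure_mode: Optional[str]   # e.g. "multi-hop", "stale policy"
        severity: str                       # "low" | "medium" | "high"
        answerable: bool = True             # False marks "answer not found" traps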

Automated metrics: fast but calibrated

Use RAGAS or a similar framework to run frequent automated evals:

  • every prompt change
  • every retrieval index change
  • every chunking change
  • every embedding model change
  • every reranker change
  • every generation model change

But calibrate the judge.

Take a sample of automated results and have humans label them. Compare:

  • where the judge agrees
  • where the judge is too generous
  • where the judge is too harsh
  • which domains are noisy
  • which metrics correlate with user outcomes

Do not hide this calibration. Put it in the dashboard.
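A minimal calibration sketch, assuming you have parallel judge and human pass/fail verdicts for the same sampled items, using scikit-learn's cohen_kappa_score for agreement:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical data loading elided: parallel verdicts on the same sample.
    judge_labels = ["pass", "fail", "pass", "pass", "fail", "pass"]
    human_labels = ["pass", "fail", "fail", "pass", "fail", "fail"]

    kappa = cohen_kappa_score(judge_labels, human_labels)

    # Directional disagreement: generous = judge passes what humans fail.
    too_generous = sum(j == "pass" and h == "fail"
                       for j, h in zip(judge_labels, human_labels))
    too_harsh = sum(j == "fail" and h == "pass"
                    for j, h in zip(judge_labels, human_labels))

    print(f"judge-human agreement (Cohen's kappa): {kappa:.2f}")
    print(f"too generous on {too_generous} items, too harsh on {too_harsh}")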

Human evals: expensive, indispensable

Humans are best at:

  • policy nuance
  • domain truth
  • subtle missing context
  • citation trustworthiness
  • answer usefulness
  • hallucination severity
  • whether the system should have refused or clarified

The rubric matters more than the rating scale.

Bad rubric:

Rate this answer 1-5.

Better rubric:

  • Is the answer supported by the cited context?
  • Did it answer the user's actual question?
  • Did it omit a critical caveat?
  • Would this answer be safe for a customer to act on?
  • If wrong, is the severity low, medium, or high?
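A sketch of how that rubric might become a structured review record, with hypothetical field names mirroring the questions above:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReviewRecord:
        # Hypothetical field names, one per rubric question above.
        case_id: str
        supported_by_cited_context: bool
        answers_actual_question: bool
        omits_critical_caveat: bool
        safe_to_act_on: bool
        severity_if_wrong: Optional[str] = None  # "low" | "medium" | "high"
        reviewer_notes: str = ""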
[Figure: RAG evaluation feedback loop. Automated RAG metrics and human evals feed into dataset updates, retriever tuning, and release gates. The eval loop should improve the system, not just grade it: automated evals (RAGAS + custom checks), human review (rubrics + severity), and prod signals (clicks, retries, edits) flow into the golden dataset (new cases, better labels) and the release gate (ship only if safe).]
A RAG eval program is a learning loop: logs become labels, labels become tests, tests become release gates.

Production dashboard

Track:

  • retrieval recall on gold docs
  • context precision
  • answer faithfulness
  • answer relevancy
  • citation coverage
  • unsupported claim rate
  • “no answer” correctness
  • human severity-weighted pass rate (defined below)
  • user correction rate
  • source freshness failures
  • permission-filter misses
  • latency and cost by query class
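The severity-weighted pass rate is worth pinning down. One sketch, with made-up weights you should tune to your own risk tolerance:

    # Hypothetical severity weights; tune to your own risk tolerance.
    SEVERITY_WEIGHTS = {"low": 1.0, "medium": 3.0, "high": 10.0}

    def severity_weighted_pass_rate(results: list[dict]) -> float:
        """results: [{"passed": bool, "severity": "low"|"medium"|"high"}, ...]"""
        total = sum(SEVERITY_WEIGHTS[r["severity"]] for r in results)
        passed = sum(SEVERITY_WEIGHTS[r["severity"]] for r in results if r["passed"])
        return passed / total if total else 0.0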

The most important metric is not a single score. It is the breakdown:

  • bad answer because retrieval failed
  • bad answer because context was noisy
  • bad answer because model ignored evidence
  • bad answer because policy was stale
  • bad answer because the question was ambiguous

That breakdown tells you what to fix.
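Producing the breakdown is mostly bookkeeping: tag each failure with the layer that caused it during triage, then count. A sketch with illustrative layer names:

    from collections import Counter

    # failure_layer tags are assigned during triage; names are illustrative.
    failures = [
        {"id": "q-101", "failure_layer": "retrieval_missed_gold_doc"},
        {"id": "q-214", "failure_layer": "noisy_context"},
        {"id": "q-307", "failure_layer": "model_ignored_evidence"},
        {"id": "q-318", "failure_layer": "retrieval_missed_gold_doc"},
        {"id": "q-402", "failure_layer": "ambiguous_question"},
    ]

    breakdown = Counter(f["failure_layer"] for f in failures)
    for layer, count in breakdown.most_common():
        print(f"{layer}: {count}")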

Release gate

For production RAG, I like a gate like this:

  • no high-severity regressions on golden set
  • faithfulness above threshold
  • context recall above threshold for answerable questions
  • “not found” behavior passes
  • human eval sample accepted
  • no permission-boundary failures
  • latency and cost inside budget

If the system fails, do not just tweak the prompt. Inspect the layer that failed.
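A gate like that is easy to encode. Here is a sketch; the report field names and thresholds are illustrative, not prescriptive:

    def release_gate(report: dict) -> tuple[bool, list[str]]:
        """Return (ok_to_ship, reasons). Thresholds are illustrative."""
        reasons = []
        if report["high_severity_regressions"] > 0:
            reasons.append("high-severity regressions on golden set")
        if report["faithfulness"] < 0.90:
            reasons.append("faithfulness below threshold")
        if report["context_recall_answerable"] < 0.85:
            reasons.append("context recall below threshold on answerable questions")
        if report["not_found_pass_rate"] < 0.95:
            reasons.append("'answer not found' behavior failing")
        if not report["human_sample_accepted"]:
            reasons.append("human eval sample rejected")
        if report["permission_boundary_failures"] > 0:
            reasons.append("permission-boundary failures")
        if report["p95_latency_ms"] > report["latency_budget_ms"]:
            reasons.append("latency over budget")
        if report["cost_per_query_usd"] > report["cost_budget_usd"]:
            reasons.append("cost over budget")
        return (not reasons, reasons)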
