Your RAG Demo Passed. Your RAG System Needs a Judge: RAGAS, Humans, and Evidence

RAG demos are forgiving. Production RAG is not.

A demo needs one good answer on stage. A production RAG system needs thousands of correct answers across stale documents, ambiguous questions, bad retrieval, missing permissions, partial context, and users who ask “same thing as yesterday” while changing the noun that matters.

RAG evaluation is hard because failures hide in different layers:

  • query understanding
  • retrieval recall
  • ranking precision
  • context assembly
  • permission filtering
  • answer faithfulness
  • citation quality
  • final usefulness

If you only grade the final answer, you do not know what broke. If you only grade retrieval, you do not know whether the model used the evidence. You need both automated evals and human judgment.

RAGAS is useful, not a replacement for judgment

RAGAS popularized a practical set of RAG metrics such as faithfulness, answer relevancy, context precision, and context recall. The value is not that any single metric is perfect. The value is that it forces the team to separate “did we retrieve the right evidence?” from “did the model answer using that evidence?”

Typical metric questions:

Metric area        Question it answers
Context recall     Did retrieval find the evidence needed?
Context precision  Was the retrieved context mostly useful?
Faithfulness       Is the answer grounded in supplied context?
Answer relevancy   Did the answer address the user's question?

These are excellent smoke detectors. They are not the fire department.
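For concreteness, here is a minimal sketch of running these four metrics with ragas, assuming its classic Dataset-based evaluate API. Column names such as "contexts" and "ground_truth" have shifted between ragas versions, so treat this as a sketch and check the release you pin:

    # Sketch: evaluate a small golden set with ragas' classic interface.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_relevancy,
        context_precision,
        context_recall,
        faithfulness,
    )

    eval_data = Dataset.from_dict({
        "question": ["What is our refund window?"],
        "answer": ["Refunds are accepted within 30 days of purchase."],
        "contexts": [["Policy v4: refunds accepted within 30 days of purchase."]],
        "ground_truth": ["30 days from purchase."],
    })

    result = evaluate(
        eval_data,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    # Per-metric aggregate scores; inspect per-row results before trusting them.
    print(result)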

LLM-as-judge metrics can be noisy. They can inherit model bias. They can miss product nuance. They can reward answer style over truth. Use them to find regressions and prioritize review, not to declare final truth without calibration.

[Figure: RAG evaluation layers. A RAG evaluation pipeline with retrieval, context, generation, citation, and human review layers. Evaluate the pipeline, not just the answer: Query (intent + scope) → Retrieval (recall, precision) → Context (chunk quality) → Answer (faithful, useful), checked by RAGAS metrics (fast regression signals), human evals (truth, policy, nuance), and production signals (corrections, retries, clicks).]
RAG evals are most useful when they localize failure, not just produce one score.

Build a golden dataset the boring way

Your eval set should include:

  • common user questions
  • rare but important questions
  • adversarial questions
  • permission-boundary questions
  • outdated-document traps
  • “answer not found” cases
  • ambiguous questions requiring clarification
  • multi-hop questions
  • citation-required questions

For each item, store (a sketch of such a record appears below):

  • question
  • expected answer or rubric
  • gold supporting documents
  • forbidden documents
  • tenant / role scope
  • freshness expectation
  • known failure mode
  • human severity label

The “answer not found” cases are critical. A RAG system that always answers is not helpful; it is confident in the way a broken smoke alarm is confident.
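Here is one way a golden record could look. The field names are hypothetical; adapt them to your own pipeline. Note the answerable flag, which marks the "answer not found" traps:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GoldenCase:
        # Hypothetical schema; field names are illustrative, not a standard.
        question: str
        expected_answer: Optional[str]      # None when a rubric is used instead
        rubric: Optional[str]
        gold_doc_ids: list[str]             # documents retrieval must surface
        forbidden_doc_ids: list[str]        # documents that must never appear
        tenant: str                         # tenant / role scope for permissions
        role: str
        max_source_age_days: Optional[int]  # freshness expectation
        known_failure_mode: Optional[str]   # e.g. "multi-hop", "stale policy"
        severity: str                       # "low" | "medium" | "high"
        answerable: bool = True             # False marks "answer not found" traps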

Automated metrics: fast but calibrated

Use RAGAS or a similar framework to run frequent automated evals:

  • every prompt change
  • every retrieval index change
  • every chunking change
  • every embedding model change
  • every reranker change
  • every generation model change

But calibrate the judge.

Take a sample of automated results and have humans label them. Compare:

  • where the judge agrees
  • where the judge is too generous
  • where the judge is too harsh
  • which domains are noisy
  • which metrics correlate with user outcomes

Do not hide this calibration. Put it in the dashboard.
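A minimal calibration sketch, assuming you have parallel judge and human pass/fail verdicts for the same sampled items, using scikit-learn's cohen_kappa_score for agreement:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical data loading elided: parallel verdicts on the same sample.
    judge_labels = ["pass", "fail", "pass", "pass", "fail", "pass"]
    human_labels = ["pass", "fail", "fail", "pass", "fail", "fail"]

    kappa = cohen_kappa_score(judge_labels, human_labels)

    # Directional disagreement: generous = judge passes what humans fail.
    too_generous = sum(j == "pass" and h == "fail"
                       for j, h in zip(judge_labels, human_labels))
    too_harsh = sum(j == "fail" and h == "pass"
                    for j, h in zip(judge_labels, human_labels))

    print(f"judge-human agreement (Cohen's kappa): {kappa:.2f}")
    print(f"too generous on {too_generous} items, too harsh on {too_harsh}")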

Human evals: expensive, indispensable

Humans are best at:

  • policy nuance
  • domain truth
  • subtle missing context
  • citation trustworthiness
  • answer usefulness
  • hallucination severity
  • whether the system should have refused or clarified

The rubric matters more than the rating scale.

Bad rubric:

Rate this answer 1-5.

Better rubric:

  • Is the answer supported by the cited context?
  • Did it answer the user's actual question?
  • Did it omit a critical caveat?
  • Would this answer be safe for a customer to act on?
  • If wrong, is the severity low, medium, or high?
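A sketch of how that rubric might become a structured review record, with hypothetical field names mirroring the questions above:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ReviewRecord:
        # Hypothetical field names, one per rubric question above.
        case_id: str
        supported_by_cited_context: bool
        answers_actual_question: bool
        omits_critical_caveat: bool
        safe_to_act_on: bool
        severity_if_wrong: Optional[str] = None  # "low" | "medium" | "high"
        reviewer_notes: str = ""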
[Figure: RAG evaluation feedback loop. Automated RAG metrics and human evals feed into dataset updates, retriever tuning, and release gates. The eval loop should improve the system, not just grade it: automated evals (RAGAS + custom checks), human review (rubrics + severity), and prod signals (clicks, retries, edits) flow into the golden dataset (new cases, better labels) and the release gate (ship only if safe).]
A RAG eval program is a learning loop: logs become labels, labels become tests, tests become release gates.

Production dashboard

Track:

  • retrieval recall on gold docs
  • context precision
  • answer faithfulness
  • answer relevancy
  • citation coverage
  • unsupported claim rate
  • “no answer” correctness
  • human severity-weighted pass rate (defined below)
  • user correction rate
  • source freshness failures
  • permission-filter misses
  • latency and cost by query class
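The severity-weighted pass rate is worth pinning down. One sketch, with made-up weights you should tune to your own risk tolerance:

    # Hypothetical severity weights; tune to your own risk tolerance.
    SEVERITY_WEIGHTS = {"low": 1.0, "medium": 3.0, "high": 10.0}

    def severity_weighted_pass_rate(results: list[dict]) -> float:
        """results: [{"passed": bool, "severity": "low"|"medium"|"high"}, ...]"""
        total = sum(SEVERITY_WEIGHTS[r["severity"]] for r in results)
        passed = sum(SEVERITY_WEIGHTS[r["severity"]] for r in results if r["passed"])
        return passed / total if total else 0.0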

The most important metric is not a single score. It is the breakdown:

  • bad answer because retrieval failed
  • bad answer because context was noisy
  • bad answer because model ignored evidence
  • bad answer because policy was stale
  • bad answer because the question was ambiguous

That breakdown tells you what to fix.
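Producing the breakdown is mostly bookkeeping: tag each failure with the layer that caused it during triage, then count. A sketch with illustrative layer names:

    from collections import Counter

    # failure_layer tags are assigned during triage; names are illustrative.
    failures = [
        {"id": "q-101", "failure_layer": "retrieval_missed_gold_doc"},
        {"id": "q-214", "failure_layer": "noisy_context"},
        {"id": "q-307", "failure_layer": "model_ignored_evidence"},
        {"id": "q-318", "failure_layer": "retrieval_missed_gold_doc"},
        {"id": "q-402", "failure_layer": "ambiguous_question"},
    ]

    breakdown = Counter(f["failure_layer"] for f in failures)
    for layer, count in breakdown.most_common():
        print(f"{layer}: {count}")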

Release gate

For production RAG, I like a gate like this:

  • no high-severity regressions on golden set
  • faithfulness above threshold
  • context recall above threshold for answerable questions
  • “not found” behavior passes
  • human eval sample accepted
  • no permission-boundary failures
  • latency and cost inside budget

If the system fails, do not just tweak the prompt. Inspect the layer that failed.
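A gate like that is easy to encode. Here is a sketch; the report field names and thresholds are illustrative, not prescriptive:

    def release_gate(report: dict) -> tuple[bool, list[str]]:
        """Return (ok_to_ship, reasons). Thresholds are illustrative."""
        reasons = []
        if report["high_severity_regressions"] > 0:
            reasons.append("high-severity regressions on golden set")
        if report["faithfulness"] < 0.90:
            reasons.append("faithfulness below threshold")
        if report["context_recall_answerable"] < 0.85:
            reasons.append("context recall below threshold on answerable questions")
        if report["not_found_pass_rate"] < 0.95:
            reasons.append("'answer not found' behavior failing")
        if not report["human_sample_accepted"]:
            reasons.append("human eval sample rejected")
        if report["permission_boundary_failures"] > 0:
            reasons.append("permission-boundary failures")
        if report["p95_latency_ms"] > report["latency_budget_ms"]:
            reasons.append("latency over budget")
        if report["cost_per_query_usd"] > report["cost_budget_usd"]:
            reasons.append("cost over budget")
        return (not reasons, reasons)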
