Production LLM Systems Tutorial 8: Human-in-the-Loop Workflows


Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

Human-in-the-loop is not a button that says “send to human.”

It is a workflow system. It needs routing, queues, ownership, context, SLAs, review tools, feedback capture, and a way to turn corrections into better software. Without that, human review becomes a slow inbox attached to a fast model.

This tutorial designs a real HITL workflow.

Human review is a workflow with routing, ownership, SLA, decisions, and feedback loops.

Step 1: Decide when humans enter

Use humans for risk, uncertainty, and accountability.

Common routing triggers:

Trigger           Example
Low confidence    Model or evaluator reports insufficient evidence
High stakes       Medical, legal, financial, safety, employment, security
Policy boundary   User requests something near restricted behavior
Tool risk         Action changes money, permissions, data, or external state
User escalation   User explicitly asks for a person
Quality failure   Judge score or user feedback falls below threshold

Do not route every hard question to a human. Route the questions where human judgment changes the outcome.
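The routing triggers above can be sketched as a small router. This is a minimal illustration, not a production policy engine; the `Request` fields, threshold values, and route names are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Domains the tutorial lists as high stakes
HIGH_STAKES_DOMAINS = {"medical", "legal", "financial", "safety", "employment", "security"}

@dataclass
class Request:
    domain: str
    confidence: float          # combined system confidence, 0..1
    tool_changes_state: bool   # action touches money, permissions, data, external state
    user_asked_for_human: bool

def route(req: Request) -> str:
    """Route to a human only where human judgment changes the outcome."""
    if req.user_asked_for_human:
        return "human"                      # explicit user escalation
    if req.tool_changes_state:
        return "human_approval"             # tool risk: approve before execution
    if req.domain in HIGH_STAKES_DOMAINS and req.confidence < 0.8:
        return "human"                      # high stakes plus uncertainty
    if req.confidence < 0.5:
        return "human"                      # low confidence alone
    return "auto"
```

Note that a hard question in a low-risk domain with decent confidence still routes to `auto`, which is the point of the rule above.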

Step 2: Define confidence as a system signal

Self-reported model confidence is weak by itself. Combine signals:

confidence =
  retrieval_score
  + rerank_margin
  + answer_groundedness
  + tool_success
  + policy_classifier_score
  + historical route quality

Example:

Signal                Value        Interpretation
top rerank score      0.83         Good
score gap top1-top2   0.04         Ambiguous
groundedness judge    0.61         Weak
user risk level       high         Needs care
tool status           unavailable  Cannot verify

This should route to a human or produce a limited answer.
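One way to combine the signals above is a weighted sum. The weights below are illustrative assumptions, not recommended values; in practice you would tune them against historical route quality.

```python
def combined_confidence(signals: dict) -> float:
    """Weighted combination of system signals into one 0..1 confidence score."""
    weights = {
        "retrieval_score": 0.25,
        "rerank_margin": 0.15,
        "answer_groundedness": 0.30,
        "tool_success": 0.15,
        "policy_classifier_score": 0.10,
        "historical_route_quality": 0.05,
    }
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())

# Values from the example table; the last two are assumed for illustration
signals = {
    "retrieval_score": 0.83,
    "rerank_margin": 0.04,
    "answer_groundedness": 0.61,
    "tool_success": 0.0,            # tool unavailable, cannot verify
    "policy_classifier_score": 0.5,
    "historical_route_quality": 0.7,
}
score = combined_confidence(signals)   # about 0.48: below a plausible auto-answer threshold
```

With a score near 0.48 and a high user risk level, the system escalates or returns a limited answer, matching the conclusion above.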

Step 3: Build an escalation object

When the system escalates, create a structured task:

{
  "case_id": "case_123",
  "tenant_id": "tenant_a",
  "user_question": "...",
  "model_draft": "...",
  "evidence": [
    {"document_id": "policy_7", "quote": "...", "score": 0.91}
  ],
  "tool_results": [],
  "risk_level": "high",
  "reason_for_escalation": "low groundedness and high-risk domain",
  "sla_deadline": "2026-05-09T18:00:00Z"
}

Do not ask reviewers to reconstruct the trace from logs. Give them the draft, evidence, policy, and reason.

Escalation tasks should carry enough structured context for fast review and useful correction capture.
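A small builder can guarantee every escalation carries the same structured context. The field names follow the JSON example above; the `sla_hours` default and the builder itself are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def build_escalation(case_id: str, tenant_id: str, question: str, draft: str,
                     evidence: list, risk_level: str, reason: str,
                     sla_hours: int = 4) -> dict:
    """Create a structured escalation task a reviewer can act on directly."""
    deadline = datetime.now(timezone.utc) + timedelta(hours=sla_hours)
    return {
        "case_id": case_id,
        "tenant_id": tenant_id,
        "user_question": question,
        "model_draft": draft,
        "evidence": evidence,           # list of {document_id, quote, score}
        "tool_results": [],
        "risk_level": risk_level,
        "reason_for_escalation": reason,
        "sla_deadline": deadline.isoformat(timespec="seconds"),
    }
```

Because the task carries the draft, evidence, and reason, the reviewer never has to reconstruct the trace from logs.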

Step 4: Design queues by skill and SLA

One queue is not enough.

Route by:

  • domain expertise
  • language
  • tenant
  • risk level
  • SLA
  • required approval authority

Example queues:

Queue                  SLA             Reviewer
Low-risk correction    24h             Content ops
Billing dispute        4h              Support specialist
Legal policy question  1 business day  Legal reviewer
Security action        30m             Security operator

The model should not decide final approval authority. The workflow system should.
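Queue selection can live in the workflow system as plain, auditable rules. The queue names mirror the table above; the case fields and rule order are assumptions for this sketch.

```python
# Queue registry keyed by skill; SLAs mirror the table above
QUEUES = {
    "security_action":     {"sla_minutes": 30,   "reviewer": "security operator"},
    "billing_dispute":     {"sla_minutes": 240,  "reviewer": "support specialist"},
    "legal_policy":        {"sla_minutes": 480,  "reviewer": "legal reviewer"},  # ~1 business day
    "low_risk_correction": {"sla_minutes": 1440, "reviewer": "content ops"},
}

def pick_queue(case: dict) -> str:
    """Deterministic queue routing by domain, risk, and action type."""
    if case.get("domain") == "security" and case.get("is_action"):
        return "security_action"
    if case.get("domain") == "billing":
        return "billing_dispute"
    if case.get("domain") == "legal":
        return "legal_policy"
    return "low_risk_correction"
```

The model never picks the queue; it only supplies the case fields, and the workflow rules decide who holds approval authority.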

Step 5: Capture corrections as training data

A good review UI captures:

  • accepted answer
  • edited answer
  • rejected answer
  • missing evidence
  • incorrect evidence
  • better retrieval query
  • policy reason
  • reviewer notes

Convert that into datasets:

reviewed_cases
  -> golden eval examples
  -> retrieval hard negatives
  -> prompt regression tests
  -> fine-tuning candidates
  -> policy classifier examples

Do not fine-tune on every correction blindly. Some corrections reflect temporary policy, bad retrieval, or one-off user context. Label first.
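"Label first" can be enforced in code: each reviewed case carries a label, and the label decides which datasets it may feed. The label values and target names here are illustrative assumptions.

```python
def dataset_targets(case: dict) -> list:
    """Map a labeled correction to the datasets it should feed, if any."""
    label = case["label"]
    if label == "wrong_answer_content":
        return ["golden_eval", "fine_tuning_candidate"]
    if label == "bad_retrieval":
        # The reviewer's better query becomes a hard-negative mining example
        return ["retrieval_hard_negative", "golden_eval"]
    if label == "policy_violation":
        return ["policy_classifier_example", "prompt_regression_test"]
    if label in ("temporary_policy", "one_off_user_context"):
        return []   # explicitly excluded from training, per the warning above
    return []
```

The important branch is the last one: temporary policy and one-off context produce no training data at all.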

Step 6: Use approval gates for actions

For high-impact tools, use approval before execution:

model proposes action
  -> policy engine classifies impact
  -> dry-run tool produces preview
  -> human approves or edits
  -> system executes with idempotency key
  -> audit log records approver

Example:

{
  "proposed_action": "issue_refund",
  "amount": 4200,
  "currency": "USD",
  "customer_id": "cust_981",
  "policy_basis": "enterprise exception",
  "requires_approval": true
}

The human approves a structured action, not a paragraph.

Step 7: Measure the workflow

Track:

  • escalation rate
  • approval rate
  • rejection rate
  • average handling time
  • SLA miss rate
  • model draft acceptance rate
  • reviewer edit distance
  • repeat escalation reasons
  • downstream user satisfaction

If escalation rate rises after a release, something changed. It may be model quality, retrieval quality, policy strictness, or traffic mix.
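A few of these metrics can be computed directly from resolved cases. The field names on each case are assumptions; the division guards matter because early releases may have zero escalations.

```python
def workflow_metrics(cases: list) -> dict:
    """Compute escalation, approval, and SLA-miss rates from resolved cases."""
    total = len(cases)
    escalated = [c for c in cases if c.get("escalated")]
    n_esc = max(len(escalated), 1)   # avoid division by zero when nothing escalated
    return {
        "escalation_rate": len(escalated) / total if total else 0.0,
        "approval_rate": sum(bool(c.get("approved")) for c in escalated) / n_esc,
        "sla_miss_rate": sum(bool(c.get("sla_missed")) for c in escalated) / n_esc,
    }
```

Tracking these per release makes the closing point above actionable: a jump in escalation rate after a deploy is a signal to diff model, retrieval, policy, and traffic mix.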

Sources and receipts