Production LLM Systems Tutorial 1: End-to-End Application Design

Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

Most LLM applications fail in the gaps between components.

The model may be excellent. The prompt may look clean. The demo may answer one question beautifully. Then production arrives: users refresh, requests retry, providers throttle, tools time out, RAG returns stale evidence, streaming breaks halfway through a response, and your token bill records every mistake.

This tutorial builds the system shape that survives those conditions.

[Figure: end-to-end production LLM application architecture with client, gateway, orchestrator, retrieval, tools, LLM gateway, inference, streaming, and telemetry]
A production LLM application is a request path with state, policy, retrieval, tools, model routing, and telemetry around the model call.

The target architecture

Start with the full request path:

client
  -> API gateway
       auth, rate limits, tenant identity, idempotency key
  -> conversation service
       session state, history compaction, policy context
  -> orchestrator
       prompt assembly, retrieval, tool planning, model routing
  -> retrieval layer
       query rewrite, hybrid search, rerank, context packing
  -> tool layer
       schemas, permissions, execution, retries
  -> LLM gateway
       provider abstraction, fallback, cost attribution
  -> inference endpoint
hosted API or self-hosted model
  -> response stream
       SSE or WebSocket, partial events, final accounting
  -> telemetry
       traces, quality signals, cost, feedback

Do not draw “app -> LLM -> answer.” That hides every decision that matters.
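
One way to keep those decisions visible is to carry a single request context object through every hop, so tenant identity, idempotency, and routing metadata never get lost between layers. A minimal sketch in Python; the field names here are illustrative, not a fixed schema:

from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class RequestContext:
    # Identity and policy travel with the request so every layer can enforce them.
    tenant_id: str
    feature: str
    request_id: str = field(default_factory=lambda: f"req_{uuid4().hex[:8]}")
    idempotency_key: str | None = None
    prompt_version: str | None = None
    model_route: str | None = None

ctx = RequestContext(tenant_id="acme", feature="support_chat", idempotency_key="retry-42")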

Step 1: Choose the interaction contract

For chat-like applications, streaming is usually the default. The user feels time to first token before they feel total latency. A 700 ms TTFT with a 6 second full answer often feels better than a silent 3 second request-response call.

Use Server-Sent Events when the server mostly streams tokens and status events to the browser. Use WebSocket when the client also needs low-latency bidirectional events, such as collaborative editing, voice turn-taking, or live tool state.

A good stream emits more than tokens:

{"type":"message.started","request_id":"req_123"}
{"type":"retrieval.completed","documents":5}
{"type":"output_text.delta","text":"The"}
{"type":"output_text.delta","text":" answer"}
{"type":"usage.completed","input_tokens":1840,"output_tokens":322}
{"type":"message.completed"}

That event model gives the UI progress, gives observability clean spans, and gives clients a reliable way to recover from partial output.
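
On the wire, each typed event maps cleanly onto one Server-Sent Events frame. A minimal sketch of the server-side framing in Python, assuming the event payloads above (the sse_frame helper is illustrative, not a library function):

import json

def sse_frame(event: dict) -> str:
    # One SSE frame per typed event: the event name lets clients route handlers,
    # the data line carries the JSON payload, and the blank line ends the frame.
    return f"event: {event['type']}\ndata: {json.dumps(event)}\n\n"

for event in [
    {"type": "message.started", "request_id": "req_123"},
    {"type": "output_text.delta", "text": "The"},
    {"type": "message.completed"},
]:
    print(sse_frame(event), end="")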

[Figure: typed streaming event contract showing started, retrieval, token delta, tool, usage, and terminal events]
Streaming should expose typed lifecycle events, not just raw token chunks.

Step 2: Treat conversation state as product data

There are three common state models:

Model | What it means | Use it when
Client-held history | The browser sends prior messages each turn | Low-risk consumer apps, simple prototypes
Server session store | Backend stores messages and sends a compact window to the model | Most production apps
Retrieved memory | Past turns are embedded and retrieved like documents | Long-lived assistants and workflow systems

Do not blindly replay the full conversation. Context windows are larger now, but cost and attention quality still matter. Use a sliding window for recent turns, a summary for older turns, and retrieval for specific prior facts.
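
A sketch of that compaction policy, assuming a running summary is maintained elsewhere (the function name, window size, and message shape are illustrative):

def build_history(messages: list[dict], summary: str, window: int = 6) -> list[dict]:
    # Keep the most recent turns verbatim and replace everything older with a
    # single summary message, so cost stays bounded as the conversation grows.
    older, recent = messages[:-window], messages[-window:]
    packed = []
    if older and summary:
        packed.append({"role": "system", "content": f"Summary of earlier turns: {summary}"})
    packed.extend(recent)
    return packed

Retrieval of specific prior facts would then be layered on top of this window rather than replayed inline.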

The dangerous part is tenant separation. Every session, prompt template, retrieved document, tool call, and cache entry needs a tenant boundary. If you share a semantic cache across tenants without strict namespacing, you have built a data leak with a nice latency profile.
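
One way to make that boundary hard to violate is to put the tenant id inside every cache key, so a lookup can never cross tenants even when two queries are identical. A minimal sketch (the key scheme is illustrative; a semantic cache would namespace its vector index the same way):

import hashlib

def cache_key(tenant_id: str, prompt_version: str, normalized_query: str) -> str:
    # The tenant id is part of the key itself, so an entry written for one tenant
    # can never be served to another.
    digest = hashlib.sha256(normalized_query.encode()).hexdigest()[:16]
    return f"{tenant_id}:{prompt_version}:{digest}"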

Step 3: Make the orchestrator explicit

The orchestrator is where the application becomes more than a prompt wrapper. It owns:

  • prompt version selection
  • model routing
  • retrieval decisions
  • tool-call permissions
  • context budget allocation
  • retries and fallbacks
  • output validation
  • trace correlation

Keep it boring. A deterministic state machine is easier to debug than a pile of hidden framework callbacks. For many systems, this is enough:

classify request
  -> decide route
  -> retrieve evidence if needed
  -> assemble prompt
  -> call model
  -> execute approved tool calls
  -> validate output
  -> stream final response
  -> record trace and cost
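
As plain code, that flow is just a sequence of explicit steps with no hidden callbacks. A runnable sketch with stand-ins for the real retrieval, model, and tool layers (every name here is illustrative):

from dataclasses import dataclass

@dataclass
class Route:
    model: str
    needs_retrieval: bool
    allowed_tools: tuple[str, ...] = ()

def classify(message: str) -> Route:
    # Stand-in for a cheap classifier or rules engine that picks the route.
    needs_docs = "policy" in message.lower()
    return Route(model="large-model" if needs_docs else "small-model", needs_retrieval=needs_docs)

def handle_request(message: str) -> str:
    route = classify(message)                                          # decide route
    evidence = ["<retrieved chunk>"] if route.needs_retrieval else []  # retrieval layer stand-in
    prompt = f"Evidence: {evidence}\nUser: {message}"                  # prompt assembly
    draft = f"[{route.model} response to {len(prompt)} prompt chars]"  # model call via the LLM gateway
    # Tool execution, output validation, streaming, and trace recording would follow here.
    return draft

print(handle_request("What is the refund policy?"))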

Agents can be useful, but do not make every request an agent loop. A password reset question, a refund policy lookup, and a code migration task do not need the same control flow.

Step 4: Put a gateway between your app and model providers

An LLM gateway pays for itself quickly. It should handle:

  • provider credentials
  • routing by model, tenant, feature, and region
  • retries with backoff
  • fallback models
  • timeout budgets
  • token accounting
  • per-tenant cost attribution
  • audit logs

Do not let every service call model providers directly. That creates duplicate retry logic, inconsistent safety policy, and impossible cost attribution.

The gateway also owns idempotency. Token-charging APIs make duplicate retries expensive. If the client retries the same request after a timeout, the backend should be able to return the prior result or resume the stream instead of generating twice.
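
A minimal sketch of that idempotent call path with a fallback chain, assuming an in-memory result store and a provider-call function supplied by the caller (all names are illustrative):

from typing import Callable

completed: dict[str, str] = {}  # idempotency key -> prior result; use a shared store in production

def gateway_call(idempotency_key: str, prompt: str, models: list[str],
                 call_provider: Callable[[str, str], str]) -> str:
    # A retried request returns the stored result instead of paying for tokens twice.
    if idempotency_key in completed:
        return completed[idempotency_key]
    last_error: Exception | None = None
    for model in models:  # primary model first, then fallbacks
        try:
            result = call_provider(model, prompt)
            completed[idempotency_key] = result
            return result
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError("all providers failed") from last_error

print(gateway_call("retry-42", "Hello", ["primary", "backup"], lambda model, prompt: f"{model}: ok"))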

Step 5: Add graceful degradation

There are four failure modes worth designing for on day one:

Failure | Good degradation
Primary model unavailable | Route to a smaller or alternate provider model
Retrieval unavailable | Answer only if the task is safe without retrieval, otherwise explain the limitation
Tool unavailable | Continue with a partial answer only if the tool is optional
Streaming interrupted | Let the client reconnect with request id and recover final state

Graceful degradation is not fake confidence. If the answer needs fresh retrieved evidence, say the system cannot complete the task rather than producing a polished guess.
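
Keeping those rules in one explicit decision function, rather than in try/except blocks scattered through the orchestrator, makes the degradation behavior reviewable. A sketch following the table above (the failure and action labels are illustrative):

def degrade(failure: str, needs_fresh_evidence: bool) -> str:
    # Map each failure mode to one explicit action instead of guessing at call sites.
    if failure == "primary_model":
        return "route_to_fallback_model"
    if failure == "retrieval":
        return "refuse_with_explanation" if needs_fresh_evidence else "answer_without_retrieval"
    if failure == "tool":
        return "partial_answer_if_tool_optional"
    if failure == "stream":
        return "client_reconnect_with_request_id"
    return "fail_closed"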

Step 6: Define the minimum telemetry contract

Every request should produce a trace with:

  • tenant id and feature name
  • prompt version
  • model route and provider
  • retrieval query, document ids, and rerank scores
  • tool calls, arguments, status, and latency
  • TTFT, TPOT, total latency
  • input tokens, output tokens, cached tokens
  • final status and user feedback

Redact sensitive fields before storage. Observability that stores raw private prompts without a policy will eventually become the incident.
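
A sketch of a per-request trace record with redaction applied before storage, following the fields above (the redaction rule is deliberately simplistic and illustrative):

import re
from dataclasses import dataclass, asdict

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    # Strip obvious PII before the trace is persisted; real policies go further.
    return EMAIL.sub("[redacted-email]", text)

@dataclass
class TraceRecord:
    tenant_id: str
    feature: str
    prompt_version: str
    model_route: str
    retrieval_doc_ids: list[str]
    ttft_ms: int
    total_ms: int
    input_tokens: int
    output_tokens: int
    final_status: str
    user_prompt: str

def to_storage(record: TraceRecord) -> dict:
    row = asdict(record)
    row["user_prompt"] = redact(row["user_prompt"])
    return row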

A practical build order

Build in this order:

  1. Gateway with auth, rate limits, and request ids.
  2. Basic streaming model call.
  3. Conversation session store.
  4. Prompt registry with versioned templates.
  5. Retrieval path.
  6. Tool execution path.
  7. Tracing and cost attribution.
  8. Fallback model chain.
  9. Eval set and release gate.
  10. Human escalation for low-confidence or high-risk outputs.

The order matters. If you add retrieval and tools before request identity, tracing, and idempotency, debugging becomes guesswork.

Sources and receipts