Production LLM Systems Tutorial 1: End-to-End Application Design
Tutorial Series
- End-to-End Application Design
- Latency, Cost, and Quality
- Scalable Inference Architecture
- RAG and Data Pipelines
- Monitoring and Observability
- Evaluation and A/B Testing
- Security and Prompt Injection
- Human-in-the-Loop Workflows
- Cost Optimization
- Versioning and Disaster Recovery
Most LLM applications fail in the gaps between components.
The model may be excellent. The prompt may look clean. The demo may answer one question beautifully. Then production arrives: users refresh, requests retry, providers throttle, tools time out, RAG returns stale evidence, streaming breaks halfway through a response, and your token bill records every mistake.
This tutorial builds the system shape that survives those conditions.
The target architecture
Start with the full request path:
client
-> API gateway
auth, rate limits, tenant identity, idempotency key
-> conversation service
session state, history compaction, policy context
-> orchestrator
prompt assembly, retrieval, tool planning, model routing
-> retrieval layer
query rewrite, hybrid search, rerank, context packing
-> tool layer
schemas, permissions, execution, retries
-> LLM gateway
provider abstraction, fallback, cost attribution
-> inference endpoint
hosted API or self-served model
-> response stream
SSE or WebSocket, partial events, final accounting
-> telemetry
traces, quality signals, cost, feedback

Do not draw “app -> LLM -> answer.” That hides every decision that matters.
Step 1: Choose the interaction contract
For chat-like applications, streaming is usually the default. The user feels time to first token (TTFT) before they feel total latency. A 700 ms TTFT with a 6 second full answer often feels better than a silent 3 second request-response call.
Use Server-Sent Events when the server mostly streams tokens and status events to the browser. Use WebSocket when the client also needs low-latency bidirectional events, such as collaborative editing, voice turn-taking, or live tool state.
A good stream emits more than tokens:
{"type":"message.started","request_id":"req_123"}
{"type":"retrieval.completed","documents":5}
{"type":"output_text.delta","text":"The"}
{"type":"output_text.delta","text":" answer"}
{"type":"usage.completed","input_tokens":1840,"output_tokens":322}
{"type":"message.completed"}

That event model gives the UI progress, gives observability clean spans, and gives clients a reliable way to recover from partial output.
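As a minimal sketch, a client can fold these events into a single state object. The event names are taken from the example above; transport and reconnection logic are left out for brevity:

```python
import json

def apply_event(state: dict, raw: str) -> dict:
    """Fold one newline-delimited JSON stream event into client state."""
    event = json.loads(raw)
    kind = event["type"]
    if kind == "message.started":
        state = {"request_id": event["request_id"], "text": "", "done": False}
    elif kind == "output_text.delta":
        state["text"] += event["text"]       # append partial tokens as they arrive
    elif kind == "usage.completed":
        state["usage"] = {
            "input_tokens": event["input_tokens"],
            "output_tokens": event["output_tokens"],
        }
    elif kind == "message.completed":
        state["done"] = True                 # safe point to render the final answer
    return state

state: dict = {}
for line in [
    '{"type":"message.started","request_id":"req_123"}',
    '{"type":"output_text.delta","text":"The"}',
    '{"type":"output_text.delta","text":" answer"}',
    '{"type":"message.completed"}',
]:
    state = apply_event(state, line)
```

Because every event is typed, a reconnecting client can replay from the last event it processed instead of restarting the whole generation.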
Step 2: Treat conversation state as product data
There are three common state models:
| Model | What it means | Use it when |
|---|---|---|
| Client-held history | The browser sends prior messages each turn | Low-risk consumer apps, simple prototypes |
| Server session store | Backend stores messages and sends a compact window to the model | Most production apps |
| Retrieved memory | Past turns are embedded and retrieved like documents | Long-lived assistants and workflow systems |
Do not blindly replay the full conversation. Context windows are larger now, but cost and attention quality still matter. Use a sliding window for recent turns, a summary for older turns, and retrieval for specific prior facts.
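The sliding-window-plus-summary approach can be sketched as follows. `summarize` and `retrieve` are placeholders for your own summarizer and retriever, and the message shape is an assumption:

```python
def build_context(turns, summarize, retrieve, query, recent_n=6):
    """Assemble model context from three sources:
    a summary of older turns, retrieved prior facts, and a sliding
    window of the most recent turns."""
    older, recent = turns[:-recent_n], turns[-recent_n:]
    parts = []
    if older:
        parts.append({"role": "system", "content": "Summary: " + summarize(older)})
    for fact in retrieve(query):
        parts.append({"role": "system", "content": "Relevant prior fact: " + fact})
    parts.extend(recent)  # recent turns go in verbatim
    return parts

turns = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
context = build_context(
    turns,
    summarize=lambda old: f"{len(old)} earlier turns about account setup",
    retrieve=lambda q: ["the user is on the Pro tier"],
    query="billing",
)
```

The budget split (how many recent turns, how long a summary, how many retrieved facts) should come from the orchestrator's context budget allocation, not hard-coded constants.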
The dangerous part is tenant separation. Every session, prompt template, retrieved document, tool call, and cache entry needs a tenant boundary. If you share a semantic cache across tenants without strict namespacing, you have built a data leak with a nice latency profile.
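One simple guard: make the tenant id part of every cache key, so cross-tenant hits are structurally impossible. A sketch (the key layout is an assumption, not a standard):

```python
import hashlib

def cache_key(tenant_id: str, prompt_version: str, normalized_query: str) -> str:
    """Semantic-cache key namespaced by tenant and prompt version.
    Entries can never be served across tenants or across prompt changes."""
    digest = hashlib.sha256(normalized_query.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{prompt_version}:{digest}"
```

Including the prompt version also invalidates cached answers automatically when the template changes.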
Step 3: Make the orchestrator explicit
The orchestrator is where the application becomes more than a prompt wrapper. It owns:
- prompt version selection
- model routing
- retrieval decisions
- tool-call permissions
- context budget allocation
- retries and fallbacks
- output validation
- trace correlation
Keep it boring. A deterministic state machine is easier to debug than a pile of hidden framework callbacks. For many systems, this is enough:
classify request
-> decide route
-> retrieve evidence if needed
-> assemble prompt
-> call model
-> execute approved tool calls
-> validate output
-> stream final response
-> record trace and cost

Agents can be useful, but do not make every request an agent loop. A password reset question, a refund policy lookup, and a code migration task do not need the same control flow.
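The state machine above can be written as a plain sequence of function calls. This is a sketch with injected dependencies (tool execution is omitted for brevity; every name here is a placeholder for your own components):

```python
def handle_request(request, deps):
    """Deterministic orchestrator pipeline: each step is an explicit
    call whose result lands in the trace, not a hidden framework callback."""
    trace = {"request_id": request["id"], "steps": []}

    route = deps["classify"](request)                     # classify -> decide route
    trace["steps"].append(("route", route["model"]))

    evidence = deps["retrieve"](request) if route["needs_retrieval"] else []
    prompt = deps["assemble"](request, evidence)          # assemble prompt
    output = deps["call_model"](route["model"], prompt)   # call model

    if not deps["validate"](output):                      # validate output
        output = deps["fallback"](request)
    trace["steps"].append(("completed", True))
    return output, trace

deps = {
    "classify": lambda req: {"needs_retrieval": False, "model": "small"},
    "retrieve": lambda req: [],
    "assemble": lambda req, ev: req["text"],
    "call_model": lambda model, prompt: f"answer to {prompt}",
    "validate": lambda out: True,
    "fallback": lambda req: "escalate to a human",
}
output, trace = handle_request({"id": "req_1", "text": "hi"}, deps)
```

Because every transition is explicit, a failing request can be replayed step by step from its trace.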
Step 4: Put a gateway between your app and model providers
An LLM gateway pays for itself quickly. It should handle:
- provider credentials
- routing by model, tenant, feature, and region
- retries with backoff
- fallback models
- timeout budgets
- token accounting
- per-tenant cost attribution
- audit logs
Do not let every service call model providers directly. That creates duplicate retry logic, inconsistent safety policy, and impossible cost attribution.
The gateway also owns idempotency. Token-charging APIs make duplicate retries expensive. If the client retries the same request after a timeout, the backend should be able to return the prior result or resume the stream instead of generating twice.
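A minimal in-memory sketch of that behavior, assuming the client sends an idempotency key with every request (production would use Redis or a database with TTLs, and would also handle stream resumption):

```python
import threading

class IdempotencyStore:
    """Return a prior result for a repeated idempotency key
    instead of running the expensive generation twice."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def get_or_run(self, key, fn):
        """Returns (result, replayed). Best-effort: two truly concurrent
        first calls may both run fn; setdefault keeps exactly one result."""
        with self._lock:
            if key in self._results:
                return self._results[key], True   # replay, no second model call
        result = fn()
        with self._lock:
            self._results.setdefault(key, result)
            return self._results[key], False

store = IdempotencyStore()
calls = []
generate = lambda: calls.append(1) or "the expensive answer"
first, replayed_1 = store.get_or_run("req_123", generate)
second, replayed_2 = store.get_or_run("req_123", generate)
```

The key should come from the client, not the gateway, so that a client-side retry after a timeout maps to the same key as the original attempt.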
Step 5: Add graceful degradation
There are four failure modes worth designing for on day one:
| Failure | Good degradation |
|---|---|
| Primary model unavailable | Route to a smaller or alternate provider model |
| Retrieval unavailable | Answer only if the task is safe without retrieval, otherwise explain the limitation |
| Tool unavailable | Continue with a partial answer only if the tool is optional |
| Streaming interrupted | Let the client reconnect with request id and recover final state |
Graceful degradation is not fake confidence. If the answer needs fresh retrieved evidence, say the system cannot complete the task rather than producing a polished guess.
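The first row of the table, a fallback chain across models, can be sketched like this. `call` stands in for your provider client, and the model names are placeholders:

```python
def call_with_fallback(prompt, models, call):
    """Try each model in order; surface the full error list only
    when the whole chain fails."""
    errors = []
    for model in models:
        try:
            return call(model, prompt), model
        except Exception as exc:   # in practice, catch provider-specific errors
            errors.append((model, repr(exc)))
    raise RuntimeError(f"all models failed: {errors}")

def flaky_provider(model, prompt):
    if model == "primary-large":
        raise TimeoutError("provider timeout")
    return f"{model}: ok"

result, used_model = call_with_fallback(
    "hello", ["primary-large", "backup-small"], flaky_provider
)
```

Record which model actually answered in the trace; silent fallbacks make quality regressions impossible to diagnose.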
Step 6: Define the minimum telemetry contract
Every request should produce a trace with:
- tenant id and feature name
- prompt version
- model route and provider
- retrieval query, document ids, and rerank scores
- tool calls, arguments, status, and latency
- TTFT (time to first token), TPOT (time per output token), total latency
- input tokens, output tokens, cached tokens
- final status and user feedback
Redact sensitive fields before storage. Observability that stores raw private prompts without a policy will eventually become the incident.
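A sketch of such a trace record with redaction applied at the storage boundary. The field names follow the list above; which fields count as sensitive is a policy decision, and the single redacted field here is only an illustration:

```python
from dataclasses import dataclass, asdict

# Fields redacted before storage; extend per your data policy.
SENSITIVE_FIELDS = {"retrieval_query"}

@dataclass
class RequestTrace:
    tenant_id: str
    feature: str
    prompt_version: str
    model_route: str
    retrieval_query: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    ttft_ms: float = 0.0
    status: str = "ok"

    def for_storage(self) -> dict:
        """Redact sensitive fields before the record leaves the service."""
        record = asdict(self)
        for key in SENSITIVE_FIELDS:
            record[key] = "[REDACTED]"
        return record

trace = RequestTrace(
    tenant_id="tenant_a",
    feature="support_chat",
    prompt_version="v12",
    model_route="small-fast",
    retrieval_query="user asked about their invoice",
)
stored = trace.for_storage()
```

Redacting at serialization time, rather than in the collector, means the raw query never crosses a service boundary.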
A practical build order
Build in this order:
- Gateway with auth, rate limits, and request ids.
- Basic streaming model call.
- Conversation session store.
- Prompt registry with versioned templates.
- Retrieval path.
- Tool execution path.
- Tracing and cost attribution.
- Fallback model chain.
- Eval set and release gate.
- Human escalation for low-confidence or high-risk outputs.
The order matters. If you add retrieval and tools before request identity, tracing, and idempotency, debugging becomes guesswork.
Sources and receipts
- OpenAI, “Streaming API responses”: https://platform.openai.com/docs/guides/streaming-responses
- OpenAI API Reference, “Streaming events”: https://platform.openai.com/docs/api-reference/responses-streaming
- MDN, “Using server-sent events”: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events
- MDN, “WebSocket”: https://developer.mozilla.org/docs/Web/API/WebSocket
- AWS Well-Architected, “Make all responses idempotent”: https://docs.aws.amazon.com/wellarchitected/2023-04-10/framework/rel_prevent_interaction_failure_idempotent.html
