Production LLM Systems Tutorial 1: End-to-End Application Design
Tutorial Series
- End-to-End Application Design
- Latency, Cost, and Quality
- Scalable Inference Architecture
- RAG and Data Pipelines
- Monitoring and Observability
- Evaluation and A/B Testing
- Security and Prompt Injection
- Human-in-the-Loop Workflows
- Cost Optimization
- Versioning and Disaster Recovery
Most LLM applications fail in the gaps between components.
The model may be excellent. The prompt may look clean. The demo may answer one question beautifully. Then production arrives: users refresh, requests retry, providers throttle, tools time out, RAG returns stale evidence, streaming breaks halfway through a response, and your token bill records every mistake.
This tutorial builds the system shape that survives those conditions.
The target architecture
Start with the full request path:
client
-> API gateway
auth, rate limits, tenant identity, idempotency key
-> conversation service
session state, history compaction, policy context
-> orchestrator
prompt assembly, retrieval, tool planning, model routing
-> retrieval layer
query rewrite, hybrid search, rerank, context packing
-> tool layer
schemas, permissions, execution, retries
-> LLM gateway
provider abstraction, fallback, cost attribution
-> inference endpoint
hosted API or self-served model
-> response stream
SSE or WebSocket, partial events, final accounting
-> telemetry
traces, quality signals, cost, feedback

Do not draw “app -> LLM -> answer.” That hides every decision that matters.
Step 1: Choose the interaction contract
For chat-like applications, streaming is usually the default. The user feels time to first token (TTFT) before they feel total latency. A 700 ms TTFT with a 6 second full answer often feels better than a silent 3 second request-response call.
Use Server-Sent Events when the server mostly streams tokens and status events to the browser. Use WebSocket when the client also needs low-latency bidirectional events, such as collaborative editing, voice turn-taking, or live tool state.
A good stream emits more than tokens:
{"type":"message.started","request_id":"req_123"}
{"type":"retrieval.completed","documents":5}
{"type":"output_text.delta","text":"The"}
{"type":"output_text.delta","text":" answer"}
{"type":"usage.completed","input_tokens":1840,"output_tokens":322}
{"type":"message.completed"}

That event model gives the UI progress, gives observability clean spans, and gives clients a reliable way to recover from partial output.
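As a minimal sketch, a client can fold these events into a single state object. The event names are taken from the example above; transport and reconnection logic are left out for brevity:

```python
import json

def apply_event(state: dict, raw: str) -> dict:
    """Fold one newline-delimited JSON stream event into client state."""
    event = json.loads(raw)
    kind = event["type"]
    if kind == "message.started":
        state = {"request_id": event["request_id"], "text": "", "done": False}
    elif kind == "output_text.delta":
        state["text"] += event["text"]       # append partial tokens as they arrive
    elif kind == "usage.completed":
        state["usage"] = {
            "input_tokens": event["input_tokens"],
            "output_tokens": event["output_tokens"],
        }
    elif kind == "message.completed":
        state["done"] = True                 # safe point to render the final answer
    return state

state: dict = {}
for line in [
    '{"type":"message.started","request_id":"req_123"}',
    '{"type":"output_text.delta","text":"The"}',
    '{"type":"output_text.delta","text":" answer"}',
    '{"type":"message.completed"}',
]:
    state = apply_event(state, line)
```

Because every event is typed, a reconnecting client can replay from the last event it processed instead of restarting the whole generation.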
Step 2: Treat conversation state as product data
There are three common state models:
| Model | What it means | Use it when |
|---|---|---|
| Client-held history | The browser sends prior messages each turn | Low-risk consumer apps, simple prototypes |
| Server session store | Backend stores messages and sends a compact window to the model | Most production apps |
| Retrieved memory | Past turns are embedded and retrieved like documents | Long-lived assistants and workflow systems |
Do not blindly replay the full conversation. Context windows are larger now, but cost and attention quality still matter. Use a sliding window for recent turns, a summary for older turns, and retrieval for specific prior facts.
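The sliding-window-plus-summary approach can be sketched as follows. `summarize` and `retrieve` are placeholders for your own summarizer and retriever, and the message shape is an assumption:

```python
def build_context(turns, summarize, retrieve, query, recent_n=6):
    """Assemble model context from three sources:
    a summary of older turns, retrieved prior facts, and a sliding
    window of the most recent turns."""
    older, recent = turns[:-recent_n], turns[-recent_n:]
    parts = []
    if older:
        parts.append({"role": "system", "content": "Summary: " + summarize(older)})
    for fact in retrieve(query):
        parts.append({"role": "system", "content": "Relevant prior fact: " + fact})
    parts.extend(recent)  # recent turns go in verbatim
    return parts

turns = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
context = build_context(
    turns,
    summarize=lambda old: f"{len(old)} earlier turns about account setup",
    retrieve=lambda q: ["the user is on the Pro tier"],
    query="billing",
)
```

The budget split (how many recent turns, how long a summary, how many retrieved facts) should come from the orchestrator's context budget allocation, not hard-coded constants.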
The dangerous part is tenant separation. Every session, prompt template, retrieved document, tool call, and cache entry needs a tenant boundary. If you share a semantic cache across tenants without strict namespacing, you have built a data leak with a nice latency profile.
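One simple guard: make the tenant id part of every cache key, so cross-tenant hits are structurally impossible. A sketch (the key layout is an assumption, not a standard):

```python
import hashlib

def cache_key(tenant_id: str, prompt_version: str, normalized_query: str) -> str:
    """Semantic-cache key namespaced by tenant and prompt version.
    Entries can never be served across tenants or across prompt changes."""
    digest = hashlib.sha256(normalized_query.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{prompt_version}:{digest}"
```

Including the prompt version also invalidates cached answers automatically when the template changes.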
Step 3: Make the orchestrator explicit
The orchestrator is where the application becomes more than a prompt wrapper. It owns:
- prompt version selection
- model routing
- retrieval decisions
- tool-call permissions
- context budget allocation
- retries and fallbacks
- output validation
- trace correlation
Keep it boring. A deterministic state machine is easier to debug than a pile of hidden framework callbacks. For many systems, this is enough:
classify request
-> decide route
-> retrieve evidence if needed
-> assemble prompt
-> call model
-> execute approved tool calls
-> validate output
-> stream final response
-> record trace and cost

Agents can be useful, but do not make every request an agent loop. A password reset question, a refund policy lookup, and a code migration task do not need the same control flow.
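The state machine above can be written as a plain sequence of function calls. This is a sketch with injected dependencies (tool execution is omitted for brevity; every name here is a placeholder for your own components):

```python
def handle_request(request, deps):
    """Deterministic orchestrator pipeline: each step is an explicit
    call whose result lands in the trace, not a hidden framework callback."""
    trace = {"request_id": request["id"], "steps": []}

    route = deps["classify"](request)                     # classify -> decide route
    trace["steps"].append(("route", route["model"]))

    evidence = deps["retrieve"](request) if route["needs_retrieval"] else []
    prompt = deps["assemble"](request, evidence)          # assemble prompt
    output = deps["call_model"](route["model"], prompt)   # call model

    if not deps["validate"](output):                      # validate output
        output = deps["fallback"](request)
    trace["steps"].append(("completed", True))
    return output, trace

deps = {
    "classify": lambda req: {"needs_retrieval": False, "model": "small"},
    "retrieve": lambda req: [],
    "assemble": lambda req, ev: req["text"],
    "call_model": lambda model, prompt: f"answer to {prompt}",
    "validate": lambda out: True,
    "fallback": lambda req: "escalate to a human",
}
output, trace = handle_request({"id": "req_1", "text": "hi"}, deps)
```

Because every transition is explicit, a failing request can be replayed step by step from its trace.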
Step 4: Put a gateway between your app and model providers
An LLM gateway pays for itself quickly. It should handle:
- provider credentials
- routing by model, tenant, feature, and region
- retries with backoff
- fallback models
- timeout budgets
- token accounting
- per-tenant cost attribution
- audit logs
Do not let every service call model providers directly. That creates duplicate retry logic, inconsistent safety policy, and impossible cost attribution.
The gateway also owns idempotency. Token-charging APIs make duplicate retries expensive. If the client retries the same request after a timeout, the backend should be able to return the prior result or resume the stream instead of generating twice.
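A minimal in-memory sketch of that behavior, assuming the client sends an idempotency key with every request (production would use Redis or a database with TTLs, and would also handle stream resumption):

```python
import threading

class IdempotencyStore:
    """Return a prior result for a repeated idempotency key
    instead of running the expensive generation twice."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def get_or_run(self, key, fn):
        """Returns (result, replayed). Best-effort: two truly concurrent
        first calls may both run fn; setdefault keeps exactly one result."""
        with self._lock:
            if key in self._results:
                return self._results[key], True   # replay, no second model call
        result = fn()
        with self._lock:
            self._results.setdefault(key, result)
            return self._results[key], False

store = IdempotencyStore()
calls = []
generate = lambda: calls.append(1) or "the expensive answer"
first, replayed_1 = store.get_or_run("req_123", generate)
second, replayed_2 = store.get_or_run("req_123", generate)
```

The key should come from the client, not the gateway, so that a client-side retry after a timeout maps to the same key as the original attempt.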
Step 5: Add graceful degradation
There are four failure modes worth designing for on day one:
| Failure | Good degradation |
|---|---|
| Primary model unavailable | Route to a smaller or alternate provider model |
| Retrieval unavailable | Answer only if the task is safe without retrieval, otherwise explain the limitation |
| Tool unavailable | Continue with a partial answer only if the tool is optional |
| Streaming interrupted | Let the client reconnect with request id and recover final state |
Graceful degradation is not fake confidence. If the answer needs fresh retrieved evidence, say the system cannot complete the task rather than producing a polished guess.
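The first row of the table, a fallback chain across models, can be sketched like this. `call` stands in for your provider client, and the model names are placeholders:

```python
def call_with_fallback(prompt, models, call):
    """Try each model in order; surface the full error list only
    when the whole chain fails."""
    errors = []
    for model in models:
        try:
            return call(model, prompt), model
        except Exception as exc:   # in practice, catch provider-specific errors
            errors.append((model, repr(exc)))
    raise RuntimeError(f"all models failed: {errors}")

def flaky_provider(model, prompt):
    if model == "primary-large":
        raise TimeoutError("provider timeout")
    return f"{model}: ok"

result, used_model = call_with_fallback(
    "hello", ["primary-large", "backup-small"], flaky_provider
)
```

Record which model actually answered in the trace; silent fallbacks make quality regressions impossible to diagnose.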
Step 6: Define the minimum telemetry contract
Every request should produce a trace with:
- tenant id and feature name
- prompt version
- model route and provider
- retrieval query, document ids, and rerank scores
- tool calls, arguments, status, and latency
- TTFT (time to first token), TPOT (time per output token), total latency
- input tokens, output tokens, cached tokens
- final status and user feedback
Redact sensitive fields before storage. Observability that stores raw private prompts without a policy will eventually become the incident.
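A sketch of such a trace record with redaction applied at the storage boundary. The field names follow the list above; which fields count as sensitive is a policy decision, and the single redacted field here is only an illustration:

```python
from dataclasses import dataclass, asdict

# Fields redacted before storage; extend per your data policy.
SENSITIVE_FIELDS = {"retrieval_query"}

@dataclass
class RequestTrace:
    tenant_id: str
    feature: str
    prompt_version: str
    model_route: str
    retrieval_query: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    ttft_ms: float = 0.0
    status: str = "ok"

    def for_storage(self) -> dict:
        """Redact sensitive fields before the record leaves the service."""
        record = asdict(self)
        for key in SENSITIVE_FIELDS:
            record[key] = "[REDACTED]"
        return record

trace = RequestTrace(
    tenant_id="tenant_a",
    feature="support_chat",
    prompt_version="v12",
    model_route="small-fast",
    retrieval_query="user asked about their invoice",
)
stored = trace.for_storage()
```

Redacting at serialization time, rather than in the collector, means the raw query never crosses a service boundary.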
A practical build order
Build in this order:
- Gateway with auth, rate limits, and request ids.
- Basic streaming model call.
- Conversation session store.
- Prompt registry with versioned templates.
- Retrieval path.
- Tool execution path.
- Tracing and cost attribution.
- Fallback model chain.
- Eval set and release gate.
- Human escalation for low-confidence or high-risk outputs.
The order matters. If you add retrieval and tools before request identity, tracing, and idempotency, debugging becomes guesswork.
Sources and receipts
- OpenAI, “Streaming API responses”: https://platform.openai.com/docs/guides/streaming-responses
- OpenAI API Reference, “Streaming events”: https://platform.openai.com/docs/api-reference/responses-streaming
- MDN, “Using server-sent events”: https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events
- MDN, “WebSocket”: https://developer.mozilla.org/docs/Web/API/WebSocket
- AWS Well-Architected, “Make all responses idempotent”: https://docs.aws.amazon.com/wellarchitected/2023-04-10/framework/rel_prevent_interaction_failure_idempotent.html
