Production LLM Systems Tutorial 3: Scalable Inference Architecture
Tutorial Series
- End-to-End Application Design
- Latency, Cost, and Quality
- Scalable Inference Architecture
- RAG and Data Pipelines
- Monitoring and Observability
- Evaluation and A/B Testing
- Security and Prompt Injection
- Human-in-the-Loop Workflows
- Cost Optimization
- Versioning and Disaster Recovery
Scalable LLM inference is not “put a model behind an endpoint.”
The hard parts are memory, queueing, routing, cold starts, and isolation. GPUs are expensive, but idle GPUs are worse. A serving architecture must keep accelerators busy without destroying TTFT, overflowing KV cache, or giving one tenant the whole cluster.
This tutorial builds the inference layer from the GPU up.
The serving stack
A practical stack looks like this:
```
LLM gateway
  -> router
       route by model, tenant, prompt prefix, queue, KV locality
  -> scheduler
       admission control, batching, priorities
  -> inference workers
       vLLM, TensorRT-LLM, SGLang, TGI, Triton
  -> GPU runtime
       CUDA kernels, NCCL, memory manager
  -> observability
       tokens/sec, TTFT, TPOT, KV cache, queue depth
```
The router and scheduler are as important as the model server. If you send requests randomly, you waste prefix cache, increase queueing, and make tail latency look mysterious.
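To make "route by model, tenant, prompt prefix" concrete, here is a minimal sketch of the envelope a gateway might attach to each request before it reaches the router. The field names and the prefix-hash scheme are illustrative assumptions, not taken from any particular framework.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RouteKey:
    """Illustrative routing key computed at the gateway; names are hypothetical."""
    model: str                       # which model the request targets
    tenant: str                      # tenant ID, used for quotas and isolation
    prefix_hash: str                 # hash of the prompt prefix, used for KV-locality scoring
    session_id: Optional[str] = None # sticky session for multi-turn chat

def make_route_key(model: str, tenant: str, prompt: str,
                   session_id: Optional[str] = None,
                   prefix_chars: int = 2048) -> RouteKey:
    # Hash only the leading portion of the prompt: that is the part most
    # likely to be shared across turns and therefore already cached on a worker.
    prefix = prompt[:prefix_chars].encode("utf-8")
    return RouteKey(
        model=model,
        tenant=tenant,
        prefix_hash=hashlib.sha256(prefix).hexdigest()[:16],
        session_id=session_id,
    )
```

The scheduler and router downstream key their decisions off this envelope instead of re-parsing the request body.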
Choose the serving engine by workload
| Engine | Good fit | Watch out for |
|---|---|---|
| vLLM | High-throughput serving, continuous batching, PagedAttention, broad model support | Feature compatibility varies by model and backend |
| TensorRT-LLM | Optimized NVIDIA GPU inference, engine-level performance tuning | Build and deployment workflow is more specialized |
| SGLang | Structured generation, prefix reuse, complex generation programs | Operational maturity depends on your stack and hardware |
| TGI | Hugging Face ecosystem integration | Throughput features vary by model and deployment |
| Triton | Multi-model inference serving and custom backends | You still need model-specific optimization |
The right answer is usually workload-specific. Benchmark your prompt length distribution, output length distribution, concurrency, and model family. A leaderboard result without your traffic shape is a weak signal.
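A hedged sketch of the measurement that matters: stream one request to an OpenAI-compatible endpoint (which vLLM, TGI, and others can expose) and record TTFT and decode throughput. The URL, model name, payload shape, and SSE parsing are placeholders to adapt to your own deployment.

```python
import time
import requests  # third-party: pip install requests

def measure_stream(base_url: str, model: str, prompt: str, max_tokens: int = 256):
    """Stream one completion and report TTFT and decode throughput.

    Assumes an OpenAI-compatible /v1/completions endpoint with SSE streaming;
    adjust the payload, auth headers, and parsing for your actual server.
    """
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    with requests.post(f"{base_url}/v1/completions", json=payload,
                       stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            if line[len(b"data: "):] == b"[DONE]":
                break
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
            chunks += 1  # each SSE event usually carries one token, so this approximates output tokens
    end = time.perf_counter()
    ttft = (first_chunk_at or end) - start
    decode_time = end - (first_chunk_at or end)
    return {"ttft_s": ttft,
            "chunks": chunks,
            "chunks_per_s": chunks / decode_time if decode_time > 0 else 0.0}
```

Run it across your real prompt and output length distribution at realistic concurrency; a single-request number tells you little about batched behavior.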
Understand KV cache pressure
During generation, each request accumulates key and value tensors for prior tokens. That KV cache grows with:
- batch size
- sequence length
- number of layers
- hidden size and heads
- precision
- number of active sessions
A rough mental model:
```
KV cache memory ~= tokens * layers * kv_heads * head_dim * 2 * bytes
```
The exact formula depends on model architecture, but the lesson is stable: long-context serving is often memory-bound before it is compute-bound.
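A quick back-of-the-envelope calculation makes the pressure obvious. The sketch below plugs in Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128, fp16) as an assumed example; substitute your own model's config.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # The factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16/bf16.
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_elem

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(1, layers=32, kv_heads=32, head_dim=128)
print(per_token / 1024, "KiB per token")                                 # ~512 KiB
print(kv_cache_bytes(4096, 32, 32, 128) / 2**30, "GiB per 4k-token seq") # ~2 GiB
```

At roughly 2 GiB per 4k-token sequence, a few dozen concurrent long-context requests exhaust a single GPU's memory before compute is the bottleneck, which is exactly why KV cache management dominates serving design.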
PagedAttention, used by vLLM, treats KV cache more like paged virtual memory. Instead of reserving one large contiguous block per request, it manages blocks to reduce fragmentation and improve batching.
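As a toy illustration of the paging idea (not vLLM's actual implementation), KV memory can be treated as a pool of fixed-size blocks handed out per sequence only as tokens arrive:

```python
class ToyBlockPool:
    """Toy fixed-size KV block allocator. Illustrates paging; not vLLM internals."""

    def __init__(self, num_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request -> block indices
        self.token_counts: dict[str, int] = {}        # request -> tokens stored

    def append_token(self, request_id: str) -> bool:
        """Account for one token, grabbing a new block only at block boundaries.
        Returns False when the pool is exhausted (caller must preempt or shed)."""
        tokens = self.token_counts.get(request_id, 0)
        if tokens % self.block_tokens == 0:       # current block full, or first token
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.token_counts[request_id] = tokens + 1
        return True

    def release(self, request_id: str) -> None:
        """Return all blocks when the request finishes; nothing stays fragmented."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```

Because every request holds only the blocks it actually uses, freed blocks go straight back to the pool, and short and long sequences can share a GPU without reserving worst-case contiguous space.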
Pick the parallelism pattern
Large models need parallelism. Use the smallest parallelism that fits the model and throughput target.
| Pattern | What it splits | Use it when |
|---|---|---|
| Tensor parallelism | Matrix operations across GPUs | Model fits across GPUs in one node or tightly connected nodes |
| Pipeline parallelism | Layers across stages | Model is too deep or large for tensor parallel alone |
| Expert parallelism | MoE experts across devices | Serving mixture-of-experts models |
| Sequence parallelism | Sequence dimension work | Long-context workloads and specialized kernels |
Tensor parallelism within a node is common because NVLink/NVSwitch bandwidth helps. Pipeline parallelism across nodes can work, but bubbles and network latency matter. Expert parallelism is relevant for MoE models where only a subset of experts activates per token.
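As one concrete example, vLLM exposes parallel degrees as engine arguments. A minimal launch sketch, with the model name as a placeholder and exact argument support depending on your vLLM version:

```python
from vllm import LLM  # assumes vLLM is installed; argument names may vary by version

# Tensor-parallel across 4 GPUs in one node; some versions also accept
# pipeline_parallel_size for splitting layers across nodes.
llm = LLM(model="your-org/your-70b-model", tensor_parallel_size=4)
```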
Autoscale on the right signal
CPU utilization is the wrong primary signal for LLM serving.
Use:
- queue depth by model route
- waiting time before prefill
- TTFT p95
- TPOT p95
- GPU memory headroom
- KV cache utilization
- active sequences
- tokens per second
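A hedged sketch of a scale-out decision driven by these signals rather than CPU; thresholds, metric names, and the policy itself are illustrative and not tied to any particular autoscaler.

```python
from dataclasses import dataclass

@dataclass
class RouteMetrics:
    """Snapshot of serving signals for one model route (field names illustrative)."""
    queue_depth: int
    ttft_p95_s: float
    kv_cache_utilization: float   # 0.0 - 1.0
    active_replicas: int

def desired_replicas(m: RouteMetrics,
                     max_queue_per_replica: int = 8,
                     ttft_slo_s: float = 1.5,
                     kv_high_water: float = 0.85,
                     min_replicas: int = 2) -> int:
    """Scale out on queue, TTFT, or KV pressure; never drop below the warm floor."""
    desired = m.active_replicas
    if m.queue_depth > max_queue_per_replica * m.active_replicas:
        desired += 1
    if m.ttft_p95_s > ttft_slo_s or m.kv_cache_utilization > kv_high_water:
        desired += 1
    # Scale-in is handled separately, slowly, and only after draining in-flight sequences.
    return max(desired, min_replicas)
```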
Cold start is brutal for large models. Loading 70B-class weights can take tens of seconds depending on storage, network, framework, quantization, and engine build. Scale from zero is usually not acceptable for interactive routes.
Use warm pools:
```
min replicas:  enough for baseline traffic
warm standby:  loaded model, not fully saturated
scale out:     based on queue and TTFT
scale in:      slow, with drain period
```
For self-hosted fleets, model loading is part of capacity planning. A pod that exists but has not loaded weights is not capacity.
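One way to keep "pod exists" and "pod is capacity" separate is to gate readiness on weight loading, so the load balancer and autoscaler only count workers that can actually serve. A framework-agnostic sketch; the endpoint names and loading hook are assumptions.

```python
import threading

weights_loaded = threading.Event()

def load_model_in_background(load_fn):
    """Start weight loading off the request path and flip readiness when done."""
    def _load():
        load_fn()              # e.g. build the engine / load weights into GPU memory
        weights_loaded.set()
    threading.Thread(target=_load, daemon=True).start()

def livez() -> tuple[int, str]:
    # Liveness: the process is up, even while still loading.
    return 200, "alive"

def readyz() -> tuple[int, str]:
    # Readiness: only report ready once weights are in memory, so a pod that is
    # still loading is never counted as capacity or sent traffic.
    if weights_loaded.is_set():
        return 200, "ready"
    return 503, "loading weights"
```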
Route with KV locality
Round-robin is simple. It is also often wrong for chat.
If turn one of a conversation lands on worker A, worker A may hold useful prefix cache for turn two. Sending turn two to worker B forces a cold prefill. KV-aware routing scores workers by:
- prompt prefix overlap
- session affinity
- current queue
- memory pressure
- expected output length
- tenant priority
A basic scoring function:
```
score(worker) =
    0.45 * prefix_cache_match
  - 0.25 * queue_delay
  - 0.20 * kv_pressure
  + 0.10 * tenant_priority
```
The weights will vary, but the principle is the same: cache locality is a serving signal.
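The same scoring idea as a runnable sketch; the weights, signal ranges, and field names are assumptions to tune against your own traffic.

```python
from dataclasses import dataclass

@dataclass
class WorkerState:
    """Signals the router tracks per worker (all normalized to 0.0 - 1.0)."""
    prefix_cache_match: float   # fraction of the prompt prefix already cached
    queue_delay: float          # current queue delay relative to the SLO
    kv_pressure: float          # KV cache utilization
    tenant_priority: float      # priority of this tenant on this worker pool

def score(worker: WorkerState) -> float:
    return (0.45 * worker.prefix_cache_match
            - 0.25 * worker.queue_delay
            - 0.20 * worker.kv_pressure
            + 0.10 * worker.tenant_priority)

def pick_worker(workers: dict[str, WorkerState]) -> str:
    # Highest score wins; ties fall back to dict order, which is arbitrary.
    return max(workers, key=lambda name: score(workers[name]))
```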
Multi-tenant serving
Tenant isolation needs more than auth.
At inference time, enforce:
- per-tenant rate limits
- max input tokens
- max output tokens
- max concurrent requests
- model allowlists
- LoRA adapter allowlists
- prompt namespace isolation
- cache namespace isolation
- trace and cost separation
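A minimal admission-control sketch that enforces a few of these limits before a request reaches the scheduler; the limit values, field names, and policy store are placeholders for your own configuration.

```python
from dataclasses import dataclass

@dataclass
class TenantLimits:
    max_input_tokens: int
    max_output_tokens: int
    max_concurrent: int
    allowed_models: frozenset

class AdmissionError(Exception):
    pass

def admit(tenant: str, model: str, input_tokens: int, requested_output: int,
          in_flight: dict, limits: dict) -> None:
    """Reject before the request consumes GPU time or KV cache."""
    policy = limits.get(tenant)
    if policy is None:
        raise AdmissionError("unknown tenant")
    if model not in policy.allowed_models:
        raise AdmissionError("model not on tenant allowlist")
    if input_tokens > policy.max_input_tokens:
        raise AdmissionError("input exceeds tenant limit")
    if requested_output > policy.max_output_tokens:
        raise AdmissionError("requested output exceeds tenant limit")
    if in_flight.get(tenant, 0) >= policy.max_concurrent:
        raise AdmissionError("tenant concurrency limit reached")
```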
For fine-tuned tenant adapters, multi-LoRA serving can improve utilization by serving many adapters on shared base weights. But adapter routing must be strict. One tenant should never receive another tenant’s adapter or cached output.
Failure modes to test
| Failure | Test |
|---|---|
| One worker OOMs under long context | Verify admission control rejects or routes before OOM |
| Provider route slows down | Verify fallback and circuit breaker |
| KV cache fills | Verify scheduler sheds low-priority or long-output traffic |
| New model rollout regresses TPOT | Canary by route and compare token metrics |
| Tenant spikes traffic | Verify quota protects other tenants |
The serving layer is successful when overload is boring. Requests should queue, degrade, or fail intentionally instead of taking the fleet down.
Sources and receipts
- vLLM documentation: https://docs.vllm.ai/
- Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention”: https://arxiv.org/abs/2309.06180
- NVIDIA TensorRT-LLM documentation: https://docs.nvidia.com/tensorrt-llm/
- NVIDIA Triton TensorRT-LLM backend guide: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/trtllm_user_guide.html
- SGLang documentation: https://docs.sglang.io/
