Production LLM Systems Tutorial 3: Scalable Inference Architecture

Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

Scalable LLM inference is not “put a model behind an endpoint.”

The hard parts are memory, queueing, routing, cold starts, and isolation. GPUs are expensive, but idle GPUs are worse. A serving architecture must keep accelerators busy without destroying time to first token (TTFT), overflowing the KV cache, or giving one tenant the whole cluster.

This tutorial builds the inference layer from the GPU up.

[Figure: scalable LLM inference fleet with gateway, router, scheduler, GPU workers, KV cache pressure, queue depth, and autoscaling signals]
Scalable inference is a fleet problem: routing, scheduling, cache locality, and memory pressure determine service behavior.

The serving stack

A practical stack looks like this:

LLM gateway
  -> router
       route by model, tenant, prompt prefix, queue, KV locality
  -> scheduler
       admission control, batching, priorities
  -> inference workers
       vLLM, TensorRT-LLM, SGLang, TGI, Triton
  -> GPU runtime
       CUDA kernels, NCCL, memory manager
  -> observability
       tokens/sec, TTFT, TPOT, KV cache, queue depth

The router and scheduler are as important as the model server. If you send requests randomly, you waste prefix cache, increase queueing, and make tail latency look mysterious.

Choose the serving engine by workload

Engine        | Good fit                                                                           | Watch out for
vLLM          | High-throughput serving, continuous batching, PagedAttention, broad model support | Feature compatibility varies by model and backend
TensorRT-LLM  | Optimized NVIDIA GPU inference, engine-level performance tuning                   | Build and deployment workflow is more specialized
SGLang        | Structured generation, prefix reuse, complex generation programs                  | Operational maturity depends on your stack and hardware
TGI           | Hugging Face ecosystem integration                                                | Throughput features vary by model and deployment
Triton        | Multi-model inference serving and custom backends                                 | You still need model-specific optimization

The right answer is usually workload-specific. Benchmark your prompt length distribution, output length distribution, concurrency, and model family. A leaderboard result without your traffic shape is a weak signal.
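
As a minimal sketch of what benchmarking with your own traffic shape can look like: the endpoint URL, model name, and length buckets below are assumptions, and a real harness would sample lengths from production logs and drive concurrent workers rather than a serial loop.

import random, time
import requests  # assumption: an OpenAI-compatible streaming completions endpoint

BASE_URL = "http://localhost:8000/v1/completions"  # hypothetical route

def sample_request():
    # Assumption: in practice, draw these from your production logs.
    return {
        "model": "my-model",  # placeholder model name
        "prompt": "x " * random.choice([200, 800, 3000]),
        "max_tokens": random.choice([64, 256, 1024]),
        "stream": True,
    }

def measure_one():
    start = time.perf_counter()
    chunks = []
    with requests.post(BASE_URL, json=sample_request(), stream=True) as r:
        for line in r.iter_lines():
            if line:  # skip keep-alive blanks
                chunks.append(time.perf_counter())
    ttft = chunks[0] - start  # time to first streamed chunk
    tpot = ((chunks[-1] - chunks[0]) / (len(chunks) - 1)) if len(chunks) > 1 else 0.0
    return ttft, tpot

# Serial driver for illustration only; real benchmarks run concurrent workers.
ttfts = sorted(measure_one()[0] for _ in range(50))
print("TTFT p50 / p95:", ttfts[len(ttfts) // 2], ttfts[int(0.95 * len(ttfts))])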

Understand KV cache pressure

During generation, each request accumulates key and value tensors for prior tokens. That KV cache grows with:

  • batch size
  • sequence length
  • number of layers
  • hidden size and heads
  • precision
  • number of active sessions

A rough mental model:

KV cache memory ~= tokens * layers * kv_heads * head_dim * 2 * bytes_per_element

The factor of 2 counts keys and values, and bytes_per_element follows your serving precision. The exact formula depends on model architecture, but the lesson is stable: long-context serving is often memory-bound before it is compute-bound.
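
To make the formula concrete, here is a small calculator with illustrative numbers for a hypothetical 8B-class model using grouped-query attention (32 layers, 8 KV heads, head_dim 128, fp16); your model's config will differ.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_element=2):
    # factor of 2: one set of tensors for keys, one for values
    # bytes_per_element=2 corresponds to fp16/bf16
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_element

per_token = kv_cache_bytes(1, 32, 8, 128)          # 131,072 B = 128 KiB per token
one_8k_session = kv_cache_bytes(8192, 32, 8, 128)  # exactly 1 GiB per 8k session
print(f"{per_token / 2**10:.0f} KiB/token, {one_8k_session / 2**30:.1f} GiB per 8k session")

At those illustrative numbers, forty concurrent 8k-token sessions consume about 40 GiB of KV cache before weights are even counted.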

PagedAttention, used by vLLM, treats KV cache more like paged virtual memory. Instead of reserving one large contiguous block per request, it manages blocks to reduce fragmentation and improve batching.
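
A toy sketch of the block-pool idea (in the spirit of PagedAttention, not vLLM's actual implementation): requests claim fixed-size token blocks from a shared free list instead of contiguous per-request reservations.

class BlockPool:
    # Toy block-pool allocator; block ids stand in for GPU memory blocks.
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared free list of block ids
        self.tables = {}                     # request id -> allocated block ids
        self.tokens = {}                     # request id -> token count

    def append_token(self, req_id):
        n = self.tokens.get(req_id, 0)
        if n % self.block_size == 0:  # current block full, or first token
            if not self.free:
                raise MemoryError("KV pool exhausted: queue or shed the request")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.tokens[req_id] = n + 1

    def release(self, req_id):
        # Finished requests return whole blocks; nothing to defragment.
        self.free.extend(self.tables.pop(req_id, []))
        self.tokens.pop(req_id, None)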

[Figure: KV cache pressure diagram showing memory growth with tokens, layers, KV heads, head dimension, precision, and active sessions]
Long-context serving often becomes memory-bound through KV cache growth before compute is fully saturated.

Pick the parallelism pattern

Large models need parallelism. Use the smallest parallelism that fits the model and throughput target.

Pattern              | What it splits                | Use it when
Tensor parallelism   | Matrix operations across GPUs | Model fits across GPUs in one node or tightly connected nodes
Pipeline parallelism | Layers across stages          | Model is too deep or large for tensor parallel alone
Expert parallelism   | MoE experts across devices    | Serving mixture-of-experts models
Sequence parallelism | Sequence dimension work       | Long-context workloads and specialized kernels

Tensor parallelism within a node is common because NVLink/NVSwitch bandwidth helps. Pipeline parallelism across nodes can work, but bubbles and network latency matter. Expert parallelism is relevant for MoE models where only a subset of experts activates per token.
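
A hedged sketch of "smallest parallelism that fits": pick the smallest power-of-two tensor-parallel degree whose per-GPU weight shard leaves headroom for KV cache. The headroom fraction is a placeholder, and the sketch ignores activations, engine overhead, and interconnect limits.

def min_tp_degree(params, bytes_per_param, gpu_mem_gib, kv_headroom=0.35):
    # Smallest power-of-two TP degree whose weight shard fits the budget,
    # reserving kv_headroom of GPU memory for KV cache and overhead.
    budget = gpu_mem_gib * 2**30 * (1 - kv_headroom)
    tp = 1
    while params * bytes_per_param / tp > budget:
        tp *= 2
    return tp

# Hypothetical: 70e9 params in fp16 on 80 GiB GPUs -> 4
print(min_tp_degree(70e9, 2, 80))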

Autoscale on the right signal

CPU utilization is the wrong primary signal for LLM serving.

Use:

  • queue depth by model route
  • waiting time before prefill
  • TTFT p95
  • TPOT p95 (time per output token)
  • GPU memory headroom
  • KV cache utilization
  • active sequences
  • tokens per second

Cold start is brutal for large models. Loading 70B-class weights can take tens of seconds depending on storage, network, framework, quantization, and engine build. Scaling from zero is usually not acceptable for interactive routes.

Use warm pools:

min replicas: enough for baseline traffic
warm standby: loaded model, not fully saturated
scale out: based on queue and TTFT
scale in: slow, with drain period
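
A sketch of a scale decision driven by these signals rather than CPU; every threshold below is a placeholder to tune per route.

from dataclasses import dataclass

@dataclass
class RouteSignals:
    queue_depth: int       # requests waiting before prefill
    ttft_p95_s: float      # seconds
    kv_utilization: float  # 0.0 - 1.0

def desired_replicas(current, s, lo=2, hi=16):
    # Scale out on queue pressure, TTFT breach, or KV saturation.
    if s.queue_depth > 20 or s.ttft_p95_s > 1.5 or s.kv_utilization > 0.90:
        return min(hi, current + max(1, current // 2))
    # Scale in slowly, one replica at a time, only when clearly idle;
    # the orchestrator should drain in-flight sequences before termination.
    if s.queue_depth == 0 and s.ttft_p95_s < 0.4 and s.kv_utilization < 0.50:
        return max(lo, current - 1)
    return current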

For self-hosted fleets, model loading is part of capacity planning. A pod that exists but has not loaded weights is not capacity.

Route with KV locality

Round-robin is simple. It is also often wrong for chat.

If turn one of a conversation lands on worker A, worker A may hold useful prefix cache for turn two. Sending turn two to worker B forces a cold prefill. KV-aware routing scores workers by:

  • prompt prefix overlap
  • session affinity
  • current queue
  • memory pressure
  • expected output length
  • tenant priority

A basic scoring function:

score(worker) =
  0.45 * prefix_cache_match
  - 0.25 * queue_delay
  - 0.20 * kv_pressure
  + 0.10 * tenant_priority

The weights will vary, but the principle is the same: cache locality is a serving signal.
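
The same scoring function as runnable Python; the signal definitions and normalizations are assumptions to calibrate against your own traces.

from dataclasses import dataclass

@dataclass
class WorkerState:
    prefix_cache_match: float  # 0..1: fraction of prompt tokens already cached
    queue_delay: float         # 0..1: normalized expected wait before prefill
    kv_pressure: float         # 0..1: KV cache utilization
    tenant_priority: float     # 0..1: this tenant's priority on this worker

def score(w):
    return (0.45 * w.prefix_cache_match
            - 0.25 * w.queue_delay
            - 0.20 * w.kv_pressure
            + 0.10 * w.tenant_priority)

def pick_worker(workers):
    # workers: dict of worker id -> WorkerState; highest score wins
    return max(workers, key=lambda wid: score(workers[wid]))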

Multi-tenant serving

Tenant isolation needs more than auth.

At inference time, enforce:

  • per-tenant rate limits
  • max input tokens
  • max output tokens
  • max concurrent requests
  • model allowlists
  • LoRA adapter allowlists
  • prompt namespace isolation
  • cache namespace isolation
  • trace and cost separation

For fine-tuned tenant adapters, multi-LoRA serving can improve utilization by serving many adapters on shared base weights. But adapter routing must be strict. One tenant should never receive another tenant’s adapter or cached output.
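
A minimal admission sketch for these per-tenant limits, including a strict adapter allowlist. Field and function names are illustrative, not any particular gateway's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class TenantPolicy:
    max_input_tokens: int
    max_output_tokens: int
    max_concurrent: int
    models: frozenset    # model allowlist
    adapters: frozenset  # LoRA adapter allowlist

def admit(policy, req, in_flight):
    # Enforce limits before the request touches a GPU; fail closed on adapters.
    if req["model"] not in policy.models:
        return "reject: model not in allowlist"
    adapter = req.get("adapter")
    if adapter is not None and adapter not in policy.adapters:
        return "reject: adapter not in tenant allowlist"
    if req["input_tokens"] > policy.max_input_tokens:
        return "reject: input too long"
    if req["max_output_tokens"] > policy.max_output_tokens:
        return "reject: output cap exceeded"
    if in_flight >= policy.max_concurrent:
        return "queue: tenant concurrency limit"
    # Cache keys should also be namespaced by tenant id so cached
    # outputs and prefixes never cross tenants.
    return "admit"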

Failure modes to test

Failure                            | Test
One worker OOMs under long context | Verify admission control rejects or routes before OOM
Provider route slows down          | Verify fallback and circuit breaker
KV cache fills                     | Verify scheduler sheds low-priority or long-output traffic
New model rollout regresses TPOT   | Canary by route and compare token metrics
Tenant spikes traffic              | Verify quota protects other tenants

The serving layer is successful when overload is boring. Requests should queue, degrade, or fail intentionally instead of taking the fleet down.
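
For the "KV cache fills" row, a sketch of the kind of test to write, reusing the toy BlockPool from the KV cache section above (pytest-style, against your own pool abstraction in practice):

def test_kv_pool_sheds_instead_of_crashing():
    pool = BlockPool(num_blocks=4, block_size=16)  # 64-token capacity
    for _ in range(64):                            # one long request fills the pool
        pool.append_token("req-long")
    try:
        pool.append_token("req-new")               # next request must be shed cleanly
        assert False, "expected MemoryError, got silent allocation"
    except MemoryError:
        pass
    pool.release("req-long")                       # freed capacity unblocks queued work
    pool.append_token("req-new")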

Sources and receipts