Production LLM Systems Tutorial 3: Scalable Inference Architecture

Tutorial Series

  1. End-to-End Application Design
  2. Latency, Cost, and Quality
  3. Scalable Inference Architecture
  4. RAG and Data Pipelines
  5. Monitoring and Observability
  6. Evaluation and A/B Testing
  7. Security and Prompt Injection
  8. Human-in-the-Loop Workflows
  9. Cost Optimization
  10. Versioning and Disaster Recovery

Scalable LLM inference is not “put a model behind an endpoint.”

The hard parts are memory, queueing, routing, cold starts, and isolation. GPUs are expensive, but idle GPUs are worse. A serving architecture must keep accelerators busy without destroying time to first token (TTFT), overflowing the KV cache, or giving one tenant the whole cluster.

This tutorial builds the inference layer from the GPU up.

[Figure: scalable LLM inference fleet with gateway, router, scheduler, GPU workers, KV cache pressure, queue depth, and autoscaling signals]
Scalable inference is a fleet problem: routing, scheduling, cache locality, and memory pressure determine service behavior.

The serving stack

A practical stack looks like this:

LLM gateway
  -> router
       route by model, tenant, prompt prefix, queue, KV locality
  -> scheduler
       admission control, batching, priorities
  -> inference workers
       vLLM, TensorRT-LLM, SGLang, TGI, Triton
  -> GPU runtime
       CUDA kernels, NCCL, memory manager
  -> observability
       tokens/sec, TTFT, TPOT, KV cache, queue depth

The router and scheduler are as important as the model server. If you send requests randomly, you waste prefix cache, increase queueing, and make tail latency look mysterious.

Choose the serving engine by workload

Engine        | Good fit                                                                           | Watch out for
vLLM          | High-throughput serving, continuous batching, PagedAttention, broad model support | Feature compatibility varies by model and backend
TensorRT-LLM  | Optimized NVIDIA GPU inference, engine-level performance tuning                   | Build and deployment workflow is more specialized
SGLang        | Structured generation, prefix reuse, complex generation programs                  | Operational maturity depends on your stack and hardware
TGI           | Hugging Face ecosystem integration                                                | Throughput features vary by model and deployment
Triton        | Multi-model inference serving and custom backends                                 | You still need model-specific optimization

The right answer is usually workload-specific. Benchmark your prompt length distribution, output length distribution, concurrency, and model family. A leaderboard result without your traffic shape is a weak signal.
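
As a minimal sketch of what benchmarking with your own traffic shape can look like: the endpoint URL, model name, and length buckets below are assumptions, and a real harness would sample lengths from production logs and drive concurrent workers rather than a serial loop.

import random, time
import requests  # assumption: an OpenAI-compatible streaming completions endpoint

BASE_URL = "http://localhost:8000/v1/completions"  # hypothetical route

def sample_request():
    # Assumption: in practice, draw these from your production logs.
    return {
        "model": "my-model",  # placeholder model name
        "prompt": "x " * random.choice([200, 800, 3000]),
        "max_tokens": random.choice([64, 256, 1024]),
        "stream": True,
    }

def measure_one():
    start = time.perf_counter()
    chunks = []
    with requests.post(BASE_URL, json=sample_request(), stream=True) as r:
        for line in r.iter_lines():
            if line:  # skip keep-alive blanks
                chunks.append(time.perf_counter())
    ttft = chunks[0] - start  # time to first streamed chunk
    tpot = ((chunks[-1] - chunks[0]) / (len(chunks) - 1)) if len(chunks) > 1 else 0.0
    return ttft, tpot

# Serial driver for illustration only; real benchmarks run concurrent workers.
ttfts = sorted(measure_one()[0] for _ in range(50))
print("TTFT p50 / p95:", ttfts[len(ttfts) // 2], ttfts[int(0.95 * len(ttfts))])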

Understand KV cache pressure

During generation, each request accumulates key and value tensors for prior tokens. That KV cache grows with:

  • batch size
  • sequence length
  • number of layers
  • hidden size and heads
  • precision
  • number of active sessions

A rough mental model:

KV cache memory ~= tokens * layers * kv_heads * head_dim * 2 * bytes_per_element

The factor of 2 counts keys and values, and bytes_per_element follows your serving precision. The exact formula depends on model architecture, but the lesson is stable: long-context serving is often memory-bound before it is compute-bound.
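
To make the formula concrete, here is a small calculator with illustrative numbers for a hypothetical 8B-class model using grouped-query attention (32 layers, 8 KV heads, head_dim 128, fp16); your model's config will differ.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_element=2):
    # factor of 2: one set of tensors for keys, one for values
    # bytes_per_element=2 corresponds to fp16/bf16
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_element

per_token = kv_cache_bytes(1, 32, 8, 128)          # 131,072 B = 128 KiB per token
one_8k_session = kv_cache_bytes(8192, 32, 8, 128)  # exactly 1 GiB per 8k session
print(f"{per_token / 2**10:.0f} KiB/token, {one_8k_session / 2**30:.1f} GiB per 8k session")

At those illustrative numbers, forty concurrent 8k-token sessions consume about 40 GiB of KV cache before weights are even counted.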

PagedAttention, used by vLLM, treats KV cache more like paged virtual memory. Instead of reserving one large contiguous block per request, it manages blocks to reduce fragmentation and improve batching.
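
A toy sketch of the block-pool idea (in the spirit of PagedAttention, not vLLM's actual implementation): requests claim fixed-size token blocks from a shared free list instead of contiguous per-request reservations.

class BlockPool:
    # Toy block-pool allocator; block ids stand in for GPU memory blocks.
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared free list of block ids
        self.tables = {}                     # request id -> allocated block ids
        self.tokens = {}                     # request id -> token count

    def append_token(self, req_id):
        n = self.tokens.get(req_id, 0)
        if n % self.block_size == 0:  # current block full, or first token
            if not self.free:
                raise MemoryError("KV pool exhausted: queue or shed the request")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.tokens[req_id] = n + 1

    def release(self, req_id):
        # Finished requests return whole blocks; nothing to defragment.
        self.free.extend(self.tables.pop(req_id, []))
        self.tokens.pop(req_id, None)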

[Figure: KV cache pressure diagram showing memory growth with tokens, layers, KV heads, head dimension, precision, and active sessions]
Long-context serving often becomes memory-bound through KV cache growth before compute is fully saturated.

Pick the parallelism pattern

Large models need parallelism. Use the smallest parallelism that fits the model and throughput target.

Pattern              | What it splits                | Use it when
Tensor parallelism   | Matrix operations across GPUs | Model fits across GPUs in one node or tightly connected nodes
Pipeline parallelism | Layers across stages          | Model is too deep or large for tensor parallel alone
Expert parallelism   | MoE experts across devices    | Serving mixture-of-experts models
Sequence parallelism | Sequence dimension work       | Long-context workloads and specialized kernels

Tensor parallelism within a node is common because NVLink/NVSwitch bandwidth helps. Pipeline parallelism across nodes can work, but bubbles and network latency matter. Expert parallelism is relevant for MoE models where only a subset of experts activates per token.
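
A hedged sketch of "smallest parallelism that fits": pick the smallest power-of-two tensor-parallel degree whose per-GPU weight shard leaves headroom for KV cache. The headroom fraction is a placeholder, and the sketch ignores activations, engine overhead, and interconnect limits.

def min_tp_degree(params, bytes_per_param, gpu_mem_gib, kv_headroom=0.35):
    # Smallest power-of-two TP degree whose weight shard fits the budget,
    # reserving kv_headroom of GPU memory for KV cache and overhead.
    budget = gpu_mem_gib * 2**30 * (1 - kv_headroom)
    tp = 1
    while params * bytes_per_param / tp > budget:
        tp *= 2
    return tp

# Hypothetical: 70e9 params in fp16 on 80 GiB GPUs -> 4
print(min_tp_degree(70e9, 2, 80))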

Autoscale on the right signal

CPU utilization is the wrong primary signal for LLM serving.

Use:

  • queue depth by model route
  • waiting time before prefill
  • TTFT p95
  • TPOT p95 (time per output token)
  • GPU memory headroom
  • KV cache utilization
  • active sequences
  • tokens per second

Cold start is brutal for large models. Loading 70B-class weights can take tens of seconds depending on storage, network, framework, quantization, and engine build. Scaling from zero is usually not acceptable for interactive routes.

Use warm pools:

min replicas: enough for baseline traffic
warm standby: loaded model, not fully saturated
scale out: based on queue and TTFT
scale in: slow, with drain period
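
A sketch of a scale decision driven by these signals rather than CPU; every threshold below is a placeholder to tune per route.

from dataclasses import dataclass

@dataclass
class RouteSignals:
    queue_depth: int       # requests waiting before prefill
    ttft_p95_s: float      # seconds
    kv_utilization: float  # 0.0 - 1.0

def desired_replicas(current, s, lo=2, hi=16):
    # Scale out on queue pressure, TTFT breach, or KV saturation.
    if s.queue_depth > 20 or s.ttft_p95_s > 1.5 or s.kv_utilization > 0.90:
        return min(hi, current + max(1, current // 2))
    # Scale in slowly, one replica at a time, only when clearly idle;
    # the orchestrator should drain in-flight sequences before termination.
    if s.queue_depth == 0 and s.ttft_p95_s < 0.4 and s.kv_utilization < 0.50:
        return max(lo, current - 1)
    return current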

For self-hosted fleets, model loading is part of capacity planning. A pod that exists but has not loaded weights is not capacity.

Route with KV locality

Round-robin is simple. It is also often wrong for chat.

If turn one of a conversation lands on worker A, worker A may hold useful prefix cache for turn two. Sending turn two to worker B forces a cold prefill. KV-aware routing scores workers by:

  • prompt prefix overlap
  • session affinity
  • current queue
  • memory pressure
  • expected output length
  • tenant priority

A basic scoring function:

score(worker) =
  0.45 * prefix_cache_match
  - 0.25 * queue_delay
  - 0.20 * kv_pressure
  + 0.10 * tenant_priority

The weights will vary, but the principle is the same: cache locality is a serving signal.
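
The same scoring function as runnable Python; the signal definitions and normalizations are assumptions to calibrate against your own traces.

from dataclasses import dataclass

@dataclass
class WorkerState:
    prefix_cache_match: float  # 0..1: fraction of prompt tokens already cached
    queue_delay: float         # 0..1: normalized expected wait before prefill
    kv_pressure: float         # 0..1: KV cache utilization
    tenant_priority: float     # 0..1: this tenant's priority on this worker

def score(w):
    return (0.45 * w.prefix_cache_match
            - 0.25 * w.queue_delay
            - 0.20 * w.kv_pressure
            + 0.10 * w.tenant_priority)

def pick_worker(workers):
    # workers: dict of worker id -> WorkerState; highest score wins
    return max(workers, key=lambda wid: score(workers[wid]))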

Multi-tenant serving

Tenant isolation needs more than auth.

At inference time, enforce:

  • per-tenant rate limits
  • max input tokens
  • max output tokens
  • max concurrent requests
  • model allowlists
  • LoRA adapter allowlists
  • prompt namespace isolation
  • cache namespace isolation
  • trace and cost separation

For fine-tuned tenant adapters, multi-LoRA serving can improve utilization by serving many adapters on shared base weights. But adapter routing must be strict. One tenant should never receive another tenant’s adapter or cached output.
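
A minimal admission sketch for these per-tenant limits, including a strict adapter allowlist. Field and function names are illustrative, not any particular gateway's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class TenantPolicy:
    max_input_tokens: int
    max_output_tokens: int
    max_concurrent: int
    models: frozenset    # model allowlist
    adapters: frozenset  # LoRA adapter allowlist

def admit(policy, req, in_flight):
    # Enforce limits before the request touches a GPU; fail closed on adapters.
    if req["model"] not in policy.models:
        return "reject: model not in allowlist"
    adapter = req.get("adapter")
    if adapter is not None and adapter not in policy.adapters:
        return "reject: adapter not in tenant allowlist"
    if req["input_tokens"] > policy.max_input_tokens:
        return "reject: input too long"
    if req["max_output_tokens"] > policy.max_output_tokens:
        return "reject: output cap exceeded"
    if in_flight >= policy.max_concurrent:
        return "queue: tenant concurrency limit"
    # Cache keys should also be namespaced by tenant id so cached
    # outputs and prefixes never cross tenants.
    return "admit"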

Failure modes to test

Failure                            | Test
One worker OOMs under long context | Verify admission control rejects or routes before OOM
Provider route slows down          | Verify fallback and circuit breaker
KV cache fills                     | Verify scheduler sheds low-priority or long-output traffic
New model rollout regresses TPOT   | Canary by route and compare token metrics
Tenant spikes traffic              | Verify quota protects other tenants

The serving layer is successful when overload is boring. Requests should queue, degrade, or fail intentionally instead of taking the fleet down.
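
For the "KV cache fills" row, a sketch of the kind of test to write, reusing the toy BlockPool from the KV cache section above (pytest-style, against your own pool abstraction in practice):

def test_kv_pool_sheds_instead_of_crashing():
    pool = BlockPool(num_blocks=4, block_size=16)  # 64-token capacity
    for _ in range(64):                            # one long request fills the pool
        pool.append_token("req-long")
    try:
        pool.append_token("req-new")               # next request must be shed cleanly
        assert False, "expected MemoryError, got silent allocation"
    except MemoryError:
        pass
    pool.release("req-long")                       # freed capacity unblocks queued work
    pool.append_token("req-new")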

Sources and receipts