Engineering Tokens, Models, and AI Apps at Cloud Scale
Working notes on AI inference, LLMs, agents, cloud systems, GPUs, and the runtime stack that turns model weights into reliable, predictable services at scale.
Inference Engines
NVIDIA TensorRT-LLM, Dynamo, and NIM; vLLM; SGLang — comparing throughput, latency, batching, and memory under realistic load.
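To make those comparisons concrete, here is a minimal, engine-agnostic sketch of the measurement harness: it treats any client (vLLM, TensorRT-LLM, SGLang, a NIM endpoint) as a token stream and reports time-to-first-token and steady-state decode rate. The `fake_engine` stand-in is illustrative, not any real client API.

```python
# Measure TTFT and decode throughput for any token-streaming client.
import time
from typing import Iterable, Iterator

def measure(stream: Iterable[str]) -> dict:
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # first token marks end of prefill
        n_tokens += 1
    total = time.perf_counter() - start
    decode_time = total - (ttft or 0.0)
    return {
        "ttft_s": ttft,
        "tokens": n_tokens,
        # steady-state rate: tokens after the first, over the decode phase
        "decode_tok_per_s": (n_tokens - 1) / decode_time if decode_time > 0 else 0.0,
    }

def fake_engine(n: int = 64, delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a real engine stream: fixed prefill, fixed inter-token latency."""
    time.sleep(0.05)                 # pretend prefill
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

if __name__ == "__main__":
    print(measure(fake_engine()))
```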
LLM Internals
Attention, KV caches, speculative decoding, quantization (FP8, INT4, AWQ, GPTQ), and how they trade accuracy for speed.
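A quick way to see the accuracy-for-speed trade is the KV cache arithmetic. A back-of-envelope sketch, assuming illustrative Llama-3-70B-style dimensions (80 layers, 8 KV heads via GQA, head_dim 128); the numbers are for orientation, not a measurement:

```python
# KV cache size: why long contexts eat GPU memory.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V tensors per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

print(f"{kv_cache_bytes(1) / 1024:.0f} KiB per token")         # 640 KiB at FP16
print(f"{kv_cache_bytes(8192) / 2**30:.1f} GiB at 8k context")  # ~5.0 GiB per sequence
# An FP8 KV cache (dtype_bytes=1) halves both numbers, which is one
# reason quantization lives in the serving stack, not just in the weights.
```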
GPU Systems
CUDA, NCCL, Triton kernels, NVLink and NVSwitch topology — and what actually limits throughput when scaling past one H100/H200/B200 box.
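What "actually limits throughput" is often just memory bandwidth. A roofline-style sketch with illustrative H100-class numbers, not measured ones:

```python
# At batch size 1, decode is memory-bandwidth bound: every step streams
# all weights once, so the token rate is capped by bandwidth / weight bytes.
HBM_GBPS = 3350        # illustrative H100-SXM-class HBM bandwidth, GB/s
PARAMS_B = 70          # 70B parameters
BYTES_PER_PARAM = 1    # FP8 weights

weight_gb = PARAMS_B * BYTES_PER_PARAM
print(f"~{HBM_GBPS / weight_gb:.0f} tok/s ceiling per sequence at batch 1")  # ~48
# Batching amortizes the weight traffic across sequences, which is why
# throughput climbs with batch size until compute or KV reads bind.
```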
Tokenomics
Tokens per second, throughput per dollar, GPU utilization, and the cost and capacity math behind production inference.
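The core identity behind most of that math, as a small sketch (all inputs are illustrative):

```python
# Cost per million tokens from GPU price, sustained throughput, and utilization.
def usd_per_million_tokens(gpu_usd_per_hour, tokens_per_second, utilization):
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# e.g. a $4/hr GPU sustaining 2,000 tok/s at 60% useful utilization
print(f"${usd_per_million_tokens(4.0, 2000, 0.6):.2f} / 1M tokens")  # ~$0.93
```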
Agentic Systems
Tool use, function calling, planning, multi-agent orchestration, and the runtime patterns for production agents on top of LLM inference.
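The common runtime pattern, reduced to a sketch: a propose-execute-observe loop with a hard step budget so a confused model cannot spin forever. `fake_llm` and the `TOOLS` table are placeholders with an assumed reply shape, not a real function-calling API.

```python
import json

TOOLS = {"search": lambda q: f"results for {q!r}"}   # hypothetical tool

def run_agent(llm, user_msg, max_steps=8):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):                    # hard budget: the seatbelt
        reply = llm(messages)                     # assumed reply shape, see fake_llm
        if reply.get("tool_call") is None:
            return reply["content"]               # final answer, loop ends
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])  # authz checks go here
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted: possible infinite loop")

def fake_llm(messages):
    # toy policy: call the tool once, then answer from its output
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "search", "arguments": {"q": "H100 specs"}}}
    return {"tool_call": None, "content": "answer based on " + messages[-1]["content"]}

print(run_agent(fake_llm, "look something up"))
```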
The Archive
Earlier work on cloud architecture, Kubernetes, Go, and system design. URLs preserved.
Recent writing
Production LLM Systems Tutorial 1: End-to-End Application Design
A practical tutorial for designing an end-to-end LLM application with gateway, orchestration, retrieval, tools, inference, caching, telemetry, and failure handling.
Production LLM Systems Tutorial 2: Latency, Cost, and Quality
A practical tutorial on the latency, cost, and quality trade-offs behind model routing, caching, batching, quantization, speculative decoding, and prompt compression.
Production LLM Systems Tutorial 3: Scalable Inference Architecture
A tutorial on scalable LLM inference with vLLM, TensorRT-LLM, SGLang, KV cache management, parallelism, autoscaling, routing, and multi-tenant serving.
Latest notes
17 more posts
- 2026 · ai · Production LLM Systems Tutorial 4: RAG and Data Pipelines. A practical tutorial for building a production RAG pipeline with ingestion, chunking, embeddings, hybrid search, reranking, metadata filters, and index versioning.
- 2026 · ai · Production LLM Systems Tutorial 5: Monitoring and Observability. A tutorial for monitoring LLM applications across system metrics, quality signals, drift, tracing, privacy, and cost attribution.
- 2026 · ai · Production LLM Systems Tutorial 6: Evaluation and A/B Testing. A tutorial for building offline evals, online experiments, regression gates, judge calibration, RAGAS-style metrics, and release workflows for LLM systems.
- 2026 · ai · Production LLM Systems Tutorial 7: Security and Prompt Injection. A practical tutorial for defending LLM systems against direct and indirect prompt injection, data exfiltration, unsafe tool calls, and privilege escalation.
- 2026 · ai · Production LLM Systems Tutorial 8: Human-in-the-Loop Workflows. A tutorial for designing human-in-the-loop LLM workflows with confidence routing, escalation queues, review UX, active learning, and approval gates.
- 2026 · ai · Production LLM Systems Tutorial 9: Cost Optimization. A tutorial on reducing LLM application cost with routing, caching, prompt budgeting, batch processing, quantization, attribution, and token guardrails.
- 2026 · ai · Production LLM Systems Tutorial 10: Versioning and Disaster Recovery. A tutorial for versioning models, prompts, embeddings, retrieval indexes, tools, and policies while designing fallback, rollback, and graceful degradation for LLM systems.
- 2026 · ai · Agents Need Seatbelts: Guardrails and Infinite-Loop Detection for Tool-Using AI. A systems guide to agent guardrails, tool authorization, loop detection, budgets, recursion limits, no-progress checks, and production safety controls.
- 2026 · ai · Your Token Bill Has a Leak: Cost Monitoring for Hidden LLM Waste. A production guide to hidden token leaks: retries, tool loops, unused RAG context, schema bloat, cached-token accounting, reasoning tokens, and cost observability.
- 2026 · ai · Reduce LLM Inference Cost by 60% Without Serving Stale Answers. A production architecture for semantic response caching, freshness gates, prompt/KV reuse, and cache-aware routing when users ask the same questions in different words (a minimal sketch of the pattern follows this list).
- 2026 · ai · Why Agentic AI Is Bringing CPUs Back Into the Spotlight. Why agentic AI makes CPUs more important again, and why modern CPU+GPU platforms are starting to look like the natural shape of the AI cloud.
- 2026 · startups · YC's 2026 Startup Map: AI Has Left the Chatbox. A talent and market-trends read on Y Combinator's latest Requests for Startups: AI-native services, company brains, agent software, chips, hardware, medicine, agriculture, space, and enterprise workflows.
- 2026 · ai · Your RAG Demo Passed. Your RAG System Needs a Judge: RAGAS, Humans, and Evidence. A deep guide to evaluating RAG systems with RAGAS, retrieval metrics, faithfulness checks, human evals, golden datasets, and production feedback loops.
- 2026 · ai · Agentic AI Needs Smarter Inference: Hints, Priority, and Cache Lifecycle. Why agentic workloads need inference runtimes that understand priority, expected output length, speculative prefill, and KV cache lifecycle instead of treating every request the same.
- 2026 · ai · Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production. A practical comparison of speculative decoding and quantization: what each optimizes, where each fails, how they interact, and what to measure before rollout.
- 2026 · ai · KV Cache at Fleet Scale: The Memory System Hiding Inside Every LLM Platform. A deep systems guide to KV cache capacity, PagedAttention, prefix reuse, eviction, offload, routing, and fleet-level cache management for LLM inference.
- 2026 · ai · The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You. A production guide to prompt caching, context caching, semantic response caching, exact caching, and the tradeoffs that decide latency, cost, freshness, and correctness.
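As a taste of the caching posts above, a minimal sketch of semantic response caching with a freshness gate: reuse an answer only when a new query is close enough in embedding space and the entry is still fresh. The `embed` toy here stands in for a real embedding model and ANN index.

```python
import math, time

def embed(text):                   # toy stand-in for a real embedding model
    v = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - 97] += 1.0
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

CACHE = []                         # entries: (embedding, answer, stored_at)

def lookup(query, threshold=0.95, ttl_s=600):
    q = embed(query)
    for e, answer, stored_at in CACHE:
        # hit only if semantically close AND still fresh
        if cosine(q, e) >= threshold and time.time() - stored_at < ttl_s:
            return answer
    return None                    # miss: fall through to real inference

def store(query, answer):
    CACHE.append((embed(query), answer, time.time()))

store("what is the capital of France", "Paris")
print(lookup("capital of France, what is it"))   # same question, different words
```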
From the Archive
43 posts
Earlier writing on cloud architecture, Kubernetes, Go, and system design — kept here for reference.