Engineering Tokens, Models, and AI Apps at Cloud Scale
Working notes on AI inference, LLMs, agents, cloud systems, GPUs, and the runtime stack that turns model weights into reliable, predictable services at scale.
Inference Engines
NVIDIA TensorRT-LLM, Dynamo, and NIM, plus vLLM and SGLang — comparing throughput, latency, batching, and memory behavior under realistic load.
LLM Internals
Attention, KV caches, speculative decoding, quantization (FP8, INT4, AWQ, GPTQ), and how they trade accuracy for speed.
GPU Systems
CUDA, NCCL, Triton kernels, NVLink and NVSwitch topology — and what actually limits throughput when scaling past one H100/H200/B200 box.
Tokenomics
Tokens per second, throughput per dollar, GPU utilization, and the cost and capacity math behind production inference.
Agentic Systems
Tool use, function calling, planning, multi-agent orchestration, and the runtime patterns for production agents on top of LLM inference.
The Archive
Earlier work on cloud architecture, Kubernetes, Go, and system design. URLs preserved.
Recent writing
Agents Need Seatbelts: Guardrails and Infinite-Loop Detection for Tool-Using AI
A systems guide to agent guardrails, tool authorization, loop detection, budgets, recursion limits, no-progress checks, and production safety controls.
Your Token Bill Has a Leak: Cost Monitoring for Hidden LLM Waste
A production guide to hidden token leaks: retries, tool loops, unused RAG context, schema bloat, cached-token accounting, reasoning tokens, and cost observability.
Reduce LLM Inference Cost by 60% Without Serving Stale Answers
A production architecture for semantic response caching, freshness gates, prompt/KV reuse, and cache-aware routing when users ask the same questions in different words.
Latest notes
17 more posts
- 2026 · ai · Why Agentic AI Is Bringing CPUs Back Into the Spotlight
  Why agentic AI makes CPUs more important again, and why modern CPU+GPU platforms are starting to look like the natural shape of the AI cloud.
- 2026 · startups · YC's 2026 Startup Map: AI Has Left the Chatbox
  A talent and market-trends read on Y Combinator's latest Requests for Startups: AI-native services, company brains, agent software, chips, hardware, medicine, agriculture, space, and enterprise workflows.
- 2026 · ai · Your RAG Demo Passed. Your RAG System Needs a Judge: RAGAS, Humans, and Evidence
  A deep guide to evaluating RAG systems with RAGAS, retrieval metrics, faithfulness checks, human evals, golden datasets, and production feedback loops.
- 2026 · ai · Agentic AI Needs Smarter Inference: Hints, Priority, and Cache Lifecycle
  Why agentic workloads need inference runtimes that understand priority, expected output length, speculative prefill, and KV cache lifecycle instead of treating every request the same.
- 2026 · ai · Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production
  A practical comparison of speculative decoding and quantization: what each optimizes, where each fails, how they interact, and what to measure before rollout.
- 2026 · ai · KV Cache at Fleet Scale: The Memory System Hiding Inside Every LLM Platform
  A deep systems guide to KV cache capacity, PagedAttention, prefix reuse, eviction, offload, routing, and fleet-level cache management for LLM inference.
- 2026 · ai · The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You
  A production guide to prompt caching, context caching, semantic response caching, exact caching, and the tradeoffs that decide latency, cost, freshness, and correctness.
- 2026 · ai · Autoscaling LLMs by TTFT and TPOT, Not CPU Utilization
  Why LLM serving needs autoscaling based on first-token and per-token latency, and how Dynamo Planner points toward SLO-aware capacity control.
- 2026 · inference · From H100 to Blackwell: What Actually Changes for Inference Architects
  A practical architecture view of the shift from H100/H200 to Blackwell: memory, precision, NVLink scale-up, MoE, software, and cost per token.
- 2026 · inference · Speculative Decoding in Production: When Draft Tokens Help and When They Hurt
  A practical guide to speculative decoding: why it speeds up autoregressive generation, how to measure acceptance rate, and when it creates operational complexity.
- 2026 · ai · From Prefill to Decode: Disaggregated Inference as a Distributed Systems Problem
  Why splitting prefill and decode can improve LLM serving, and why the real challenge is KV transfer, topology, scheduling, and SLO-aware operation.
- 2026 · rust · The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation
  Why an AI gateway sits on a systems boundary where Rust's ownership, async, cancellation, and no-GC profile become practical advantages.
- 2026 · ai · Why Round-Robin Dies in LLM Serving: KV-Aware Routing Explained
  A deep but practical explanation of why LLM routing must account for KV cache overlap, prefill cost, decode load, and SLO risk instead of simply rotating requests across workers.
- 2026 · talent-management · What AI-Native Talent Looks Like in 2026: A Recruiter's Field Guide
  A practical guide for talent leaders on identifying AI-native candidates: workflow literacy, judgment, AI-assisted execution, and human skills that still matter.
- 2026 · inference · TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production
  A practical comparison of TensorRT-LLM, vLLM, and SGLang across performance, portability, structured generation, cache reuse, deployment, and operations.
- 2026 · ai · Dynamo Is Not an Inference Engine. It Is the Control Plane for Tokens
  Why NVIDIA Dynamo is best understood as the distributed control plane around LLM inference engines, not as another engine competing with vLLM, SGLang, or TensorRT-LLM.
- 2025 · talent-acquisition · The AI Hiring Playbook for 2026: Skills, Signals, and Fewer Shiny Job Titles
  A practical talent-acquisition playbook for hiring in the AI era: skills-first scorecards, better work samples, recruiter judgment, and fewer inflated AI job titles.
From the Archive
43 posts
Earlier writing on cloud architecture, Kubernetes, Go, and system design — kept here for reference.