Engineering Tokens, Models, and AI Apps at Cloud Scale
Working notes on AI inference, LLMs, agents, cloud systems, GPUs, and the runtime stack that turns model weights into reliable, predictable services at scale.
Inference Engines
NVIDIA TensorRT-LLM, Dynamo, and NIM, plus vLLM and SGLang — comparing throughput, latency, batching, and memory under realistic load.
LLM Internals
Attention, KV caches, speculative decoding, quantization (FP8, INT4, AWQ, GPTQ), and how they trade accuracy for speed.
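A quick back-of-envelope shows why these trade-offs matter: KV cache size scales linearly with layers, KV heads, head dimension, sequence length, batch size, and bytes per value, so precision alone can halve or quarter it. The dimensions below are an assumed 70B-class GQA shape, not any specific model's published config.

```python
# Back-of-envelope KV cache sizing; all dimensions here are assumptions.
layers, kv_heads, head_dim = 80, 8, 128   # assumed 70B-class GQA shape
bytes_per_value = 2                        # FP16/BF16; 1 for FP8, 0.5 for INT4
seq_len, batch = 8192, 32

# The factor of 2 covers keys and values.
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")  # ~80 GiB at FP16, ~40 GiB at FP8
```

At FP16 this one batch already fills an entire 80 GB H100, which is why dropping to FP8 or INT4 buys real capacity, not just marginal savings.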
GPU Systems
CUDA, NCCL, Triton kernels, NVLink and NVSwitch topology — and what actually limits throughput when scaling past one H100/H200/B200 box.
Tokenomics
Tokens per second, throughput per dollar, GPU utilization, and the cost and capacity math behind production inference.
Agentic Systems
Tool use, function calling, planning, multi-agent orchestration, and the runtime patterns for production agents on top of LLM inference.
The Archive
Earlier work on cloud architecture, Kubernetes, Go, and system design. URLs preserved.
Recent writing
Reduce LLM Inference Cost by 60% Without Serving Stale Answers
A production architecture for semantic response caching, freshness gates, prompt/KV reuse, and cache-aware routing when users ask the same questions in different words.
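The post covers the full architecture; as a minimal sketch of the central mechanism, here is a cosine-similarity lookup with a TTL freshness gate. The injected `embed` callable, the 0.92 threshold, and the 300-second TTL are all illustrative placeholders, not the post's actual values.

```python
import time
from typing import Callable
import numpy as np

class SemanticCache:
    """Serve cached answers for questions asked in different words.

    Minimal sketch: cosine similarity over normalized embeddings plus a
    TTL-based freshness gate. Threshold and TTL are illustrative.
    """

    def __init__(self, embed: Callable[[str], np.ndarray],
                 threshold: float = 0.92, ttl_seconds: float = 300.0):
        self.embed = embed
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.entries: list[tuple[np.ndarray, str, float]] = []  # (embedding, answer, created_at)

    def _normalized(self, text: str) -> np.ndarray:
        v = self.embed(text)
        return v / np.linalg.norm(v)

    def lookup(self, query: str) -> str | None:
        q = self._normalized(query)
        now = time.time()
        best_score, best_answer = 0.0, None
        for emb, answer, created in self.entries:
            if now - created > self.ttl:   # freshness gate: never serve stale hits
                continue
            score = float(emb @ q)         # cosine similarity on unit vectors
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def insert(self, query: str, answer: str) -> None:
        self.entries.append((self._normalized(query), answer, time.time()))
```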
Why Agentic AI Is Bringing CPUs Back Into the Spotlight
Why agentic AI makes CPUs more important again, and why modern CPU+GPU platforms are starting to look like the natural shape of the AI cloud.
YC's 2026 Startup Map: AI Has Left the Chatbox
A talent and market-trends read on Y Combinator's latest Requests for Startups: AI-native services, company brains, agent software, chips, hardware, medicine, agriculture, space, and enterprise workflows.
Latest notes
April 2026 1 post
March 2026 2 posts
- ai: Autoscaling LLMs by TTFT and TPOT, Not CPU Utilization — Why LLM serving needs autoscaling based on first-token and per-token latency, and how Dynamo Planner points toward SLO-aware capacity control. A minimal scaling sketch follows this month's list.
- inference: From H100 to Blackwell: What Actually Changes for Inference Architects — A practical architecture view of the shift from H100/H200 to Blackwell: memory, precision, NVLink scale-up, MoE, software, and cost per token.
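As a concrete illustration of the autoscaling post above, here is a toy replica-count policy driven by TTFT and TPOT p95s against their SLO targets. This is not Dynamo Planner's actual algorithm; the proportional rule and thresholds are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LatencySLO:
    ttft_p95_ms: float  # time-to-first-token target
    tpot_p95_ms: float  # time-per-output-token target

def desired_replicas(current: int, ttft_p95_ms: float, tpot_p95_ms: float,
                     slo: LatencySLO) -> int:
    """Toy SLO-pressure policy: scale on the worse latency signal, not CPU."""
    pressure = max(ttft_p95_ms / slo.ttft_p95_ms,   # prefill-bound pressure
                   tpot_p95_ms / slo.tpot_p95_ms)   # decode-bound pressure
    if pressure > 1.0:                # violating: scale out proportionally
        return max(current + 1, round(current * pressure))
    if pressure < 0.5:                # far under target: scale in one step
        return max(1, current - 1)
    return current

# 6 replicas, TTFT p95 at 450ms against a 300ms target, TPOT fine -> scale to 9.
slo = LatencySLO(ttft_p95_ms=300, tpot_p95_ms=50)
print(desired_replicas(6, ttft_p95_ms=450, tpot_p95_ms=40, slo=slo))
```

Taking the max of the two pressures matters: a prefill-heavy burst can blow TTFT while TPOT and CPU both look healthy.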
February 2026 3 posts
- inference: Speculative Decoding in Production: When Draft Tokens Help and When They Hurt — A practical guide to speculative decoding: why it speeds up autoregressive generation, how to measure acceptance rate, and when it creates operational complexity. The worked acceptance-rate math follows this month's list.
- ai: From Prefill to Decode: Disaggregated Inference as a Distributed Systems Problem — Why splitting prefill and decode can improve LLM serving, and why the real challenge is KV transfer, topology, scheduling, and SLO-aware operation.
- rust: The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation — Why an AI gateway sits on a systems boundary where Rust's ownership, async, cancellation, and no-GC profile become practical advantages.
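For the speculative decoding post above: under the standard i.i.d.-acceptance model (Leviathan et al., 2023), a target-model pass over k draft tokens yields (1 − α^(k+1)) / (1 − α) tokens in expectation, where α is the per-token acceptance rate. A few lines make the trade-off concrete; whether that expectation translates into wall-clock speedup still depends on the draft model's own cost.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens accepted per target-model pass with k draft tokens.

    Standard speculative-decoding expectation, assuming each draft token
    is accepted independently with rate alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, k=4):.2f} tokens/step")
```

At α = 0.5 you get under 2 tokens per step, so draft overhead can easily eat the gain; at α = 0.9 the same k = 4 yields about 4.1 tokens per step.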
January 2026 4 posts
- ai: Why Round-Robin Dies in LLM Serving: KV-Aware Routing Explained — A deep but practical explanation of why LLM routing must account for KV cache overlap, prefill cost, decode load, and SLO risk instead of simply rotating requests across workers. A toy routing score follows this month's list.
- talent-management: What AI-Native Talent Looks Like in 2026: A Recruiter's Field Guide — A practical guide for talent leaders on identifying AI-native candidates: workflow literacy, judgment, AI-assisted execution, and human skills that still matter.
- inference: TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production — A practical comparison of TensorRT-LLM, vLLM, and SGLang across performance, portability, structured generation, cache reuse, deployment, and operations.
- ai: Dynamo Is Not an Inference Engine. It Is the Control Plane for Tokens — Why NVIDIA Dynamo is best understood as the distributed control plane around LLM inference engines, not as another engine competing with vLLM, SGLang, or TensorRT-LLM.
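To make the round-robin post's argument concrete, here is a toy routing score that rewards cached-prefix overlap (prefill work saved) and penalizes decode load. The weights, worker fields, and scoring rule are illustrative, not any engine's actual policy.

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(prompt_tokens: list[int], workers: list[dict]) -> dict:
    """Toy KV-aware choice: best prefix reuse minus a decode-load penalty."""
    def score(w: dict) -> float:
        overlap = max((shared_prefix_len(prompt_tokens, p)
                       for p in w["cached_prefixes"]), default=0)
        saved_prefill = overlap / max(len(prompt_tokens), 1)  # fraction of prefill skipped
        load_penalty = w["active_decodes"] / w["decode_slots"]
        return saved_prefill - 0.5 * load_penalty              # illustrative weights
    return max(workers, key=score)

workers = [
    {"name": "w0", "cached_prefixes": [[1, 2, 3, 4, 5]], "active_decodes": 30, "decode_slots": 32},
    {"name": "w1", "cached_prefixes": [[1, 2]],          "active_decodes": 8,  "decode_slots": 32},
]
print(pick_worker([1, 2, 3, 4, 5, 6], workers)["name"])  # picks w0 despite its load
```

Round-robin would alternate blindly; even this crude score sends the request where five of six prompt tokens are already cached.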
December 2025 2 posts
- talent-acquisition: The AI Hiring Playbook for 2026: Skills, Signals, and Fewer Shiny Job Titles — A practical talent-acquisition playbook for hiring in the AI era: skills-first scorecards, better work samples, recruiter judgment, and fewer inflated AI job titles.
- rust: Inference Is Not HTTP: The Case for a Purpose-Built Gateway in Rust — A systems case for purpose-built inference gateways: token-aware routing, cache affinity, cancellation-safe streaming, Rust, and GPU-aware operations.
November 2025 2 posts
- inference: KV-Aware Routing: How Cache Locality Changes Load Balancing for LLMs — Why LLM load balancing needs cache locality, prefix reuse, token cost, and runtime state instead of simple round-robin.
- inference: Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second — A practical framework for measuring inference economics: tokens per second, TTFT, TPOT, utilization, quality, energy, and cost per useful token.
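In the spirit of that framework, a few lines of arithmetic show why cost per useful token and raw tokens per second can rank the same fleet differently. Every number below is an assumption; substitute your own fleet's figures.

```python
# Illustrative fleet economics; every number here is an assumption.
gpus = 8
gpu_hour_usd = 4.00                 # assumed blended $/GPU-hour
tok_per_sec_per_gpu = 2500          # assumed aggregate decode throughput per GPU
utilization = 0.60                  # share of capacity serving real traffic
useful_fraction = 0.90              # tokens not lost to retries or rejected output

tokens_per_hour = gpus * tok_per_sec_per_gpu * 3600 * utilization * useful_fraction
cost_per_million = (gpus * gpu_hour_usd) / (tokens_per_hour / 1e6)
print(f"${cost_per_million:.2f} per million useful tokens")  # ~$0.82 with these inputs
```

The benchmark number (20,000 tok/s) never changes, yet halving utilization doubles the cost per useful token, which is the figure the business actually pays.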
October 2025 1 post
September 2025 1 post
August 2025 1 post
From the Archive
43 posts — earlier writing on cloud architecture, Kubernetes, Go, and system design, kept here for reference.