Engineering the AI cloud, one token at a time.
Working notes on AI inference, LLMs, agents, cloud systems, GPUs, and the runtime stack that turns model weights into reliable, predictable services at scale.
Inference Engines
NVIDIA TensorRT-LLM, Dynamo, and NIM alongside vLLM and SGLang — comparing throughput, latency, batching, and memory behavior under realistic load.
LLM Internals
Attention, KV caches, speculative decoding, quantization (FP8, INT4, AWQ, GPTQ), and how they trade accuracy for speed.
GPU Systems
CUDA, NCCL, Triton kernels, NVLink and NVSwitch topology — and what actually limits throughput when scaling past one H100/H200/B200 box.
Tokenomics
Tokens per second, throughput per dollar, GPU utilization, and the cost and capacity math behind production inference.
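As a taste of that math, a minimal sketch in Python; the $4/hr price and 2,500 tok/s throughput below are placeholders, not benchmarks:

    # Illustrative cost-per-token arithmetic; the price and throughput
    # are made-up placeholders, not measured numbers.
    def cost_per_million_tokens(gpu_hourly_usd: float,
                                tokens_per_second: float) -> float:
        """USD per one million output tokens for one GPU at full utilization."""
        tokens_per_hour = tokens_per_second * 3600
        return gpu_hourly_usd / tokens_per_hour * 1_000_000

    print(f"${cost_per_million_tokens(4.0, 2500):.2f} per 1M tokens")  # -> $0.44 per 1M tokens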
Agentic Systems
Tool use, function calling, planning, multi-agent orchestration, and the runtime patterns for production agents on top of LLM inference.
The Archive
Earlier work on cloud architecture, Kubernetes, Go, and system design. URLs preserved.
Recent writing
Why Agentic AI Is Bringing CPUs Back Into the Spotlight
How agentic workloads make CPUs matter again, and why modern CPU+GPU platforms are starting to look like the natural shape of the AI cloud.
YC's 2026 Startup Map: AI Has Left the Chatbox
A talent and market-trends read on Y Combinator's latest Requests for Startups: AI-native services, company brains, agent software, chips, hardware, medicine, agriculture, space, and enterprise workflows.
Agentic AI Needs Smarter Inference: Hints, Priority, and Cache Lifecycle
Why agentic workloads need inference runtimes that understand priority, expected output length, speculative prefill, and KV cache lifecycle instead of treating every request the same.
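A minimal sketch of the per-request metadata this implies; the field names are hypothetical, not any runtime's real API:

    # Hypothetical per-request hints an agent framework could attach;
    # field names are illustrative, not a real serving API.
    from dataclasses import dataclass
    from enum import Enum

    class Priority(Enum):
        INTERACTIVE = 0   # a user is blocked on this step
        BACKGROUND = 1    # tool call or subtask off the critical path

    @dataclass
    class RequestHints:
        priority: Priority
        expected_output_tokens: int  # lets the scheduler budget decode slots
        kv_ttl_seconds: float        # how long the cached prefix stays useful

    hints = RequestHints(Priority.BACKGROUND,
                         expected_output_tokens=64,
                         kv_ttl_seconds=300.0)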
Autoscaling LLMs by TTFT and TPOT, Not CPU Utilization
Why LLM serving needs autoscaling based on first-token and per-token latency, and how Dynamo Planner points toward SLO-aware capacity control.
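A minimal sketch of the control loop's decision, with hypothetical SLO thresholds rather than Dynamo Planner's actual policy:

    # Hypothetical SLO-driven replica count (not Dynamo Planner's real
    # algorithm): scale on observed latency, never on CPU utilization.
    def desired_replicas(current: int, p95_ttft_ms: float, p95_tpot_ms: float,
                         ttft_slo_ms: float = 500.0,
                         tpot_slo_ms: float = 50.0) -> int:
        worst = max(p95_ttft_ms / ttft_slo_ms, p95_tpot_ms / tpot_slo_ms)
        if worst > 1.0:   # violating an SLO: add capacity proportionally
            return current + max(1, round(current * (worst - 1.0)))
        if worst < 0.5:   # far under both SLOs: shed a replica
            return max(1, current - 1)
        return current

    print(desired_replicas(4, p95_ttft_ms=800, p95_tpot_ms=40))  # -> 6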
From H100 to Blackwell: What Actually Changes for Inference Architects
A practical architecture view of the shift from H100/H200 to Blackwell: memory, precision, NVLink scale-up, MoE, software, and cost per token.
Speculative Decoding in Production: When Draft Tokens Help and When They Hurt
A practical guide to speculative decoding: why it speeds up autoregressive generation, how to measure acceptance rate, and when it creates operational complexity.
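The standard back-of-envelope estimate, assuming i.i.d. per-token acceptance; the linear cost model is a rough assumption that ignores verification batching effects:

    # Expected speedup for speculative decoding with draft length k,
    # per-token acceptance rate alpha, and a draft model costing
    # c_draft per token relative to the target model.
    def expected_speedup(alpha: float, k: int, c_draft: float) -> float:
        assert 0 < alpha < 1
        # Expected tokens emitted per verification step, including the
        # target model's own bonus token.
        tokens_per_step = (1 - alpha ** (k + 1)) / (1 - alpha)
        step_cost = 1 + k * c_draft  # one target pass plus k draft passes
        return tokens_per_step / step_cost

    print(f"{expected_speedup(alpha=0.8, k=4, c_draft=0.05):.2f}x")  # -> 2.80x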
From Prefill to Decode: Disaggregated Inference as a Distributed Systems Problem
Why splitting prefill and decode can improve LLM serving, and why the real challenge is KV transfer, topology, scheduling, and SLO-aware operation.
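To see why KV transfer dominates, a back-of-envelope sizing under an assumed 70B-class model shape (80 layers, 8 GQA KV heads, head_dim 128, FP16), not any specific deployment:

    # Bytes of KV cache that must move when prefill and decode
    # run on different GPUs.
    def kv_bytes(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int = 2) -> int:
        # K and V each store tokens x layers x kv_heads x head_dim elements
        return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

    size = kv_bytes(tokens=8192, layers=80, kv_heads=8, head_dim=128)
    print(f"{size / 2**30:.1f} GiB per 8K-token prompt")  # -> 2.5 GiB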
The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation
Why an AI gateway sits on a systems boundary where Rust's ownership, async, cancellation, and no-GC profile become practical advantages.
Why Round-Robin Dies in LLM Serving: KV-Aware Routing Explained
A deep but practical explanation of why LLM routing must account for KV cache overlap, prefill cost, decode load, and SLO risk instead of simply rotating requests across workers.
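A toy scoring rule that captures the idea; the weights and features are illustrative, not any particular router's policy:

    # Toy KV-aware routing score: reward prefix-cache overlap, penalize
    # queued prefill work and decode load.
    from dataclasses import dataclass

    @dataclass
    class Worker:
        cached_prefix_tokens: int   # longest cached prefix match for this request
        queued_prefill_tokens: int  # prefill tokens queued ahead of us
        active_decodes: int         # sequences currently decoding

    def score(w: Worker, prompt_tokens: int) -> float:
        overlap = w.cached_prefix_tokens / max(prompt_tokens, 1)
        return 2.0 * overlap - 1e-4 * w.queued_prefill_tokens - 0.05 * w.active_decodes

    def route(workers: list[Worker], prompt_tokens: int) -> int:
        return max(range(len(workers)), key=lambda i: score(workers[i], prompt_tokens))

    # A lighter-loaded worker still loses to one holding 75% of our prefix in cache.
    print(route([Worker(0, 2000, 8), Worker(1500, 4000, 10)], prompt_tokens=2000))  # -> 1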
What AI-Native Talent Looks Like in 2026: A Recruiter's Field Guide
A practical guide for talent leaders on identifying AI-native candidates: workflow literacy, judgment, AI-assisted execution, and human skills that still matter.
From the Archive
43 posts. Earlier writing on cloud architecture, Kubernetes, Go, and system design — kept here for reference.