Engineering Tokens, Models, and AI Apps at Cloud Scale
Working notes on AI inference, LLMs, agents, cloud systems, GPUs, and the runtime stack that turns model weights into reliable, predictable services at scale.
Inference Engines
NVIDIA TensorRT-LLM and Dynamo, NIM, vLLM, SGLang — comparing throughput, latency, batching, and memory under realistic load.
LLM Internals
Attention, KV caches, speculative decoding, quantization (FP8, INT4, AWQ, GPTQ), and how they trade accuracy for speed.
GPU Systems
CUDA, NCCL, Triton kernels, NVLink and NVSwitch topology — and what actually limits throughput when scaling past one H100/H200/B200 box.
Tokenomics
Tokens per second, throughput per dollar, GPU utilization, and the cost and capacity math behind production inference.
Agentic Systems
Tool use, function calling, planning, multi-agent orchestration, and the runtime patterns for production agents on top of LLM inference.
The Archive
Earlier work on cloud architecture, Kubernetes, Go, and system design. URLs preserved.
Recent writing
21/21 - Below PyTorch: Profiling, Compilation, and CUDA Kernel Optimization
A production-focused guide to profiling, compilation, and CUDA kernel optimization below PyTorch, with architecture, capacity math, failure analysis, and operational controls.
20/21 - The Ground Beneath AI: Linux, Networking, and Storage
A production-focused guide to the Linux, networking, and storage foundations beneath AI systems, with architecture, capacity math, failure analysis, and operational controls.
19/21 - Shipping Models Like Software: CI/CD, MLflow, and Registries
A production-focused guide to shipping models like software: CI/CD, MLflow, and registries, with architecture, capacity math, failure analysis, and operational controls.
Latest notes
17 more posts
- 2026 AI Security 18/21 - Assume the Prompt Is Hostile: Security and Guardrails A production-focused guide to security and guardrails, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Kafka 17/21 - From Kafka to Tokens: Streaming Data and Online Inference A production-focused guide to streaming data and online inference, with architecture, capacity math, failure analysis, and operational controls.
- 2026 MCP 16/21 - Agents Need Infrastructure Too: MCP and Workflow Orchestration A production-focused guide to MCP and workflow orchestration, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Model Routing 15/21 - The Router Is Part of the Model: Routing, Hedging, and Fallback A production-focused guide to routing, hedging, and fallback, with architecture, capacity math, failure analysis, and operational controls.
- 2026 LLM Evaluation 14/21 - Benchmarking Without Lying: Evals, Load Tests, and A/B Experiments A production-focused guide to evals, load tests, and A/B experiments, with architecture, capacity math, failure analysis, and operational controls.
- 2026 AI Observability 13/21 - Can You Debug a Token? Observability for AI Systems A production-focused guide to observability for AI systems, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Prompt Caching 12/21 - Cache the Right Thing: Prompt, Semantic, and Cost-Aware Reuse A production-focused guide to prompt, semantic, and cost-aware cache reuse, with architecture, capacity math, failure analysis, and operational controls.
- 2026 RAG 11/21 - RAG That Survives Production: Embeddings, Retrieval, and Evidence A production-focused guide to production RAG: embeddings, retrieval, and evidence, with architecture, capacity math, failure analysis, and operational controls.
- 2026 LoRA 10/21 - Fine-Tuning Without the Full Bill: LoRA, QLoRA, and PEFT A production-focused guide to fine-tuning with LoRA, QLoRA, and PEFT, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Multi-GPU Inference 9/21 - One Model, Many Accelerators: Multi-GPU and Multi-Node Inference A production-focused guide to multi-GPU and multi-node inference, with architecture, capacity math, failure analysis, and operational controls.
- 2026 NCCL 8/21 - The Fabric Between GPUs: NCCL, InfiniBand, RoCE, and GPUDirect A production-focused guide to the fabric between GPUs: NCCL, InfiniBand, RoCE, and GPUDirect, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Kubernetes 7/21 - Kubernetes Meets GPUs: Containers, Scheduling, and Isolation A production-focused guide to GPUs on Kubernetes: containers, scheduling, and isolation, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Model Serving 6/21 - The Serving Layer: Triton, vLLM, KServe, Ray Serve, and SGLang A production-focused guide to the serving layer: Triton, vLLM, KServe, Ray Serve, and SGLang, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Distributed Training 5/21 - Training Across the Fleet: DDP, FSDP, DeepSpeed, and ZeRO A production-focused guide to distributed training with DDP, FSDP, DeepSpeed, and ZeRO, with architecture, capacity math, failure analysis, and operational controls.
- 2026 KV Cache 4/21 - The Memory of a Conversation: KV, Prefix Reuse, Speculation, and Throughput A production-focused guide to the KV cache, prefix reuse, speculation, and throughput, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Inference Engines 3/21 - The Inference Engine Room: vLLM, TensorRT-LLM, SGLang, and llama.cpp A production-focused guide to inference engines: vLLM, TensorRT-LLM, SGLang, and llama.cpp, with architecture, capacity math, failure analysis, and operational controls.
- 2026 Quantization 2/21 - Smaller Numbers, Faster Models: Quantization and Batching How INT8, FP8, 4-bit formats, static batching, dynamic batching, and continuous batching change memory, kernels, quality, and latency.
From the Archive
43 posts
Earlier writing on cloud architecture, Kubernetes, Go, and system design, kept here for reference.