Engineering Tokens, Models, and AI Apps at Cloud Scale
Working notes on AI inference, LLMs, agents, cloud systems, GPUs, and the runtime stack that turns model weights into reliable, predictable services at scale.
Inference Engines
NVIDIA TensorRT-LLM and Dynamo, NIM, vLLM, SGLang — comparing throughput, latency, batching, and memory under realistic load.
LLM Internals
Attention, KV caches, speculative decoding, quantization (FP8, INT4, AWQ, GPTQ), and how they trade accuracy for speed.
GPU Systems
CUDA, NCCL, Triton kernels, NVLink and NVSwitch topology — and what actually limits throughput when scaling past one H100/H200/B200 box.
Tokenomics
Tokens per second, throughput per dollar, GPU utilization, and the cost and capacity math behind production inference.
Agentic Systems
Tool use, function calling, planning, multi-agent orchestration, and the runtime patterns for production agents on top of LLM inference.
The Archive
Earlier work on cloud architecture, Kubernetes, Go, and system design. URLs preserved.
Recent writing
Prefill-Decode Disaggregation: Two Worker Pools, One Token Stream
A deep guide to separating prefill and decode workers, KV transfer, independent scaling, routing, RDMA, conditional disaggregation, and SLOs.
Chunked Prefill: How to Stop One Long Prompt from Freezing Everyone Else
How chunked prefill divides long prompts, co-schedules them with decode, controls TTFT and TPOT interference, and chooses token budgets.
Continuous Batching: The GPU Schedule That Never Stands Still
How iteration-level scheduling adds and removes LLM requests between decode steps, improving utilization while preserving fairness and latency.
Latest notes
17 more posts
- 2026 streaming Streaming Generation: The First Token Is a Product Decision
  How to design token streaming from model to browser with SSE or gRPC, buffering, cancellation, backpressure, usage accounting, and terminal events.
- 2026 memory-offloading Memory Offloading: Trading Bandwidth for Capacity
  A systems guide to moving weights and KV cache between GPU, CPU, and NVMe, including prefetch, pinning, overlap, and latency tradeoffs.
- 2026 dynamic-batching Dynamic Batching: Waiting Microseconds to Save Milliseconds
  How dynamic batchers collect live requests, select batch sizes, manage queue delay, priorities, shapes, and throughput-latency tradeoffs.
- 2026 onnx Graph Optimization: Teaching ONNX and TensorRT to See the Whole Model
  How constant folding, node elimination, fusion, layout selection, precision propagation, and engine building optimize inference graphs.
- 2026 sequence-parallelism Sequence Parallelism: Divide the Tokens, Not the Meaning
  A careful guide to sequence parallelism, its relationship to tensor and context parallelism, communication patterns, and long-context inference.
- 2026 pipeline-parallelism Pipeline Parallelism: Turning Model Depth into an Assembly Line
  How pipeline parallelism splits transformer layers across GPUs, where pipeline bubbles come from, and when it helps or hurts LLM inference.
- 2026 tensor-parallelism Tensor Parallelism: Splitting One Layer Across Many GPUs
  A rigorous but approachable guide to column and row tensor parallelism, collectives, topology, latency, and production sizing for LLM inference.
- 2026 quantization Quantized Kernels: Why a 4-Bit Model Is Not Automatically Fast
  How packing, dequantization, scaling, fused GEMM, group size, and hardware support determine whether quantized LLMs deliver real speedups.
- 2026 mixed-precision Mixed Precision Inference: Spend Bits Where They Matter
  A practical mental model for FP32, BF16, FP16, FP8, FP4, accumulation precision, scaling, calibration, and mixed-precision validation.
- 2026 parallel-decoding Parallel Decoding: Predicting More Than One Future at a Time
  A clear guide to multi-token heads, candidate trees, Medusa-style verification, and the difference between parallel and speculative decoding.
- 2026 operations The Human Control Plane: What a VP Operations Must Build in an AI Company
  A practical operating guide for the VP Operations role in an AI company: cadence, metrics, talent, cost discipline, customer trust, governance, and execution at speed.
- 2026 early-exit Early Exit Decoding: Stop Computing Once the Answer Is Clear
  How early-exit language models use intermediate layers, confidence, and full-model verification to reduce inference work without hiding quality risk.
- 2026 batch-inference Batch Inference: When Throughput Matters More Than Immediacy
  A practical guide to offline LLM batch inference, job design, bucketing, retries, idempotency, cost, and throughput-oriented scheduling.
- 2026 pagedattention PagedAttention: Virtual Memory for the KV Cache
  How PagedAttention applies block-based virtual-memory ideas to dynamic KV caches, improving utilization, sharing, and serving throughput.
- 2026 flashattention FlashAttention: Why Moving Fewer Bytes Beats Doing Fewer FLOPs
  An intuitive and rigorous explanation of FlashAttention, tiling, online softmax, IO awareness, exactness, and practical deployment limits.
- 2026 speculative-decoding Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge
  How speculative decoding accelerates autoregressive generation while preserving the target model distribution, with acceptance math and production tradeoffs.
- 2026 kv-cache KV Caching: The Memory That Makes Token Generation Possible
  A beginner-friendly and technically deep guide to KV caching, memory sizing, prefix reuse, eviction, isolation, and production metrics.
From the Archive
43 posts
Earlier writing on cloud architecture, Kubernetes, Go, and system design, kept here for reference.