Engineering AI inference from kernel to cluster
Working notes on AI inference and agents — tokens, tokenomics, NVIDIA GPUs, and the runtime stack that turns model weights into reliable, predictable systems at scale.
Inference Engines
NVIDIA TensorRT-LLM, Dynamo, and NIM alongside vLLM and SGLang — comparing throughput, latency, batching, and memory use under realistic load.
LLM Internals
Attention, KV caches, speculative decoding, quantization (FP8, INT4, AWQ, GPTQ), and how they trade accuracy for speed.
GPU Systems
CUDA, NCCL, Triton kernels, NVLink and NVSwitch topology — and what actually limits throughput when scaling past one H100/H200/B200 box.
Tokenomics
Tokens per second, throughput per dollar, GPU utilization, and the cost and capacity math behind production inference.
Agentic Systems
Tool use, function calling, planning, multi-agent orchestration, and the runtime patterns for production agents on top of LLM inference.
The Archive
Earlier work on cloud architecture, Kubernetes, Go, and system design. URLs preserved.
Recent writing
From H100 to Blackwell: What Actually Changes for Inference Architects
A practical architecture view of the shift from H100/H200 to Blackwell: memory, precision, NVLink scale-up, MoE, software, and cost per token.
Speculative Decoding in Production: When Draft Tokens Help and When They Hurt
A practical guide to speculative decoding: why it speeds up autoregressive generation, how to measure acceptance rate, and when it creates operational complexity.
The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation
Why an AI gateway sits on a systems boundary where Rust's ownership, async, cancellation, and no-GC profile become practical advantages.
TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production
A practical comparison of TensorRT-LLM, vLLM, and SGLang across performance, portability, structured generation, cache reuse, deployment, and operations.
Inference Is Not HTTP: The Case for a Purpose-Built Gateway in Rust
A systems case for purpose-built inference gateways: token-aware routing, cache affinity, cancellation-safe streaming, Rust, and GPU-aware operations.
KV-Aware Routing: How Cache Locality Changes Load Balancing for LLMs
Why LLM load balancing needs cache locality, prefix reuse, token cost, and runtime state instead of simple round-robin.
Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second
A practical framework for measuring inference economics: tokens per second, TTFT, TPOT, utilization, quality, energy, and cost per useful token.
Why Agentic Workloads Break Traditional Inference Gateways
Agentic AI turns one user request into a chain of model calls, tools, retries, memory, and cache pressure. The gateway has to grow up.
Rust for Systems Programming: When the Borrow Checker Earns Its Keep
A practical guide to where Rust earns its complexity: memory safety, concurrency, async services, data planes, and adoption paths for production systems.
Disaggregated Inference on Kubernetes: Routing, Scheduling, and Scaling Beyond One GPU
A Kubernetes-native view of disaggregated LLM inference: prefill pools, decode pools, KV transfer, topology, Gateway API Inference Extension, and GPU-aware scheduling.
From the Archive
43 posts
Earlier writing on cloud architecture, Kubernetes, Go, and system design — kept here for reference.