GPU

Production LLM Systems Tutorial 3: Scalable Inference Architecture (May 9, 2026)
Why Agentic AI Is Bringing CPUs Back Into the Spotlight (May 3, 2026)
Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production (April 16, 2026)
From H100 to Blackwell: What Actually Changes for Inference Architects (March 20, 2026)
The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation (February 6, 2026)
TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production (January 16, 2026)
Inference Is Not HTTP: The Case for a Purpose-Built Gateway in Rust (December 8, 2025)
Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second (November 7, 2025)
Disaggregated Inference on Kubernetes: Routing, Scheduling, and Scaling Beyond One GPU (August 29, 2025)
Prefill vs Decode: The Hidden Split That Shapes Every LLM Serving Architecture (August 8, 2025)
Inference Is a Memory Problem: KV Cache, HBM, and the Real Cost of Long Context (July 18, 2025)