Recent Articles
Autoscaling LLMs by TTFT and TPOT, Not CPU Utilization
Why LLM serving needs autoscaling based on time to first token (TTFT) and time per output token (TPOT), and how Dynamo Planner points toward SLO-aware capacity control.
From H100 to Blackwell: What Actually Changes for Inference Architects
A practical architecture view of the shift from H100/H200 to Blackwell: memory, precision, NVLink scale-up, MoE, software, and cost per token.
Speculative Decoding in Production: When Draft Tokens Help and When They Hurt
A practical guide to speculative decoding: why it speeds up autoregressive generation, how to measure acceptance rate, and when it creates operational complexity.
From Prefill to Decode: Disaggregated Inference as a Distributed Systems Problem
Why splitting prefill and decode can improve LLM serving, and why the real challenge is KV transfer, topology, scheduling, and SLO-aware operation.
The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation
Why an AI gateway sits on a systems boundary where Rust's ownership model, async/await, cancellation semantics, and GC-free execution become practical advantages.
Why Round-Robin Dies in LLM Serving: KV-Aware Routing Explained
A deep but practical explanation of why LLM routing must account for KV cache overlap, prefill cost, decode load, and SLO risk instead of simply rotating requests across workers.
What AI-Native Talent Looks Like in 2026: A Recruiter's Field Guide
A practical guide for talent leaders on identifying AI-native candidates: workflow literacy, judgment, AI-assisted execution, and human skills that still matter.
TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production
A practical comparison of TensorRT-LLM, vLLM, and SGLang across performance, portability, structured generation, cache reuse, deployment, and operations.
Dynamo Is Not an Inference Engine. It Is the Control Plane for Tokens
Why NVIDIA Dynamo is best understood as the distributed control plane around LLM inference engines, not as another engine competing with vLLM, SGLang, or TensorRT-LLM.
The AI Hiring Playbook for 2026: Skills, Signals, and Fewer Shiny Job Titles
A practical talent-acquisition playbook for hiring in the AI era: skills-first scorecards, better work samples, recruiter judgment, and fewer inflated AI job titles.