Recent Articles

39 posts · sorted by date
April 2, 2026 · 6 min read

The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You

A production guide to prompt caching, context caching, semantic response caching, exact caching, and the tradeoffs that decide latency, cost, freshness, and correctness.

March 27, 2026 · 8 min read

Autoscaling LLMs by TTFT and TPOT, Not CPU Utilization

Why LLM serving needs autoscaling based on first-token and per-token latency, and how Dynamo Planner points toward SLO-aware capacity control.

March 20, 2026 · 6 min read

From H100 to Blackwell: What Actually Changes for Inference Architects

A practical architecture view of the shift from H100/H200 to Blackwell: memory, precision, NVLink scale-up, MoE, software, and cost per token.

February 27, 2026 · 5 min read

Speculative Decoding in Production: When Draft Tokens Help and When They Hurt

A practical guide to speculative decoding: why it speeds up autoregressive generation, how to measure acceptance rate, and when it creates operational complexity.

February 20, 2026 · 8 min read

From Prefill to Decode: Disaggregated Inference as a Distributed Systems Problem

Why splitting prefill and decode can improve LLM serving, and why the real challenge is KV transfer, topology, scheduling, and SLO-aware operation.

February 6, 2026 · 5 min read

The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation

Why an AI gateway sits on a systems boundary where Rust's ownership, async, cancellation, and no-GC profile become practical advantages.

January 30, 2026 · 8 min read

Why Round-Robin Dies in LLM Serving: KV-Aware Routing Explained

A deep but practical explanation of why LLM routing must account for KV cache overlap, prefill cost, decode load, and SLO risk instead of simply rotating requests across workers.

January 23, 2026 · 10 min read

What AI-Native Talent Looks Like in 2026: A Recruiter's Field Guide

A practical guide for talent leaders on identifying AI-native candidates: workflow literacy, judgment, AI-assisted execution, and human skills that still matter.

January 16, 2026 · 6 min read

TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production

A practical comparison of TensorRT-LLM, vLLM, and SGLang across performance, portability, structured generation, cache reuse, deployment, and operations.

January 9, 2026 · 8 min read

Dynamo Is Not an Inference Engine. It Is the Control Plane for Tokens

Why NVIDIA Dynamo is best understood as the distributed control plane around LLM inference engines, not as another engine competing with vLLM, SGLang, or TensorRT-LLM.