Recent Articles

38 posts · sorted by date
December 8, 2025 · 14 min read

Inference Is Not HTTP: The Case for a Purpose-Built Gateway in Rust

A systems case for purpose-built inference gateways: token-aware routing, cache affinity, cancellation-safe streaming, Rust, and GPU-aware operations.

November 21, 2025 · 6 min read

KV-Aware Routing: How Cache Locality Changes Load Balancing for LLMs

Why LLM load balancing must account for cache locality, prefix reuse, token cost, and runtime state instead of relying on simple round-robin.

November 7, 2025 · 6 min read

Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second

A practical framework for measuring inference economics: tokens per second, TTFT, TPOT, utilization, quality, energy, and cost per useful token.

October 10, 2025 · 6 min read

Why Agentic Workloads Break Traditional Inference Gateways

Agentic AI turns one user request into a chain of model calls, tools, retries, memory, and cache pressure. The gateway has to grow up.

September 15, 2025 · 9 min read

Rust for Systems Programming: When the Borrow Checker Earns Its Keep

A practical guide to where Rust earns its complexity: memory safety, concurrency, async services, data planes, and adoption paths for production systems.

August 29, 2025 · 6 min read

Disaggregated Inference on Kubernetes: Routing, Scheduling, and Scaling Beyond One GPU

A Kubernetes-native view of disaggregated LLM inference: prefill pools, decode pools, KV transfer, topology, Gateway API Inference Extension, and GPU-aware scheduling.

August 8, 2025 · 6 min read

Prefill vs Decode: The Hidden Split That Shapes Every LLM Serving Architecture

Why LLM inference has two very different phases, and why the prefill/decode split changes batching, routing, SLOs, and GPU utilization.

July 18, 2025 · 6 min read

Inference Is a Memory Problem: KV Cache, HBM, and the Real Cost of Long Context

A practical tour of why long-context LLM inference is governed less by raw FLOPS and more by KV cache, HBM capacity, and memory traffic.