Recent Articles
Inference Is Not HTTP: The Case for a Purpose-Built Gateway in Rust
A systems case for purpose-built inference gateways: token-aware routing, cache affinity, cancellation-safe streaming, Rust, and GPU-aware operations.
KV-Aware Routing: How Cache Locality Changes Load Balancing for LLMs
Why LLM load balancing needs to account for cache locality, prefix reuse, token cost, and runtime state instead of relying on simple round-robin.
Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second
A practical framework for measuring inference economics: tokens per second, TTFT, TPOT, utilization, quality, energy, and cost per useful token.
Why Agentic Workloads Break Traditional Inference Gateways
Agentic AI turns one user request into a chain of model calls, tool invocations, and retries, with memory and cache pressure along the way. The gateway has to grow up.
Rust for Systems Programming: When the Borrow Checker Earns Its Keep
A practical guide to where Rust earns its complexity: memory safety, concurrency, async services, data planes, and adoption paths for production systems.
Disaggregated Inference on Kubernetes: Routing, Scheduling, and Scaling Beyond One GPU
A Kubernetes-native view of disaggregated LLM inference: prefill pools, decode pools, KV transfer, topology, Gateway API Inference Extension, and GPU-aware scheduling.
Prefill vs Decode: The Hidden Split That Shapes Every LLM Serving Architecture
Why LLM inference has two very different phases, and why the prefill/decode split changes batching, routing, SLOs, and GPU utilization.
Inference Is a Memory Problem: KV Cache, HBM, and the Real Cost of Long Context
A practical tour of why long-context LLM inference is governed less by raw FLOPS and more by KV cache, HBM capacity, and memory traffic.