Ace The Cloud Posts Archive About

Inference

13/20 - Graph Optimization: Teaching ONNX and TensorRT to See the Whole Model

June 22, 2026

12/20 - Sequence Parallelism: Divide the Tokens, Not the Meaning

June 21, 2026

11/20 - Pipeline Parallelism: Turning Model Depth into an Assembly Line

June 20, 2026

8/20 - Mixed Precision Inference: Spend Bits Where They Matter

June 17, 2026

Production LLM Systems Tutorial 1: End-to-End Application Design

May 9, 2026

Production LLM Systems Tutorial 2: Latency, Cost, and Quality

May 9, 2026

Production LLM Systems Tutorial 3: Scalable Inference Architecture

May 9, 2026

Production LLM Systems Tutorial 9: Cost Optimization

May 9, 2026

Your Token Bill Has a Leak: Cost Monitoring for Hidden LLM Waste

May 6, 2026

Reduce LLM Inference Cost by 60% Without Serving Stale Answers

May 5, 2026

Why Agentic AI Is Bringing CPUs Back Into the Spotlight

May 3, 2026

Agentic AI Needs Smarter Inference: Hints, Priority, and Cache Lifecycle

April 17, 2026

Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production

April 16, 2026

KV Cache at Fleet Scale: The Memory System Hiding Inside Every LLM Platform

April 9, 2026

The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You

April 2, 2026

Autoscaling LLMs by TTFT and TPOT, Not CPU Utilization

March 27, 2026

From H100 to Blackwell: What Actually Changes for Inference Architects

March 20, 2026

Speculative Decoding in Production: When Draft Tokens Help and When They Hurt

February 27, 2026

From Prefill to Decode: Disaggregated Inference as a Distributed Systems Problem

February 20, 2026

The Rust Case for AI Gateways: Backpressure, Streaming, and Failure Isolation

February 6, 2026

Why Round-Robin Dies in LLM Serving: KV-Aware Routing Explained

January 30, 2026

TensorRT-LLM vs vLLM vs SGLang: Choosing an Inference Engine for Production

January 16, 2026

Dynamo Is Not an Inference Engine. It Is the Control Plane for Tokens

January 9, 2026

Inference Is Not HTTP: The Case for a Purpose-Built Gateway in Rust

December 8, 2025

KV-Aware Routing: How Cache Locality Changes Load Balancing for LLMs

November 21, 2025

Tokenomics for Engineers: Measuring Throughput per Dollar Instead of Tokens per Second

November 7, 2025

Why Agentic Workloads Break Traditional Inference Gateways

October 10, 2025

Disaggregated Inference on Kubernetes: Routing, Scheduling, and Scaling Beyond One GPU

August 29, 2025

Prefill vs Decode: The Hidden Split That Shapes Every LLM Serving Architecture

August 8, 2025

Inference Is a Memory Problem: KV Cache, HBM, and the Real Cost of Long Context

July 18, 2025

© 2026 AceTheCloud. Independent, non-commercial publication. Views are the author’s own and do not represent current or any past employer.