The Fast Path

Prefill-Decode Disaggregation: Two Worker Pools, One Token Stream

June 28, 2026

Chunked Prefill: How to Stop One Long Prompt from Freezing Everyone Else

June 27, 2026

Continuous Batching: The GPU Schedule That Never Stands Still

June 26, 2026

Streaming Generation: The First Token Is a Product Decision

June 25, 2026

Memory Offloading: Trading Bandwidth for Capacity

June 24, 2026

Dynamic Batching: Waiting Microseconds to Save Milliseconds

June 23, 2026

Graph Optimization: Teaching ONNX and TensorRT to See the Whole Model

June 22, 2026

Sequence Parallelism: Divide the Tokens, Not the Meaning

June 21, 2026

Pipeline Parallelism: Turning Model Depth into an Assembly Line

June 20, 2026

Tensor Parallelism: Splitting One Layer Across Many GPUs

June 19, 2026

Quantized Kernels: Why a 4-Bit Model Is Not Automatically Fast

June 18, 2026

Mixed Precision Inference: Spend Bits Where They Matter

June 17, 2026

Parallel Decoding: Predicting More Than One Future at a Time

June 16, 2026

Early Exit Decoding: Stop Computing Once the Answer Is Clear

June 15, 2026

Batch Inference: When Throughput Matters More Than Immediacy

June 14, 2026

PagedAttention: Virtual Memory for the KV Cache

June 13, 2026

FlashAttention: Why Moving Fewer Bytes Beats Doing Fewer FLOPs

June 12, 2026

Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge

June 11, 2026

KV Caching: The Memory That Makes Token Generation Possible

June 10, 2026