LLM Inference

Streaming Generation: The First Token Is a Product Decision

June 25, 2026

Memory Offloading: Trading Bandwidth for Capacity

June 24, 2026

Tensor Parallelism: Splitting One Layer Across Many GPUs

June 19, 2026

Quantized Kernels: Why a 4-Bit Model Is Not Automatically Fast

June 18, 2026

Parallel Decoding: Predicting More Than One Future at a Time

June 16, 2026

Early Exit Decoding: Stop Computing Once the Answer Is Clear

June 15, 2026

FlashAttention: Why Moving Fewer Bytes Beats Doing Fewer FLOPs

June 12, 2026

Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge

June 11, 2026

KV Caching: The Memory That Makes Token Generation Possible

June 10, 2026