Ace The Cloud Posts Archive About

LLM Inference

12/21 - Cache the Right Thing: Prompt, Semantic, and Cost-Aware Reuse

July 12, 2026

2/21 - Smaller Numbers, Faster Models: Quantization and Batching

July 2, 2026

20/20 - Expert Parallelism: Routing Tokens Through a City of Specialists

June 30, 2026

16/20 - Streaming Generation: The First Token Is a Product Decision

June 25, 2026

15/20 - Memory Offloading: Trading Bandwidth for Capacity

June 24, 2026

10/20 - Tensor Parallelism: Splitting One Layer Across Many GPUs

June 19, 2026

9/20 - Quantized Kernels: Why a 4-Bit Model Is Not Automatically Fast

June 18, 2026

7/20 - Parallel Decoding: Predicting More Than One Future at a Time

June 16, 2026

6/20 - Early Exit Decoding: Stop Computing Once the Answer Is Clear

June 15, 2026

3/20 - FlashAttention: Why Moving Fewer Bytes Beats Doing Fewer FLOPs

June 12, 2026

2/20 - Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge

June 11, 2026

1/20 - KV Caching: The Memory That Makes Token Generation Possible

June 10, 2026

© 2026 AceTheCloud. Independent, non-commercial publication. Views are the author’s own and do not represent current or any past employer.