AI Infrastructure Engineer

21/21 - Below PyTorch: Profiling, Compilation, and CUDA Kernel Optimization

July 21, 2026

20/21 - The Ground Beneath AI: Linux, Networking, and Storage

July 20, 2026

19/21 - Shipping Models Like Software: CI/CD, MLflow, and Registries

July 19, 2026

18/21 - Assume the Prompt Is Hostile: Security and Guardrails

July 18, 2026

17/21 - From Kafka to Tokens: Streaming Data and Online Inference

July 17, 2026

16/21 - Agents Need Infrastructure Too: MCP and Workflow Orchestration

July 16, 2026

15/21 - The Router Is Part of the Model: Routing, Hedging, and Fallback

July 15, 2026

14/21 - Benchmarking Without Lying: Evals, Load Tests, and A/B Experiments

July 14, 2026

13/21 - Can You Debug a Token? Observability for AI Systems

July 13, 2026

12/21 - Cache the Right Thing: Prompt, Semantic, and Cost-Aware Reuse

July 12, 2026

11/21 - RAG That Survives Production: Embeddings, Retrieval, and Evidence

July 11, 2026

10/21 - Fine-Tuning Without the Full Bill: LoRA, QLoRA, and PEFT

July 10, 2026

9/21 - One Model, Many Accelerators: Multi-GPU and Multi-Node Inference

July 9, 2026

8/21 - The Fabric Between GPUs: NCCL, InfiniBand, RoCE, and GPUDirect

July 8, 2026

7/21 - Kubernetes Meets GPUs: Containers, Scheduling, and Isolation

July 7, 2026

6/21 - The Serving Layer: Triton, vLLM, KServe, Ray Serve, and SGLang

July 6, 2026

5/21 - Training Across the Fleet: DDP, FSDP, DeepSpeed, and ZeRO

July 5, 2026

4/21 - The Memory of a Conversation: KV, Prefix Reuse, Speculation, and Throughput

July 4, 2026

3/21 - The Inference Engine Room: vLLM, TensorRT-LLM, SGLang, and llama.cpp

July 3, 2026

2/21 - Smaller Numbers, Faster Models: Quantization and Batching

July 2, 2026

1/21 - Inside the GPU: From SMs to HBM Without the Hand-Waving

July 1, 2026