AI Infrastructure Engineer

21/21 - Below PyTorch: Profiling, Compilation, and CUDA Kernel Optimization (July 21, 2026)
20/21 - The Ground Beneath AI: Linux, Networking, and Storage (July 20, 2026)
19/21 - Shipping Models Like Software: CI/CD, MLflow, and Registries (July 19, 2026)
18/21 - Assume the Prompt Is Hostile: Security and Guardrails (July 18, 2026)
17/21 - From Kafka to Tokens: Streaming Data and Online Inference (July 17, 2026)
16/21 - Agents Need Infrastructure Too: MCP and Workflow Orchestration (July 16, 2026)
15/21 - The Router Is Part of the Model: Routing, Hedging, and Fallback (July 15, 2026)
14/21 - Benchmarking Without Lying: Evals, Load Tests, and A/B Experiments (July 14, 2026)
13/21 - Can You Debug a Token? Observability for AI Systems (July 13, 2026)
12/21 - Cache the Right Thing: Prompt, Semantic, and Cost-Aware Reuse (July 12, 2026)
11/21 - RAG That Survives Production: Embeddings, Retrieval, and Evidence (July 11, 2026)
10/21 - Fine-Tuning Without the Full Bill: LoRA, QLoRA, and PEFT (July 10, 2026)
9/21 - One Model, Many Accelerators: Multi-GPU and Multi-Node Inference (July 9, 2026)
8/21 - The Fabric Between GPUs: NCCL, InfiniBand, RoCE, and GPUDirect (July 8, 2026)
7/21 - Kubernetes Meets GPUs: Containers, Scheduling, and Isolation (July 7, 2026)
6/21 - The Serving Layer: Triton, vLLM, KServe, Ray Serve, and SGLang (July 6, 2026)
5/21 - Training Across the Fleet: DDP, FSDP, DeepSpeed, and ZeRO (July 5, 2026)
4/21 - The Memory of a Conversation: KV, Prefix Reuse, Speculation, and Throughput (July 4, 2026)
3/21 - The Inference Engine Room: vLLM, TensorRT-LLM, SGLang, and llama.cpp (July 3, 2026)
2/21 - Smaller Numbers, Faster Models: Quantization and Batching (July 2, 2026)
1/21 - Inside the GPU: From SMs to HBM Without the Hand-Waving (July 1, 2026)