Recent Articles

38 posts · sorted by date
May 6, 2026 · 4 min read

Agents Need Seatbelts: Guardrails and Infinite-Loop Detection for Tool-Using AI

A systems guide to agent guardrails, tool authorization, loop detection, budgets, recursion limits, no-progress checks, and production safety controls.

May 6, 2026 · 5 min read

Your Token Bill Has a Leak: Cost Monitoring for Hidden LLM Waste

A production guide to hidden token leaks: retries, tool loops, unused RAG context, schema bloat, cached-token accounting, reasoning tokens, and cost observability.

May 5, 2026 · 15 min read

Reduce LLM Inference Cost by 60% Without Serving Stale Answers

A production architecture for semantic response caching, freshness gates, prompt/KV reuse, and cache-aware routing when users ask the same questions in different words.

May 3, 2026 · 11 min read

Why Agentic AI Is Bringing CPUs Back Into the Spotlight

Why agentic AI makes CPUs more important again, and why modern CPU+GPU platforms are starting to look like the natural shape of the AI cloud.

April 29, 2026 · 11 min read

YC's 2026 Startup Map: AI Has Left the Chatbox

A talent and market-trends read on Y Combinator's latest Requests for Startups: AI-native services, company brains, agent software, chips, hardware, medicine, agriculture, space, and enterprise workflows.

April 23, 2026 · 5 min read

Your RAG Demo Passed. Your RAG System Needs a Judge: RAGAS, Humans, and Evidence

A deep guide to evaluating RAG systems with RAGAS, retrieval metrics, faithfulness checks, human evals, golden datasets, and production feedback loops.

April 17, 2026 · 10 min read

Agentic AI Needs Smarter Inference: Hints, Priority, and Cache Lifecycle

Why agentic workloads need inference runtimes that understand priority, expected output length, speculative prefill, and KV cache lifecycle instead of treating every request the same.

April 16, 2026 · 5 min read

Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production

A practical comparison of speculative decoding and quantization: what each optimizes, where each fails, how they interact, and what to measure before rollout.

April 9, 2026 · 6 min read

KV Cache at Fleet Scale: The Memory System Hiding Inside Every LLM Platform

A deep systems guide to KV cache capacity, PagedAttention, prefix reuse, eviction, offload, routing, and fleet-level cache management for LLM inference.

April 2, 2026 · 6 min read

The Cache Has Layers: Prompt Caching, Semantic Caching, and When Each One Betrays You

A production guide to prompt caching, context caching, semantic response caching, exact caching, and the tradeoffs that decide latency, cost, freshness, and correctness.