Why Agentic AI Is Bringing CPUs Back Into the Spotlight
A funny thing is happening in AI infrastructure conversations: the CPU has started walking back into the room with confidence.
Not in the old “the CPU runs everything” way. That era is gone for frontier AI. The GPU is still where most of the expensive tensor math happens. It is still the engine that turns parameters, attention, and activations into tokens.
But as workloads move from training to inference, and from inference to agentic systems, the shape of useful work changes. The model does not just consume a batch and produce a loss. It waits for users. It streams. It calls tools. It retrieves context. It validates. It retries. It manages memory. It enforces policy. It handles tenants. It watches budgets. It decides whether a request should go to a fast path, a cheap path, a safe path, or no path at all.
That work is not “just overhead.” In production inference, it is the difference between an AI demo and an AI service.
So yes, the CPU is back.
The twist is that this does not make the GPU story weaker. It makes the full-system story much more important.
The ratio sketch is directionally right
The image that kicked off this post makes a useful argument: the GPU-to-CPU ratio gets less extreme as AI moves through three eras.
In the training era, the center of gravity was obvious. You wanted as much GPU compute, HBM capacity, and interconnect bandwidth as the budget and power envelope allowed. CPUs mattered, but mostly as hosts and feeders. If the GPUs were not busy, the system was failing.
Inference made the CPU more visible. Serving is not a single giant matrix operation. It is thousands or millions of uneven requests fighting for latency, fairness, memory, and cost. Decode is sequential. KV cache grows. Sessions become sticky. Batch shape changes every few milliseconds. Routing becomes a cost decision. Suddenly, the host side matters more.
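To see why "KV cache grows" is a capacity problem and not a throwaway line, here is a rough sizing sketch. The model shape is hypothetical, but the formula itself is the standard one for transformer KV caches:

```python
# Rough KV-cache sizing. The model shape below is hypothetical; the formula
# (2 for keys+values x layers x KV heads x head dim x context x bytes) is standard.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    # bytes_per_value=2 assumes fp16/bf16 storage.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

# A hypothetical 70B-class model: 80 layers, 8 KV heads, head dim 128.
per_session = kv_cache_bytes(80, 8, 128, context_len=32_768)
print(f"{per_session / 2**30:.1f} GiB per 32k-token session")  # ~10 GiB
```

Multiply that by hundreds of concurrent sticky sessions and the host-side bookkeeping problem writes itself.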
Agentic AI turns the dial again. Agents do not just generate a response. They run loops:
- plan
- retrieve
- call a tool
- inspect output
- revise the plan
- call another tool
- verify
- stream a response
- save state
- maybe do it all again
That loop is CPU-rich. It is full of orchestration, state machines, security checks, JSON parsing, tool authorization, sandboxing, queue management, and observability. The GPU still does the model math, but the CPU increasingly decides what work reaches the GPU, when it reaches it, and whether it is allowed to happen.
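A minimal sketch of that loop makes the split visible. Every component here is a stub invented for illustration, not a real API; the shape is the point: one GPU-bound call surrounded by CPU-bound coordination.

```python
from dataclasses import dataclass

# A deliberately tiny agent loop. Every component is a stub invented for
# illustration; only one line stands in for the GPU.

@dataclass
class Step:
    text: str
    tool_call: str | None = None  # tool the model wants to run, if any

class StubModel:
    """Stands in for the GPU-bound model call."""
    def __init__(self) -> None:
        self.calls = 0

    def generate(self, plan: str, context: str) -> Step:
        self.calls += 1
        # First call asks for a tool, second call produces an answer.
        return Step("draft", "search") if self.calls == 1 else Step("final answer")

def run_agent(task: str, model: StubModel, max_steps: int = 8) -> str:
    plan = f"plan for: {task}"                        # CPU: planning
    for _ in range(max_steps):
        context = f"retrieved docs for: {task}"       # CPU/network: retrieval
        step = model.generate(plan, context)          # GPU: the expensive part
        if step.tool_call:
            assert step.tool_call in {"search"}       # CPU: tool authorization
            result = f"output of {step.tool_call}"    # CPU/network: tool I/O
            plan += f" | revised with {result}"       # CPU: state update
            continue
        if step.text:                                 # CPU: verification (trivial here)
            return step.text                          # CPU: stream + save state
    raise RuntimeError("agent exceeded step budget")

print(run_agent("summarize the quarterly report", StubModel()))
```

Count the comments: most lines are coordination, and exactly one touches the model.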
That is the real story. Not “CPU beats GPU.” More like “the control plane is now part of the product.”
What Intel got right
Intel’s Q1 2026 prepared remarks (April 23, 2026) are useful because they say the quiet part plainly: server CPU demand is being pulled forward by the evolution of AI “from foundational training to inference and from inference to agentic.” Intel also called out host CPU roles around memory, security, and networking orchestration.
That is directionally correct.
As AI systems become services instead of experiments, CPUs do more of the work that customers actually feel:
| Production concern | Why CPU-side work grows |
|---|---|
| Multi-tenant fairness | Schedulers need to protect queues, quotas, and latency classes |
| Tool use | Agents need auth, API calls, sandboxes, serialization, and retries |
| Retrieval | Search, ranking, chunking, metadata filtering, and cache lookups sit around the model |
| KV cache management | Long-context and multi-turn systems need memory bookkeeping and placement |
| Safety and policy | Prompt security, PII checks, tool authorization, and audit logs run outside the model |
| Observability | Token SLOs, traces, error reasons, and cost metrics are control-plane work |
This does not make GPUs less important. It makes the system around the GPU more important.
And that is where the argument starts to get interesting.
The platform was already moving this way
It is tempting to describe this transition as “GPUs got less important.” That misses the actual architecture move.
The leading AI infrastructure platforms have been moving from accelerator cards to complete AI factories: GPU, CPU, memory, NVLink, networking, DPUs, rack-scale management, and software. Grace Blackwell and GB300 are not just faster chips. They are a statement about where the unit of design is moving.
The unit is becoming the rack.
The GB300 NVL72 product page describes a fully liquid-cooled rack-scale platform with 72 Blackwell Ultra GPUs and 36 Grace CPUs. The same page lists 130 TB/s of NVLink bandwidth and 37 TB of fast memory across the system. Microsoft’s Azure GB300 announcement uses the same rack-level framing and says its ND GB300 v6 VMs are optimized for reasoning models, agentic AI systems, and multimodal generative AI.
That is not a training-era 8:1 world. That is a 2-GPU-per-Grace-CPU rack, tied together by a very serious scale-up fabric.
This is the part I think gets underappreciated. If the CPU becomes more important in agentic inference, the best platforms are not caught flat-footed. Grace is already in the system. NVLink-C2C made CPU-GPU coupling a design feature, not an afterthought. NVLink and NVSwitch made scale-up domains bigger. ConnectX, Quantum-X, Spectrum-X, BlueField, Dynamo, NIM, TensorRT-LLM, and Mission Control all point in the same direction:
The competitive object is not a GPU. It is the whole path from request to useful token.
That is the right answer to the CPU comeback: “Good. Put the CPU inside the AI factory too.”
Why GB300 matters for agentic inference
Reasoning models and agents stretch inference in two uncomfortable ways.
First, the request is longer lived. A classic chat request can already be slow, but an agent may run a multi-step workflow with intermediate model calls, retrieval, tool calls, validations, and fallbacks. The user sees one task. The platform sees a graph.
Second, the platform has to keep GPUs useful while the graph waits on non-GPU work. A model call may be followed by a database lookup, a browser step, a vector search, a policy check, or a tool response. If the runtime is naive, the GPU ends up waiting on the world. If the runtime is good, CPU-side orchestration and GPU-side generation overlap cleanly.
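A toy asyncio sketch shows the difference. Nothing here is a real serving API, and the timings are invented; the point is that a good runtime lets the tool wait and the decode loop overlap instead of serializing them:

```python
import asyncio

# Invented timings, no real serving API: while request A waits 200 ms on a
# tool, the simulated GPU keeps decoding tokens for request B.

async def gpu_decode(request: str, tokens: int) -> str:
    for _ in range(tokens):
        await asyncio.sleep(0.01)   # pretend each decode step takes 10 ms
    return f"{request}: decoded {tokens} tokens"

async def tool_call(request: str) -> str:
    await asyncio.sleep(0.2)        # pretend the tool takes 200 ms
    return f"{request}: tool result"

async def main() -> None:
    # A naive runtime runs these back to back and idles the GPU for the
    # whole tool call. Overlapping them hides the tool latency entirely:
    results = await asyncio.gather(
        tool_call("request-A"),
        gpu_decode("request-B", tokens=15),
    )
    print(results)

asyncio.run(main())
```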
That is why the GB300 shape is meaningful:
- Grace CPUs handle host-side scheduling, memory movement, runtime services, and control-plane logic.
- Blackwell Ultra GPUs handle the expensive model math.
- NVLink keeps the rack-scale GPU domain fast enough for larger models and long-context workloads.
- Fast memory gives the system more room for weights, KV cache, and active sessions.
- High-bandwidth networking lets the AI factory scale beyond a single rack.
- Software such as Dynamo and Mission Control gives the hardware an operating model.
This is also why I am skeptical of infrastructure discussions that reduce everything to peak FLOPS. Peak FLOPS are useful the way horsepower is useful. You still need tires, steering, cooling, brakes, roads, and a driver who knows where the vehicle is going.
For agentic AI, the “driver” is often the CPU-side runtime.
The CPU does not replace the GPU. It protects GPU utilization.
The best way to think about the CPU in modern inference is not as a rival. Think of it as the stage manager.
The GPU is on stage doing the expensive performance. The CPU makes sure the right actor walks on at the right time, the props are present, the lights turn on, nobody trips over the cables, and the audience does not notice the chaos backstage.
In less theatrical terms, the CPU protects GPU utilization by handling:
- request admission
- token budget estimation
- batching decisions
- model and backend selection
- prefix and KV-cache routing
- stream cancellation
- tool-call authorization
- prompt security checks
- policy enforcement
- telemetry emission
- retry and fallback logic
- tenant cost accounting
None of that work creates tokens directly. All of it decides whether token generation is fast, safe, cheap, and predictable.
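As one concrete example, here is what a token-budget admission gate might look like. The estimator and thresholds are invented for illustration; a real system would use a real tokenizer and real queue state:

```python
from collections import deque

# A toy admission gate: CPU-side code decides whether a request may reach
# the GPU at all. Estimator and limits are invented for illustration.

class AdmissionGate:
    def __init__(self, max_queued_tokens: int = 10_000) -> None:
        self.max_queued_tokens = max_queued_tokens
        self.queue: deque[tuple[str, int]] = deque()
        self.queued_tokens = 0

    def estimate_tokens(self, prompt: str) -> int:
        # Crude heuristic: ~1 token per 4 characters, plus a decode allowance.
        return len(prompt) // 4 + 512

    def admit(self, request_id: str, prompt: str) -> bool:
        cost = self.estimate_tokens(prompt)
        if self.queued_tokens + cost > self.max_queued_tokens:
            return False  # shed load here, before it occupies GPU memory
        self.queue.append((request_id, cost))
        self.queued_tokens += cost
        return True

gate = AdmissionGate()
print(gate.admit("req-1", "summarize this report " * 50))  # True
print(gate.admit("req-2", "x" * 100_000))                  # False: over budget
```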
The brutal truth is that bad CPU-side orchestration can make a beautiful GPU cluster look mediocre. You can buy excellent accelerators and still lose on tail latency because requests are routed like web traffic, KV cache locality is ignored, tool calls block the wrong queues, or cancellation does not reach the backend.
That is not a GPU problem. That is an architecture problem.
Vera Rubin is the clearer signal
GB300 is the near-term example. Vera Rubin is the stronger long-term clue.
The naming is easy to mix up, so let us be precise: Vera is the CPU. Rubin is the GPU. The Vera Rubin platform is the co-designed system around them.
The Vera CPU page positions it as purpose-built for reinforcement learning and agentic AI, with the CPU directing data movement, managing memory, and orchestrating system control. The Vera Rubin platform page describes Vera Rubin NVL72 as unifying 72 Rubin GPUs, 36 Vera CPUs, ConnectX-9 SuperNICs, and BlueField-4 DPUs, with sixth-generation NVLink and NVLink Switch.
That is the CPU comeback written directly into the roadmap.
The interesting thing about Vera Rubin is not just that the CPU is present. It is that the CPU is described as part of the agentic AI path.
That matters because agentic systems create a lot of work outside the model:
- thousands of parallel software environments for reinforcement learning
- tool execution and validation
- memory-heavy data processing
- KV-cache management
- orchestration across GPU and non-GPU work
- security boundaries around proprietary data and models
- reliable scale-out networking
In a world like that, a CPU designed merely as a generic host is not enough. You want the CPU, GPU, memory, and fabric designed together. The framing around Vera Rubin is exactly that: treat the data center, not the chip, as the unit of compute.
It is a neat strategic move. The more people say “AI needs CPUs again,” the more Grace today and Vera next look like they were aimed at this moment all along.
What changes in real architecture reviews
If you are designing AI platforms, this CPU-GPU shift should change the questions you ask.
Old question: Which GPU gives the most peak compute?
Better question: Which system delivers the most useful tokens under the latency and cost targets?
Old question: Can the model fit?
Better question: Can the model, KV cache, tool loop, tenant traffic pattern, and retry behavior fit without breaking SLOs?
Old question: How many GPUs per node?
Better question: How much CPU, memory, networking, and scheduling capacity do we need per active agent workflow?
Old question: Is the model server healthy?
Better question: Is the serving system preserving cache locality, honoring cancellation, avoiding queue buildup, enforcing policy, and keeping GPUs busy?
Old question: What is the cheapest accelerator hour?
Better question: What is the cheapest correct answer delivered within the product’s latency envelope?
That last one is the whole game.
The agentic stack is mostly coordination
Here is a simple mental model for an agentic inference request:
user task
-> auth and policy
-> prompt security
-> context retrieval
-> plan
-> model call
-> tool call
-> verify
-> another model call
-> stream answer
-> audit and memory update

Only some of those steps are GPU-heavy. Several are CPU-heavy. Some are network-heavy. Some are storage-heavy. Some are security-sensitive. Some are just annoying in exactly the way production systems are always annoying.
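One way to make that concrete is to tag each step with the resource it mostly burns. The tags below are assumptions for illustration, not measurements:

```python
# Rough resource tags for the steps above. The assignments are illustrative
# assumptions; real profiles depend entirely on the workload.

PIPELINE = [
    ("auth and policy",         "cpu"),
    ("prompt security",         "cpu"),
    ("context retrieval",       "network/storage"),
    ("plan",                    "cpu"),
    ("model call",              "gpu"),
    ("tool call",               "network/cpu"),
    ("verify",                  "cpu"),
    ("another model call",      "gpu"),
    ("stream answer",           "cpu/network"),
    ("audit and memory update", "cpu/storage"),
]

gpu_steps = sum(1 for _, resource in PIPELINE if "gpu" in resource)
print(f"{gpu_steps} of {len(PIPELINE)} steps are GPU-bound")  # 2 of 10
```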
This is why a platform view wins. You cannot optimize only the GPU kernel and declare victory. You also need:
- a gateway that understands tokens, tools, and cancellation
- a scheduler that understands prefill, decode, and queue fairness
- a cache layer that understands prefixes and KV placement
- a policy layer that can stop unsafe tool use
- observability that measures time-to-first-token and time-per-output-token (see the sketch after this list)
- capacity planning that includes CPU per GPU, not just GPU count
- networking that keeps scale-out from becoming the hidden tax
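For the observability bullet, measuring the two latency numbers that matter is not exotic. A minimal sketch, with a simulated token stream standing in for a real backend:

```python
import time

# Minimal TTFT / TPOT measurement around a token stream. The generator is a
# simulation with invented timings; the timing logic is the part that matters.

def fake_stream(n_tokens: int):
    time.sleep(0.05)               # pretend prefill takes 50 ms
    for i in range(n_tokens):
        time.sleep(0.01)           # pretend each decode step takes 10 ms
        yield f"tok{i}"

def measure(stream):
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start     # time to first token
        count += 1
    total = time.perf_counter() - start
    tpot = (total - ttft) / max(count - 1, 1)  # time per output token
    return ttft, tpot

ttft, tpot = measure(fake_stream(20))
print(f"TTFT: {ttft*1000:.0f} ms, TPOT: {tpot*1000:.1f} ms")
```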
The fun part is that this makes infrastructure engineering interesting again. The less fun part is that it gives us new ways to be wrong.
Where the original ratio can mislead
The 7-8:1 (training), 3-4:1 (inference), 1:1 (agentic) framing is useful, but only as a sketch.
Real ratios vary by:
- model architecture
- context length
- batch shape
- quantization
- prefill/decode split
- use of disaggregated serving
- tool-call intensity
- retrieval pattern
- tenant mix
- whether CPUs are inside the accelerated rack or external to it
- whether you count CPU cores, sockets, chips, servers, or total memory bandwidth
A small agent with heavy browser automation may be CPU-heavy. A giant long-context reasoning model may still be GPU-memory-heavy. A retrieval-heavy enterprise assistant might bottleneck on storage and network long before either CPU or GPU is fully happy.
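To see how fast the mix moves the numbers, here is a back-of-the-envelope sketch. Every input below is invented; do not plan capacity from it:

```python
# Back-of-the-envelope, with invented numbers: how workload mix moves the
# CPU:GPU ratio. Purely illustrative, not a capacity-planning model.

def cpu_cores_per_gpu(requests_per_gpu_per_s: float,
                      cpu_ms_per_request: float,
                      core_utilization_target: float = 0.6) -> float:
    cpu_seconds_per_second = requests_per_gpu_per_s * cpu_ms_per_request / 1000
    return cpu_seconds_per_second / core_utilization_target

# Chat-style serving: light CPU work per request.
print(cpu_cores_per_gpu(50, cpu_ms_per_request=5))    # ~0.4 cores per GPU

# Agentic serving: heavy orchestration, parsing, and tool handling per request.
print(cpu_cores_per_gpu(50, cpu_ms_per_request=80))   # ~6.7 cores per GPU
```

Same request rate, one assumption changed, and the host-side requirement moves by more than an order of magnitude.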
So do not buy infrastructure from a ratio meme.
Use the ratio as a warning: if your architecture assumes the CPU is a boring host, you are probably under-designing the control plane.
My practical take
The CPU is becoming more important because AI is becoming more useful.
That sounds glib, but it is true. Training is mainly about producing weights. Inference is about producing service. Agentic inference is about producing outcomes. Outcomes require coordination.
This is why I like the modern CPU+GPU system direction. Grace Blackwell, GB300, and Vera Rubin do not pretend that the GPU lives alone. They pull CPU, GPU, memory, networking, and software into the same design conversation.
That is exactly where AI infrastructure is going.
The future is not “all GPU.” It is also not “the CPU strikes back.” It is a rack-scale, cluster-scale, software-defined AI factory where the CPU and GPU have clear jobs and the fabric between them is first-class.
The GPU creates the tokens.
The CPU makes sure those tokens become a product.
And the winners will be the platforms that make both look boring in production.
Sources and receipts
- Intel Q1 2026 prepared remarks: Comments from CEO Lip-Bu Tan and CFO Dave Zinsner.
- GB300 NVL72 specifications: GB300 NVL72 product page.
- Microsoft Azure GB300 deployment details: Azure GB300 NVL72 cluster announcement.
- Vera CPU positioning: Vera CPU product page.
- Vera Rubin platform details: Infrastructure for Scalable AI Reasoning.
- Vera Rubin announcement: Vera Rubin Opens Agentic AI Frontier.
