Autoscaling LLMs by TTFT and TPOT, Not CPU Utilization
CPU utilization is a familiar metric. GPU utilization is a seductive metric. Queue depth is a useful metric.
None of them is the user experience.
The user experience is closer to two questions:
- How long until the first token appears?
- Once tokens start, do they keep arriving smoothly?
That is why autoscaling LLMs like ordinary web services gets awkward fast. A pod can look busy while users wait too long for the first token. A GPU can look healthy while decode streams are backing up. A cluster can have spare aggregate capacity and still violate latency targets because the spare capacity sits in the wrong phase.
Inference autoscaling has to listen to token pain.
Dynamo Planner is interesting because it moves the conversation from generic machine pressure toward inference-specific capacity control. It can reason about throughput, live load, and SLA-style targets such as Time To First Token (TTFT) and inter-token or per-token latency.
Why web-style scaling feels wrong
Classic autoscaling often starts with CPU utilization, memory utilization, request rate, or queue length. These are not bad metrics. They are just incomplete for LLM serving.
An LLM request has shape:
- input tokens
- expected output tokens
- cache overlap
- prefill cost
- decode duration
- streaming lifetime
- memory footprint
- SLO class
Two requests can have the same HTTP method, same endpoint, same model, and wildly different cost.
A tiny prompt that emits a short answer and a huge long-context agent turn are not equivalent. Scaling on request count alone misses that. Scaling on CPU utilization misses GPU-side pressure. Scaling on GPU utilization misses phase-specific latency. Scaling on average latency misses the tails. Scaling on vibes remains unsupported by Prometheus, which is probably for the best.
The model service needs to know what kind of token work is piling up.
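To make that concrete, here is a minimal sketch of a request-shape descriptor in Python. The field names and numbers are illustrative assumptions, not a Dynamo or vLLM API; the point is that two requests to the same endpoint can carry wildly different prefill and decode cost.

```python
from dataclasses import dataclass


@dataclass
class RequestShape:
    """Illustrative request descriptor; fields are hypothetical, not any real serving API."""
    input_tokens: int
    expected_output_tokens: int
    cached_prefix_tokens: int   # prompt tokens already covered by KV cache overlap
    slo_class: str              # e.g. "interactive" vs "batch"

    def prefill_tokens(self) -> int:
        # Prefill only has to process the uncached part of the prompt.
        return max(self.input_tokens - self.cached_prefix_tokens, 0)

    def decode_steps(self) -> int:
        # Decode cost scales with the number of output tokens streamed.
        return self.expected_output_tokens


# Same endpoint, same model, wildly different cost:
chat_turn = RequestShape(300, 150, cached_prefix_tokens=250, slo_class="interactive")
agent_turn = RequestShape(90_000, 4_000, cached_prefix_tokens=0, slo_class="interactive")

print(chat_turn.prefill_tokens(), chat_turn.decode_steps())    # 50 prefill tokens, 150 decode steps
print(agent_turn.prefill_tokens(), agent_turn.decode_steps())  # 90000 prefill tokens, 4000 decode steps
```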
TTFT and TPOT explain different pain
TTFT, Time To First Token, is the “is this thing alive?” metric.
It is affected by:
- frontend and router overhead
- queueing before admission
- prefill work
- cache misses
- transfer delays in disaggregated serving
- cold workers or model loading issues
TPOT, Time Per Output Token, is the “does streaming feel smooth?” metric. Some systems also talk about ITL, inter-token latency, which is the time between output tokens. The Dynamo Planner docs use ITL in configuration examples, while the introduction page frames user targets as TTFT and TPOT. The point is the same: after the first token, users still care about the pace of generation.
TPOT or ITL is affected by:
- decode worker pressure
- batch scheduling
- active output streams
- KV cache memory pressure
- model/backend performance
- topology and communication overhead
If TTFT is bad and TPOT is fine, you probably have a prefill, routing, cache, or admission problem.
If TTFT is fine and TPOT is bad, you probably have a decode or streaming pressure problem.
If both are bad, your system is asking for help with both hands.
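Neither metric requires anything exotic to compute. A minimal sketch, assuming you have absolute token arrival timestamps per request; the SLO thresholds in `diagnose` are made-up placeholders, not Dynamo defaults:

```python
def ttft_and_tpot(request_start: float, token_times: list[float]) -> tuple[float, float | None]:
    """TTFT and a simple TPOT from absolute output-token timestamps (seconds)."""
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None
    # TPOT here is the mean gap between output tokens; the individual gaps are the ITL samples.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, sum(gaps) / len(gaps)


def diagnose(ttft_p95: float, tpot_p95: float,
             ttft_slo: float = 0.5, tpot_slo: float = 0.05) -> str:
    """Map tail-latency breaches to the phase most likely responsible."""
    ttft_bad, tpot_bad = ttft_p95 > ttft_slo, tpot_p95 > tpot_slo
    if ttft_bad and tpot_bad:
        return "both phases constrained"
    if ttft_bad:
        return "prefill / routing / cache / admission pressure"
    if tpot_bad:
        return "decode / streaming pressure"
    return "within SLO"
```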
Planner is a control loop, not a dashboard
Dynamo’s Planner documentation describes scaling modes that can run independently or together:
- throughput-based scaling
- load-based scaling
Throughput-based scaling uses pre-deployment engine performance data and traffic prediction to compute replica counts needed for latency targets. The docs describe this as the primary mode for production deployments and note a default adjustment interval of 180 seconds.
Load-based scaling uses ForwardPassMetrics from the event plane and online regression to react quickly to bursts. The docs describe a shorter default adjustment interval of 5 seconds.
When both are enabled, throughput-based scaling can provide a capacity floor while load-based scaling reacts above that floor.
That is a very reasonable split. Capacity planning and burst response are different jobs. Asking one loop to do both perfectly is how autoscalers become moody.
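A rough sketch of that split, in Python rather than Planner's actual code: a slow loop computes a capacity floor from profiled throughput and a traffic forecast, a fast loop reacts to live per-replica load, and the final decision never drops below the floor. The cadence comments echo the docs; the formulas and parameter names are assumptions.

```python
import math


# Slow loop (throughput-based, ~180 s cadence per the docs): capacity floor from
# pre-deployment profiling plus a traffic forecast.
def throughput_floor(predicted_rps: float, avg_tokens_per_request: float,
                     profiled_tokens_per_sec_per_replica: float) -> int:
    needed = predicted_rps * avg_tokens_per_request / profiled_tokens_per_sec_per_replica
    return max(1, math.ceil(needed))


# Fast loop (load-based, ~5 s cadence per the docs): classic ratio scaling on a
# live per-replica load signal, e.g. derived from forward-pass metrics.
def load_reaction(current_replicas: int, per_replica_load: float,
                  target_per_replica_load: float) -> int:
    return max(1, math.ceil(current_replicas * per_replica_load / target_per_replica_load))


def desired_replicas(floor: int, reactive: int, max_replicas: int) -> int:
    # The slow loop provides the floor; the fast loop only ever scales above it.
    return min(max(floor, reactive), max_replicas)
```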
Why prefill and decode should scale differently
In aggregated serving, every worker handles the full lifecycle. Scaling adds more full-service workers. That is simple and often correct.
In disaggregated serving, prefill and decode workers can scale independently. This is where inference-aware autoscaling becomes more powerful.
If input prompts are growing and TTFT is slipping, adding decode capacity may do very little. The pressure is before the first token. The platform may need more prefill workers, better cache routing, or less queueing before prefill.
If active streams are piling up and TPOT is slipping, adding prefill capacity may not help. The pressure is in the decode loop. The platform may need more decode workers or different batching behavior.
This phase-specific diagnosis is the difference between scaling and flailing.
The Dynamo architecture docs describe Planner computing prefill and decode targets. The Kubernetes realization also models independent prefill/decode elasticity through separate scaling groups in disaggregated deployments. That matters because the control plane needs a place to express the decision.
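A toy version of that decision, to show its shape. Real Planner targets come from engine performance data and load signals; the step size of one replica per breached phase here is purely an assumption.

```python
def plan_disaggregated(ttft_p95: float, tpot_p95: float,
                       ttft_slo: float, tpot_slo: float,
                       prefill_replicas: int, decode_replicas: int) -> tuple[int, int]:
    """Grow the pool whose phase is actually failing, not just 'the service'."""
    new_prefill, new_decode = prefill_replicas, decode_replicas
    if ttft_p95 > ttft_slo:
        # First-token pain lives before decode: prefill, queueing, cache routing.
        new_prefill += 1
    if tpot_p95 > tpot_slo:
        # Streaming pain lives in the decode loop.
        new_decode += 1
    return new_prefill, new_decode
```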
The first autoscaling smell: high utilization, bad latency
High utilization sounds efficient. It can also mean the service is running too close to the edge.
For offline batch workloads, high utilization is the goal. For interactive inference, utilization has to share the room with latency. A fully occupied decode pool may be economically beautiful and experientially terrible. A prefill pool that runs hot may create TTFT spikes users interpret as brokenness.
The goal is not “keep GPUs at 100 percent.” The goal is “deliver correct tokens inside the product latency envelope at the best practical cost.”
Those are not the same sentence.
An inference autoscaler should therefore watch:
- TTFT by percentile and traffic class
- TPOT or ITL by percentile and traffic class
- request queue age
- prefill tokens in flight
- active decode blocks
- KV cache utilization and eviction behavior
- cache hit rate
- cancellation rate
- load shedding or 503 rate
- per-model and per-tenant saturation
The exact list depends on the stack, but the principle is stable: scale by the pain the user and runtime actually feel.
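As a sketch of what that signal bundle might look like in code; the field names are hypothetical, not actual Prometheus metrics from any stack:

```python
from dataclasses import dataclass


@dataclass
class InferenceSignals:
    """Illustrative signal bundle; names are invented, not real exporter metrics."""
    ttft_p95_s: float
    tpot_p95_s: float
    queue_age_p95_s: float
    prefill_tokens_in_flight: int
    active_decode_streams: int
    kv_cache_utilization: float   # 0.0 .. 1.0
    cache_hit_rate: float         # 0.0 .. 1.0
    shed_rate: float              # fraction of requests shed or 503'd


def needs_attention(s: InferenceSignals,
                    ttft_slo: float = 0.5, tpot_slo: float = 0.05) -> bool:
    # Latency pain or runtime pressure both count, even when GPU utilization looks "healthy".
    return (s.ttft_p95_s > ttft_slo or s.tpot_p95_s > tpot_slo
            or s.kv_cache_utilization > 0.90 or s.shed_rate > 0.0)
```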
The second autoscaling smell: averages look fine
Average latency is where bad tail behavior hides in a hoodie.
If most requests are tiny and a minority are enormous agentic tasks, the average can look acceptable while high-value users experience painful latency. If a few tenants have giant contexts, platform-wide metrics can hide their pain. If decode queues only hurt long-running streams, first-token averages may look cheerful.
Use percentiles. Segment by model, route, tenant class, prompt length, output length, and cache status. Then ask which segment is failing and which phase is constrained.
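A minimal sketch of that segmentation, assuming each latency sample already carries its own labels (the sample keys are invented for illustration):

```python
from collections import defaultdict


def p95(values: list[float]) -> float:
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]


def ttft_p95_by_segment(samples: list[dict]) -> dict[tuple, float]:
    """Report the tail per (model, tenant class, cache hit) segment, never a global average."""
    buckets: dict[tuple, list[float]] = defaultdict(list)
    for s in samples:
        key = (s["model"], s["tenant_class"], s["cache_hit"])
        buckets[key].append(s["ttft_s"])
    return {key: p95(vals) for key, vals in buckets.items()}
```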
Dynamo’s direction is helpful because it introduces inference-native signals into the control loop. The more the planner knows about engine performance, queue state, KV cache utilization, and forward-pass metrics, the less it has to pretend all requests are equal.
Be honest about limitations
No autoscaler gets a free pass.
The Planner docs list important limitations. For example, load-based scaling depends on ForwardPassMetrics, and, at the time of the documentation, their availability for vLLM is tied to specific instrumentation conditions. The docs also warn about in-flight requests during scale-down: when a worker is terminated, in-flight requests may fail, including disaggregated cases where decode workers are waiting on KV transfers from a terminated prefill worker.
That limitation is not a footnote. It is production design material.
Scale-up gets the glory. Scale-down causes the weird tickets.
Practical mitigations include:
- keep a steady-state minimum endpoint count
- make scale-down less aggressive than scale-up
- drain before terminating where the stack supports it
- separate user-facing and background workloads
- measure failed requests during scaling events
- test disaggregated scale-down with real KV transfers, not just happy-path synthetic traffic
An autoscaler that protects cost but creates failed generations is not done.
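For the asymmetry between scale-up and scale-down, a small policy sketch; the cooldown value and one-replica step are assumptions, not recommendations from the docs:

```python
import time


class ConservativeScaleDown:
    """Scale up immediately; scale down only after a sustained quiet window, one replica at a time."""

    def __init__(self, min_replicas: int, cooldown_s: float = 600.0):
        self.min_replicas = min_replicas
        self.cooldown_s = cooldown_s
        self._below_since: float | None = None

    def decide(self, current: int, desired: int, now: float | None = None) -> int:
        now = time.monotonic() if now is None else now
        if desired >= current:
            self._below_since = None
            return desired                      # scale up without hesitation
        if self._below_since is None:
            self._below_since = now
        if now - self._below_since < self.cooldown_s:
            return current                      # hold: the burst may come back
        # Step down slowly so drains and pending KV transfers can complete.
        self._below_since = None
        return max(self.min_replicas, current - 1)
```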
A rollout pattern that does not scare me
For a serious platform, I would phase inference autoscaling like this.
Phase 1: Observe only. Collect TTFT, TPOT or ITL, queue age, cache hit rate, prefill and decode pressure, and route decisions. Do not scale yet. Learn the workload.
Phase 2: Static guardrails. Define minimum replicas, maximum replicas, scale-up limits, scale-down windows, and SLO thresholds. Make the blast radius small. A small guardrail sketch follows Phase 6 below.
Phase 3: Throughput planning. Use pre-deployment profiling or benchmark data to set a stable capacity floor for known traffic patterns.
Phase 4: Load reaction. Add load-based scaling for bursts, with conservative scale-down.
Phase 5: Phase-specific tuning. In disaggregated deployments, let prefill and decode pools scale according to the phase actually failing.
Phase 6: Cost refinement. Only after SLOs are stable should the team tune for lower cost per token.
Cost optimization before latency correctness is how you build a cheaper bad experience.
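The Phase 2 guardrails can be boring, explicit configuration. A minimal sketch; the field names and defaults are illustrative assumptions, not a Dynamo schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailConfig:
    """Illustrative Phase 2 guardrails; names and defaults are assumptions."""
    min_replicas: int = 2
    max_replicas: int = 16
    max_scale_up_step: int = 4        # replicas added per adjustment, at most
    scale_down_cooldown_s: int = 600  # quiet window before removing capacity
    ttft_slo_s: float = 0.5
    tpot_slo_s: float = 0.05


def clamp(desired: int, current: int, cfg: GuardrailConfig) -> int:
    """Keep any autoscaling decision inside the guardrails, whichever loop produced it."""
    upper = min(cfg.max_replicas, current + cfg.max_scale_up_step)
    return max(cfg.min_replicas, min(desired, upper))
```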
What this means for architecture reviews
The old review question was:
“How many replicas do we need?”
The better question is:
“What token SLO are we protecting, for which workload segment, and which phase gets scaled when it fails?”
That question forces clarity.
It forces teams to define TTFT and per-token latency targets. It forces them to separate prefill from decode pressure. It forces them to consider cache behavior. It forces them to document whether the system should drop, queue, shed, or degrade when overloaded.
It also prevents the very common mistake of treating model serving as an HTTP deployment with GPUs attached.
The practical takeaway
Autoscaling LLMs by CPU utilization is like driving by listening to engine noise from the next street over. You may infer something, but you are not looking at the road.
For LLM serving, the road is token latency.
TTFT tells you how long users wait before the model shows signs of life. TPOT or ITL tells you whether generation continues smoothly. Cache and phase metrics explain why those numbers move. Planner-style control loops turn those signals into capacity.
That is the right direction for the AI cloud: not more autoscaling theater, but feedback loops based on the units users actually experience.
Tokens are the product.
Scale for them.
Sources and receipts
- Dynamo Planner scaling modes, targets, and limitations: Planner.
- Dynamo introduction on Planner and TTFT/TPOT SLA framing: Introduction to Dynamo.
- Architecture notes on Planner control loops and prefill/decode targets: Overall Architecture.
- Disaggregated serving inputs for TTFT and TPOT: Disaggregated Serving.
