Speculative Decoding in Production: When Draft Tokens Help and When They Hurt

Speculative decoding is one of those ideas that sounds like cheating until you read the trick.

Instead of asking the large model to produce one token at a time, ask a smaller or cheaper draft process to guess a few tokens ahead. Then ask the large model to verify those guesses in parallel. If the guesses are accepted, you moved faster. If they are rejected, you fall back safely.

The original speculative decoding paper showed that this can accelerate autoregressive inference without changing the output distribution, under the algorithm’s assumptions. That last clause matters. In production, assumptions are where latency bugs like to hide.

The basic loop

[Figure: Speculative decoding loop — guess, verify, emit. The draft model proposes N tokens, the target model verifies them, accepted tokens are emitted, and rejected tokens fall back. The speedup comes from accepting multiple drafted tokens per target pass.]

The draft model is not allowed to invent a new distribution. It is auditioning tokens for the target model.

The speedup depends on acceptance rate and overhead. If the draft is accurate and cheap, you win. If the draft is wrong too often or expensive to run, you added another moving part and bought very little speedup for it.
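
For intuition, here is a toy version of the loop in Python. The draft and target distributions are stand-ins, and real engines verify all drafted tokens in a single batched target forward pass rather than token by token, but the accept/reject rule is the one from the original paper: accept a drafted token with probability min(1, p_target/p_draft), and resample from the normalized residual on rejection so the output distribution stays the target's.

    import random

    VOCAB = ["the", "cat", "sat", "on", "mat"]

    def draft_dist(context):
        # Toy stand-in for the cheap draft model's next-token distribution.
        return {t: 1.0 / len(VOCAB) for t in VOCAB}

    def target_dist(context):
        # Toy stand-in for the target model; slightly opinionated so it can disagree.
        probs = {t: 1.0 for t in VOCAB}
        probs[VOCAB[len(context) % len(VOCAB)]] = 3.0
        z = sum(probs.values())
        return {t: p / z for t, p in probs.items()}

    def sample(dist):
        return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

    def speculative_step(context, k=4):
        """Draft k tokens, verify them against the target, return the emitted tokens."""
        ctx, drafted = list(context), []
        for _ in range(k):
            tok = sample(draft_dist(ctx))
            drafted.append(tok)
            ctx.append(tok)

        ctx, emitted = list(context), []
        for tok in drafted:
            p, q = target_dist(ctx), draft_dist(ctx)
            # Accept with probability min(1, p/q); this preserves the target distribution.
            if random.random() < min(1.0, p[tok] / q[tok]):
                emitted.append(tok)
                ctx.append(tok)
            else:
                # On rejection, resample from the normalized residual (p - q)+ and stop.
                residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
                z = sum(residual.values())  # > 0 whenever a rejection can occur
                emitted.append(sample({t: r / z for t, r in residual.items()}))
                break
        else:
            # Every draft accepted: the verification pass also yields one bonus token.
            emitted.append(sample(target_dist(ctx)))
        return emitted

    print(speculative_step(["the"], k=4))

Each call to speculative_step costs one target verification; the win is that it can emit up to k + 1 tokens instead of one.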

What to measure

Do not roll out speculative decoding with only “tokens per second improved” on the dashboard.

Measure:

  • draft tokens proposed per step
  • accepted tokens per verification
  • rejection rate by prompt type
  • target model forward passes saved
  • end-to-end latency
  • time to first token (TTFT) impact
  • time per output token (TPOT) impact
  • GPU memory overhead
  • quality parity
  • failure cases by model and sampling config

Acceptance rate is the heartbeat. A high acceptance rate means the draft path matches the target well for your workload. A low acceptance rate means you are running a tiny model mostly to be told “nice try.”
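
A sketch of the bookkeeping, with hypothetical names rather than any particular framework's API; the point is that drafted tokens, accepted tokens, and verification passes are separate counters, segmented by prompt class from day one.

    from collections import defaultdict
    from dataclasses import dataclass, field

    @dataclass
    class SpecDecodeStats:
        # Per-prompt-class counters; names here are illustrative, not a standard API.
        drafted: dict = field(default_factory=lambda: defaultdict(int))
        accepted: dict = field(default_factory=lambda: defaultdict(int))
        target_passes: dict = field(default_factory=lambda: defaultdict(int))

        def record(self, prompt_class: str, drafted: int, accepted: int) -> None:
            self.drafted[prompt_class] += drafted
            self.accepted[prompt_class] += accepted
            self.target_passes[prompt_class] += 1  # one verification pass per step

        def acceptance_rate(self, prompt_class: str) -> float:
            d = self.drafted[prompt_class]
            return self.accepted[prompt_class] / d if d else 0.0

        def accepted_per_pass(self, prompt_class: str) -> float:
            p = self.target_passes[prompt_class]
            return self.accepted[prompt_class] / p if p else 0.0

    stats = SpecDecodeStats()
    stats.record("code_completion", drafted=4, accepted=3)
    stats.record("creative_writing", drafted=4, accepted=1)
    print(stats.acceptance_rate("code_completion"), stats.accepted_per_pass("creative_writing"))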

Where it helps

Speculative decoding tends to help when:

  • generation is long enough for decode speed to matter
  • the draft model is much cheaper than the target
  • the target model can verify multiple candidates efficiently
  • outputs are predictable enough for high acceptance
  • the serving engine supports the pattern cleanly

It is particularly interesting for latency-sensitive chat and coding workloads where the decode loop dominates user experience.

NVIDIA TensorRT-LLM, vLLM, and SGLang all have speculative decoding stories. SGLang documents EAGLE-style decoding paths. vLLM documents speculative decoding configuration. NVIDIA has published optimization work around speculative decoding in TensorRT-LLM. This is no longer a lab-only trick; it is entering the standard serving toolbox.
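
For concreteness, here is roughly what enabling it looks like with vLLM's offline API. Treat this as a sketch: the exact argument names have shifted across vLLM releases (older versions took flat speculative_model / num_speculative_tokens arguments, newer ones a speculative_config dict), and the model names below are placeholders, so check the docs for the version you actually run.

    from vllm import LLM, SamplingParams

    # Sketch only: argument names vary by vLLM version; model choices are placeholders.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",        # target model
        speculative_config={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # cheap draft model
            "num_speculative_tokens": 5,                  # speculation depth
        },
    )

    params = SamplingParams(temperature=0.0, max_tokens=128)
    print(llm.generate(["Explain KV caching in two sentences."], params)[0].outputs[0].text)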

Where it hurts

Speculative decoding can hurt when:

  • prompts are short and prefill dominates
  • outputs are very creative or high-temperature
  • draft and target tokenizers or vocabularies complicate verification
  • the draft model consumes memory that would have been used for batch size
  • operational metrics do not distinguish drafted, accepted, and rejected tokens
  • quality changes sneak in through an implementation mismatch

[Figure: Speculative decoding tradeoff. Speedup rises with acceptance rate and falls with overhead; the speedup lives between accepted drafts and overhead.]
Speculative decoding is great when the draft is right often enough. Otherwise it is an intern with a stopwatch.

Acceptance-rate math without drama

The mental model is simple:

value = accepted drafted tokens - draft overhead - verification overhead

A 90% acceptance rate can still be bad if the draft path consumes scarce memory and reduces batch size. A 60% acceptance rate can still be useful if the draft is extremely cheap and decode dominates the workload. The acceptance number is not the decision by itself; it is the first number that tells you where to look.
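
To make that concrete, here is a back-of-the-envelope model. It assumes each of the k drafted tokens is accepted independently with probability alpha and generation stops at the first rejection, the simplification used in the original paper's analysis; the draft cost ratio c is something you have to measure on your own stack, and the formula ignores memory pressure and batching effects entirely.

    def expected_tokens_per_pass(alpha: float, k: int) -> float:
        """Expected tokens emitted per target verification pass.

        Assumes each of the k drafted tokens is accepted i.i.d. with probability
        alpha, generation stops at the first rejection, and the pass always emits
        one token from the target itself: (1 - alpha**(k + 1)) / (1 - alpha).
        """
        if alpha >= 1.0:
            return k + 1.0
        return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

    def rough_speedup(alpha: float, k: int, c: float) -> float:
        """Tokens per unit cost vs. plain decoding, where c is the draft cost per
        token as a fraction of one target pass. Ignores memory and batching."""
        tokens = expected_tokens_per_pass(alpha, k)
        cost = 1.0 + c * k  # one target pass plus k drafted tokens
        return tokens / cost

    for alpha in (0.6, 0.9):
        print(alpha, round(rough_speedup(alpha, k=4, c=0.05), 2))

With k = 4 and a draft that costs 5% of a target pass, 90% acceptance gives roughly 3.4x and 60% acceptance still gives roughly 1.9x, which is exactly the "cheap draft, decode-dominated workload" case above.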

Segment it by prompt class. Code completion, customer support, summarization, tool-using agents, and high-temperature creative writing will not behave the same. One global acceptance-rate chart is a very efficient way to hide the useful answer.

Production rollout

Roll it out like this:

  1. Start with one model pair and one workload.
  2. Run A/B traffic with identical prompts.
  3. Compare output quality and determinism where relevant.
  4. Track accepted tokens separately from drafted tokens.
  5. Add per-prompt-class acceptance metrics.
  6. Cap speculation depth until rejection behavior is understood.
  7. Watch GPU memory pressure.
  8. Keep an easy kill switch.

The kill switch matters. Speculative decoding is sensitive to workload shape. A release that changes prompt format, sampling parameters, or model version can change acceptance behavior.
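
The guard itself can be boring, which is the point. A sketch with hypothetical flag and function names: gate the speculative path behind a per-class allowlist and a traffic fraction, and let a single config value drop everything back to plain decoding.

    import random

    # Hypothetical rollout config; names are illustrative, not a real serving API.
    CONFIG = {
        "spec_decode_enabled": True,                 # the kill switch
        "spec_decode_traffic": 0.10,                 # A/B fraction during rollout
        "spec_decode_classes": {"code_completion"},  # start with one workload
        "max_speculation_depth": 4,                  # cap depth until rejections are understood
    }

    def choose_decode_path(prompt_class: str) -> str:
        if not CONFIG["spec_decode_enabled"]:
            return "baseline"
        if prompt_class not in CONFIG["spec_decode_classes"]:
            return "baseline"
        if random.random() >= CONFIG["spec_decode_traffic"]:
            return "baseline"
        return "speculative"

    def generate(prompt: str, prompt_class: str) -> str:
        if choose_decode_path(prompt_class) == "speculative":
            # Placeholder for the engine's speculative path, with capped depth.
            return f"[speculative, depth<={CONFIG['max_speculation_depth']}] {prompt}"
        return f"[baseline] {prompt}"

    print(generate("Summarize this ticket.", prompt_class="support"))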

The NVIDIA angle

The most interesting NVIDIA-specific part is not simply “TensorRT-LLM supports speculative decoding.” It is that speculative decoding combines with lower precision, better attention kernels, faster interconnects, and scheduling. On Blackwell-class systems, the value of verifying more work per target pass improves when the surrounding stack can keep the pipeline fed.

But be careful with benchmark translation. A 2x or 3x result in one setup does not mean your agent workload gets 3x cheaper. If your workload is prefill-heavy, cache-miss-heavy, or tool-latency-heavy, speculative decoding may be a minor character.

Use it where decode is the bottleneck. Measure it like a production feature, not a magic trick.

Closing

Speculative decoding is one of the cleanest examples of modern inference engineering: use a little extra machinery to reduce serial work in the decode loop.

It is elegant. It is useful. It is not automatic.

The winning teams will not be the ones who enable the flag first. They will be the ones who know when the flag is paying for itself.

Sources and receipts