Speculative Decoding in Production: When Draft Tokens Help and When They Hurt

Speculative decoding is one of those ideas that sounds like cheating until you read the trick.

Instead of asking the large model to produce one token at a time, ask a smaller or cheaper draft process to guess a few tokens ahead. Then ask the large model to verify those guesses in parallel. If the guesses are accepted, you moved faster. If they are rejected, you fall back safely.

The original speculative decoding paper showed that this can accelerate autoregressive inference without changing the output distribution, under the algorithm’s assumptions. That last clause matters. In production, assumptions are where latency bugs like to hide.

The basic loop

[Figure: Speculative decoding loop — guess, verify, emit. The draft model proposes N tokens, the target model verifies them, accepted tokens are emitted, and rejected tokens fall back. The speedup comes from accepting multiple drafted tokens per target pass.]

The draft model is not allowed to invent a new distribution. It is auditioning tokens for the target model.

The speedup depends on acceptance rate and overhead. If the draft is accurate and cheap, you win. If the draft is wrong too often or expensive to run, you added another moving part and bought very little speedup for it.
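
For intuition, here is a toy version of the loop in Python. The draft and target distributions are stand-ins, and real engines verify all drafted tokens in a single batched target forward pass rather than token by token, but the accept/reject rule is the one from the original paper: accept a drafted token with probability min(1, p_target/p_draft), and resample from the normalized residual on rejection so the output distribution stays the target's.

    import random

    VOCAB = ["the", "cat", "sat", "on", "mat"]

    def draft_dist(context):
        # Toy stand-in for the cheap draft model's next-token distribution.
        return {t: 1.0 / len(VOCAB) for t in VOCAB}

    def target_dist(context):
        # Toy stand-in for the target model; slightly opinionated so it can disagree.
        probs = {t: 1.0 for t in VOCAB}
        probs[VOCAB[len(context) % len(VOCAB)]] = 3.0
        z = sum(probs.values())
        return {t: p / z for t, p in probs.items()}

    def sample(dist):
        return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

    def speculative_step(context, k=4):
        """Draft k tokens, verify them against the target, return the emitted tokens."""
        ctx, drafted = list(context), []
        for _ in range(k):
            tok = sample(draft_dist(ctx))
            drafted.append(tok)
            ctx.append(tok)

        ctx, emitted = list(context), []
        for tok in drafted:
            p, q = target_dist(ctx), draft_dist(ctx)
            # Accept with probability min(1, p/q); this preserves the target distribution.
            if random.random() < min(1.0, p[tok] / q[tok]):
                emitted.append(tok)
                ctx.append(tok)
            else:
                # On rejection, resample from the normalized residual (p - q)+ and stop.
                residual = {t: max(p[t] - q[t], 0.0) for t in VOCAB}
                z = sum(residual.values())  # > 0 whenever a rejection can occur
                emitted.append(sample({t: r / z for t, r in residual.items()}))
                break
        else:
            # Every draft accepted: the verification pass also yields one bonus token.
            emitted.append(sample(target_dist(ctx)))
        return emitted

    print(speculative_step(["the"], k=4))

Each call to speculative_step costs one target verification; the win is that it can emit up to k + 1 tokens instead of one.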

What to measure

Do not roll out speculative decoding with only “tokens per second improved” on the dashboard.

Measure:

  • draft tokens proposed per step
  • accepted tokens per verification
  • rejection rate by prompt type
  • target model forward passes saved
  • end-to-end latency
  • time to first token (TTFT) impact
  • time per output token (TPOT) impact
  • GPU memory overhead
  • quality parity
  • failure cases by model and sampling config

Acceptance rate is the heartbeat. A high acceptance rate means the draft path matches the target well for your workload. A low acceptance rate means you are running a tiny model mostly to be told “nice try.”
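
A sketch of the bookkeeping, with hypothetical names rather than any particular framework's API; the point is that drafted tokens, accepted tokens, and verification passes are separate counters, segmented by prompt class from day one.

    from collections import defaultdict
    from dataclasses import dataclass, field

    @dataclass
    class SpecDecodeStats:
        # Per-prompt-class counters; names here are illustrative, not a standard API.
        drafted: dict = field(default_factory=lambda: defaultdict(int))
        accepted: dict = field(default_factory=lambda: defaultdict(int))
        target_passes: dict = field(default_factory=lambda: defaultdict(int))

        def record(self, prompt_class: str, drafted: int, accepted: int) -> None:
            self.drafted[prompt_class] += drafted
            self.accepted[prompt_class] += accepted
            self.target_passes[prompt_class] += 1  # one verification pass per step

        def acceptance_rate(self, prompt_class: str) -> float:
            d = self.drafted[prompt_class]
            return self.accepted[prompt_class] / d if d else 0.0

        def accepted_per_pass(self, prompt_class: str) -> float:
            p = self.target_passes[prompt_class]
            return self.accepted[prompt_class] / p if p else 0.0

    stats = SpecDecodeStats()
    stats.record("code_completion", drafted=4, accepted=3)
    stats.record("creative_writing", drafted=4, accepted=1)
    print(stats.acceptance_rate("code_completion"), stats.accepted_per_pass("creative_writing"))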

Where it helps

Speculative decoding tends to help when:

  • generation is long enough for decode speed to matter
  • the draft model is much cheaper than the target
  • the target model can verify multiple candidates efficiently
  • outputs are predictable enough for high acceptance
  • the serving engine supports the pattern cleanly

It is particularly interesting for latency-sensitive chat and coding workloads where the decode loop dominates user experience.

NVIDIA TensorRT-LLM, vLLM, and SGLang all have speculative decoding stories. SGLang documents EAGLE-style decoding paths. vLLM documents speculative decoding configuration. NVIDIA has published optimization work around speculative decoding in TensorRT-LLM. This is no longer a lab-only trick; it is entering the standard serving toolbox.
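
For concreteness, here is roughly what enabling it looks like with vLLM's offline API. Treat this as a sketch: the exact argument names have shifted across vLLM releases (older versions took flat speculative_model / num_speculative_tokens arguments, newer ones a speculative_config dict), and the model names below are placeholders, so check the docs for the version you actually run.

    from vllm import LLM, SamplingParams

    # Sketch only: argument names vary by vLLM version; model choices are placeholders.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",        # target model
        speculative_config={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # cheap draft model
            "num_speculative_tokens": 5,                  # speculation depth
        },
    )

    params = SamplingParams(temperature=0.0, max_tokens=128)
    print(llm.generate(["Explain KV caching in two sentences."], params)[0].outputs[0].text)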

Where it hurts

Speculative decoding can hurt when:

  • prompts are short and prefill dominates
  • outputs are very creative or high-temperature
  • draft and target tokenizers or vocabularies complicate verification
  • the draft model consumes memory that would have been used for batch size
  • operational metrics do not distinguish drafted, accepted, and rejected tokens
  • quality changes sneak in through an implementation mismatch

[Figure: Speculative decoding tradeoff. Speedup rises with acceptance rate and falls with overhead; the speedup lives between accepted drafts and overhead.]
Speculative decoding is great when the draft is right often enough. Otherwise it is an intern with a stopwatch.

Acceptance-rate math without drama

The mental model is simple:

value = accepted drafted tokens - draft overhead - verification overhead

A 90% acceptance rate can still be bad if the draft path consumes scarce memory and reduces batch size. A 60% acceptance rate can still be useful if the draft is extremely cheap and decode dominates the workload. The acceptance number is not the decision by itself; it is the first number that tells you where to look.
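
To make that concrete, here is a back-of-the-envelope model. It assumes each of the k drafted tokens is accepted independently with probability alpha and generation stops at the first rejection, the simplification used in the original paper's analysis; the draft cost ratio c is something you have to measure on your own stack, and the formula ignores memory pressure and batching effects entirely.

    def expected_tokens_per_pass(alpha: float, k: int) -> float:
        """Expected tokens emitted per target verification pass.

        Assumes each of the k drafted tokens is accepted i.i.d. with probability
        alpha, generation stops at the first rejection, and the pass always emits
        one token from the target itself: (1 - alpha**(k + 1)) / (1 - alpha).
        """
        if alpha >= 1.0:
            return k + 1.0
        return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

    def rough_speedup(alpha: float, k: int, c: float) -> float:
        """Tokens per unit cost vs. plain decoding, where c is the draft cost per
        token as a fraction of one target pass. Ignores memory and batching."""
        tokens = expected_tokens_per_pass(alpha, k)
        cost = 1.0 + c * k  # one target pass plus k drafted tokens
        return tokens / cost

    for alpha in (0.6, 0.9):
        print(alpha, round(rough_speedup(alpha, k=4, c=0.05), 2))

With k = 4 and a draft that costs 5% of a target pass, 90% acceptance gives roughly 3.4x and 60% acceptance still gives roughly 1.9x, which is exactly the "cheap draft, decode-dominated workload" case above.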

Segment it by prompt class. Code completion, customer support, summarization, tool-using agents, and high-temperature creative writing will not behave the same. One global acceptance-rate chart is a very efficient way to hide the useful answer.

Production rollout

Roll it out like this:

  1. Start with one model pair and one workload.
  2. Run A/B traffic with identical prompts.
  3. Compare output quality and determinism where relevant.
  4. Track accepted tokens separately from drafted tokens.
  5. Add per-prompt-class acceptance metrics.
  6. Cap speculation depth until rejection behavior is understood.
  7. Watch GPU memory pressure.
  8. Keep an easy kill switch.

The kill switch matters. Speculative decoding is sensitive to workload shape. A release that changes prompt format, sampling parameters, or model version can change acceptance behavior.
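
The guard itself can be boring, which is the point. A sketch with hypothetical flag and function names: gate the speculative path behind a per-class allowlist and a traffic fraction, and let a single config value drop everything back to plain decoding.

    import random

    # Hypothetical rollout config; names are illustrative, not a real serving API.
    CONFIG = {
        "spec_decode_enabled": True,                 # the kill switch
        "spec_decode_traffic": 0.10,                 # A/B fraction during rollout
        "spec_decode_classes": {"code_completion"},  # start with one workload
        "max_speculation_depth": 4,                  # cap depth until rejections are understood
    }

    def choose_decode_path(prompt_class: str) -> str:
        if not CONFIG["spec_decode_enabled"]:
            return "baseline"
        if prompt_class not in CONFIG["spec_decode_classes"]:
            return "baseline"
        if random.random() >= CONFIG["spec_decode_traffic"]:
            return "baseline"
        return "speculative"

    def generate(prompt: str, prompt_class: str) -> str:
        if choose_decode_path(prompt_class) == "speculative":
            # Placeholder for the engine's speculative path, with capped depth.
            return f"[speculative, depth<={CONFIG['max_speculation_depth']}] {prompt}"
        return f"[baseline] {prompt}"

    print(generate("Summarize this ticket.", prompt_class="support"))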

The NVIDIA angle

The most interesting NVIDIA-specific part is not simply “TensorRT-LLM supports speculative decoding.” It is that speculative decoding combines with lower precision, better attention kernels, faster interconnects, and scheduling. On Blackwell-class systems, the value of verifying more work per target pass improves when the surrounding stack can keep the pipeline fed.

But be careful with benchmark translation. A 2x or 3x result in one setup does not mean your agent workload gets 3x cheaper. If your workload is prefill-heavy, cache-miss-heavy, or tool-latency-heavy, speculative decoding may be a minor character.

Use it where decode is the bottleneck. Measure it like a production feature, not a magic trick.

Closing

Speculative decoding is one of the cleanest examples of modern inference engineering: use a little extra machinery to reduce serial work in the decode loop.

It is elegant. It is useful. It is not automatic.

The winning teams will not be the ones who enable the flag first. They will be the ones who know when the flag is paying for itself.

Sources and receipts