Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge

A large language model is often slow at decoding not because one token is impossibly expensive, but because tokens must normally be produced one after another. Speculative decoding asks a cheaper model to sketch several likely next tokens, then lets the target model verify the whole sketch in one pass.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Think of a junior editor drafting the next sentence while a senior editor reviews several words at once. Accepted words move straight through. At the first rejected word, the senior editor corrects the draft and the process starts again.

What actually happens

The draft model proposes a short token sequence. The target model evaluates those proposed positions in parallel. A rejection-sampling rule accepts a prefix of the proposal and, at the first rejected position, samples a correction from the residual distribution. The original algorithm preserves the target model's output distribution rather than approximating it.

Speed depends on the acceptance length: how many proposed tokens survive per target-model verification. A draft that is fast but consistently wrong adds overhead. A draft that is almost as expensive as the target saves little. The useful point lies between those extremes.

Serving adds complications. Draft and target tokenizers must align; sampling settings must be handled consistently; KV state for accepted and rejected branches must be committed or discarded correctly; and batching must accommodate requests with different acceptance lengths.

A worked example

Let the draft propose four tokens. If the target accepts three on average, one expensive verification advances generation by roughly three tokens instead of one. But suppose drafting costs 40 percent of a target pass, verification costs about one target pass, and a round advances only 1.2 tokens. Each round then costs about 1.4 target passes for 1.2 tokens, roughly 1.17 passes per token, which is slower than plain decoding. Acceptance alone is not the goal; accepted tokens per unit of combined draft-plus-verify time is.

The performance model

A useful approximation is speedup = (target time per token) / ((draft time + verification time) / tokens committed per verification round). Memory bandwidth, batch size, target-model utilization, and draft placement all affect that denominator. Benchmark end-to-end TPOT (time per output token), not only acceptance percentage.
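
The approximation can be written as a few lines of arithmetic. The sketch below is illustrative only: the function name `speedup_estimate` and its parameters are assumptions, all costs are expressed as fractions of one ordinary target decode step, and verification is assumed to cost one such step. It reuses the numbers from the worked example, applying the 40 percent draft cost to both cases.

```python
def speedup_estimate(draft_cost: float, verify_cost: float, tokens_per_round: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost: time to draft the whole block, in target decode steps.
    verify_cost: time for one verification pass, in target decode steps.
    tokens_per_round: average tokens committed per verification round.
    """
    if tokens_per_round <= 0:
        raise ValueError("tokens_per_round must be positive")
    # Plain decoding costs 1.0 target step per token by definition.
    cost_per_token = (draft_cost + verify_cost) / tokens_per_round
    return 1.0 / cost_per_token


# Worked-example numbers; verify_cost=1.0 is an assumption, not a measurement.
good = speedup_estimate(draft_cost=0.4, verify_cost=1.0, tokens_per_round=3.0)
bad = speedup_estimate(draft_cost=0.4, verify_cost=1.0, tokens_per_round=1.2)
print(f"3.0 tokens per round: {good:.2f}x")
print(f"1.2 tokens per round: {bad:.2f}x")
```

With these inputs the first case lands a little above 2x and the second below 1x, which is the slowdown the worked example warns about. A real benchmark replaces every input with measured values under the actual batch size and draft placement.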

Expert lens

Speculation can reduce single-request latency while hurting server throughput if verification batches become irregular or the draft competes for the same scarce GPU. Some systems colocate the draft, others reserve a smaller GPU, and self-speculative methods reuse early layers. The scheduler is part of the algorithm.

Speculation does not remove work; it moves it. The system spends extra draft compute and temporary KV memory to reduce the number of serial target passes a request waits on.

Where it wins

Latency-sensitive generation with a well-matched draft
Memory-bound target-model decode
Domains where local token patterns are predictable

Where it disappoints

Drafts chosen for size alone, on the assumption that smaller is automatically better
Quality comparisons run with mismatched sampling parameters
Throughput calculations that ignore draft GPU cost
Implementations that fail to roll back rejected KV branches

Production checklist

Measure accepted tokens per verification by workload
Include draft and target cost in the benchmark
Validate distributional equivalence for sampling modes
Stress mixed batches with divergent acceptance lengths
Export draft latency, verify latency, and rejection position

What to measure

Mean and percentile acceptance length
Target forward passes per output token
End-to-end TPOT and tokens per second
Draft and target GPU utilization
Rollback tokens and rejected KV bytes
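
Several of these metrics fall out of a simple per-round log. The following is a minimal sketch, assuming the serving loop can report three counts per verification round; the class `RoundLog` and its field names are hypothetical, and GPU utilization and KV byte counts are left out because they come from the runtime rather than from token counts.

```python
from dataclasses import dataclass, field
from statistics import mean, quantiles


@dataclass
class RoundLog:
    drafted: list = field(default_factory=list)    # tokens drafted per round
    accepted: list = field(default_factory=list)   # draft tokens accepted per round
    committed: list = field(default_factory=list)  # tokens committed per round

    def add(self, drafted: int, accepted: int, committed: int) -> None:
        self.drafted.append(drafted)
        self.accepted.append(accepted)
        self.committed.append(committed)

    def summary(self) -> dict:
        # Needs at least two rounds, because quantiles() requires two data points.
        return {
            "mean_accept_len": mean(self.accepted),
            # First of nine decile cut points, i.e. the 10th percentile.
            "p10_accept_len": quantiles(self.accepted, n=10, method="inclusive")[0],
            # One round is one target forward pass.
            "target_passes_per_token": len(self.committed) / sum(self.committed),
            # Drafted tokens that were thrown away.
            "rollback_tokens": sum(d - a for d, a in zip(self.drafted, self.accepted)),
        }


log = RoundLog()
for drafted, accepted, committed in [(4, 3, 4), (4, 0, 1), (4, 4, 5), (4, 1, 2)]:
    log.add(drafted, accepted, committed)
stats = log.summary()
print(stats)
```

The low percentile matters as much as the mean: a workload whose acceptance length often collapses to zero pays the full draft cost on those rounds.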

From one GPU to a production service

A lab benchmark usually runs one prompt with a draft and target already resident. A production service handles changing batch sizes, temperatures, adapters, and domains. Eligibility should be explicit: some requests speculate, some use ordinary decoding, and the scheduler must combine them without letting draft work delay non-speculative traffic.

Draft selection can be a routing problem. Code, conversational prose, and structured extraction may have different acceptance profiles. A service can maintain acceptance and cost statistics by model pair and workload class, then disable speculation when the expected gain falls below a safety margin.
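
One way to picture that gate is a small table of running averages keyed by model pair and workload class. This is a sketch under stated assumptions, not a scheduler design: `SpeculationGate`, `PairStats`, the exponential decay, and the default thresholds are all hypothetical, and round cost is measured in target decode steps.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PairStats:
    """Running averages for one (model pair, workload class) key."""
    tokens_per_round: float = 1.0  # tokens committed per verification
    round_cost: float = 1.0        # draft + verify time, in target decode steps
    samples: int = 0


class SpeculationGate:
    def __init__(self, safety_margin: float = 1.1, decay: float = 0.9, min_samples: int = 20):
        self.safety_margin = safety_margin
        self.decay = decay
        self.min_samples = min_samples
        self.stats = defaultdict(PairStats)

    def record(self, key, tokens_committed: int, round_cost: float) -> None:
        s = self.stats[key]
        w = self.decay if s.samples else 0.0  # the first sample replaces the defaults
        s.tokens_per_round = w * s.tokens_per_round + (1 - w) * tokens_committed
        s.round_cost = w * s.round_cost + (1 - w) * round_cost
        s.samples += 1

    def should_speculate(self, key) -> bool:
        s = self.stats[key]
        if s.samples < self.min_samples:
            return True  # assumption: keep speculating until there is enough evidence
        expected_speedup = s.tokens_per_round / s.round_cost
        return expected_speedup >= self.safety_margin


gate = SpeculationGate(min_samples=3)
for _ in range(5):
    gate.record(("draft-a", "code"), tokens_committed=3, round_cost=1.4)
    gate.record(("draft-a", "chat"), tokens_committed=1, round_cost=1.4)
print(gate.should_speculate(("draft-a", "code")), gate.should_speculate(("draft-a", "chat")))
```

Note one gap in the sketch: once a key is disabled it stops producing samples, so a real service would need occasional probe requests to notice when speculation becomes worthwhile again.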

Roll out with shadow verification first: run the draft path, compare committed output and accounting against ordinary target decoding, but do not serve it. Then enable a small traffic slice with hard rollback, latency, and cost guards.

Design-review questions

Is the output distribution exactly preserved for every supported sampler?
Who owns draft capacity during target overload?
How is speculative KV state isolated and reclaimed?
What acceptance level is required to break even?
Can the service disable speculation without reloading the target?

How it connects to the rest of the series

Parallel decoding removes serial steps using multiple heads or candidate trees; early exit can make the target model draft for itself; continuous batching must schedule the variable progress that speculation creates.

From equation to implementation

Let q be the draft distribution and p the target distribution. In the classic sampling construction, the verifier accepts a proposed token x with probability min(1, p(x)/q(x)). On rejection it samples from the residual distribution, which is proportional to max(0, p(x) - q(x)). That correction is why exact speculative sampling preserves p rather than merely choosing whatever the draft suggested.
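
The acceptance rule is short enough to run on toy distributions. The sketch below uses plain Python lists as distributions over a three-token vocabulary; the helper names `sample_from`, `residual`, and `verify_block` are illustrative, not taken from any library. The loop at the end drafts one token from q many times and checks empirically that the committed token follows p.

```python
import random


def sample_from(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def residual(p, q):
    """Normalized max(0, p - q): the distribution sampled after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total is 0 only when p equals q, and then a rejection cannot happen.
    return [d / total for d in diff] if total > 0 else list(p)


def verify_block(draft_tokens, p_dists, q_dists, rng):
    """Return the tokens committed by one verification round.

    p_dists has one more entry than q_dists: the target distribution at the
    position after the last drafted token, used when every draft is accepted.
    """
    committed = []
    for i, x in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # x was sampled from q, so q[x] > 0 and the ratio is well defined.
        if rng.random() < min(1.0, p[x] / q[x]):
            committed.append(x)
        else:
            committed.append(sample_from(residual(p, q), rng))
            return committed  # stop at the first rejection
    # Every draft token was accepted: take one extra token from the target.
    committed.append(sample_from(p_dists[len(draft_tokens)], rng))
    return committed


rng = random.Random(0)
p = [0.6, 0.3, 0.1]  # toy target distribution
q = [0.2, 0.5, 0.3]  # toy draft distribution, deliberately mismatched
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    x = sample_from(q, rng)
    counts[verify_block([x], [p, p], [q], rng)[0]] += 1
freqs = [c / trials for c in counts]
print(freqs)
```

Even though the draft distribution is badly mismatched, the empirical frequencies track p, because accepted mass min(p, q) plus residual mass max(0, p - q) adds up to p at every token.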

Expected progress depends on acceptance across the whole proposed block, because a drafted token counts only if every token before it was accepted. If alpha is a rough per-token acceptance probability and k tokens are drafted, the expected accepted prefix is a truncated geometric sum: alpha + alpha^2 + ... + alpha^k = alpha(1 - alpha^k)/(1 - alpha). Increasing k has diminishing returns when alpha is modest, while verification work and temporary state continue to grow.
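
To see the diminishing returns numerically, the sketch below evaluates the truncated geometric sum for a few block sizes. It assumes a constant, independent per-token acceptance probability, which is a simplification: real acceptance varies by position and context. The function name `expected_accepted` is illustrative.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected accepted prefix length: alpha + alpha**2 + ... + alpha**k."""
    return sum(alpha ** i for i in range(1, k + 1))


alpha = 0.6
for k in (1, 2, 4, 8, 16):
    print(f"k={k:2d}  expected accepted = {expected_accepted(alpha, k):.3f}")
```

For alpha = 0.6 the sum can never exceed alpha / (1 - alpha) = 1.5, so drafting beyond a handful of tokens buys almost nothing while the draft and verification costs keep rising.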

Implementation sketch

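while not finished:
    # The draft proposes k tokens and reports its distribution q at each position.
    draft_tokens, q_probs = draft.propose(prefix, k)
    # One target pass scores all k proposals plus the position after them.
    p_probs = target.verify(prefix, draft_tokens)
    accepted = accept_prefix(draft_tokens, p_probs, q_probs, rng)
    # commit() appends to prefix, so the next round drafts from the new end.
    commit(accepted.tokens)
    if accepted.rejected:
        # First rejection: replace that token with a sample from the residual.
        commit(sample_residual(p_probs, q_probs, rng))
    else:
        # All k accepted: the same pass yields one extra token from the target.
        commit(sample_bonus(p_probs, rng))
    # Drop KV entries for drafted positions that were not committed.
    rollback_uncommitted_kv()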

Capacity planning

Draft placement is part of capacity planning. A draft on the target GPU consumes memory and scheduling slots; a remote draft adds network latency; CPU drafting may be too slow. Calculate speedup using shared end-to-end resources, not an isolated draft benchmark.

Benchmarking without fooling yourself

Use identical random seeds and sampling settings for equivalence tests.
Report acceptance by prompt domain, temperature, and output position.
Measure target passes per committed token and combined GPU-seconds.
Include mixed traffic where only some models or requests speculate.
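
The simplest equivalence test uses greedy decoding, where speculative output must match plain target output token for token. The sketch below is a toy harness, not a serving implementation: `target_next` and `draft_next` are hypothetical stand-ins that map a token list to the next token, and the toy target is called once per position, whereas a real target scores the whole block in one pass.

```python
def target_next(ctx):
    """Toy deterministic target: next token as a function of the context."""
    return (7 * sum(ctx) + 3) % 11


def draft_next(ctx):
    """Toy draft: agrees with the target except when len(ctx) is a multiple of 3."""
    return target_next(ctx) if len(ctx) % 3 else 0


def greedy_decode(next_token, prompt, n):
    out = list(prompt)
    for _ in range(n):
        out.append(next_token(out))
    return out[len(prompt):]


def greedy_speculative(target_fn, draft_fn, prompt, n, k=4):
    """Greedy speculation: keep drafted tokens while they match the target's choice."""
    out = list(prompt)
    limit = len(prompt) + n
    while len(out) < limit:
        # Draft k tokens autoregressively with the cheap model.
        ctx = list(out)
        drafted = []
        for _ in range(k):
            drafted.append(draft_fn(ctx))
            ctx.append(drafted[-1])
        for tok in drafted:
            want = target_fn(out)
            out.append(want)  # the committed token is always the target's choice
            if want != tok or len(out) >= limit:
                break  # mismatch (correction already committed) or length reached
        else:
            out.append(target_fn(out))  # bonus token after a fully accepted block
    return out[len(prompt):limit]


prompt = [1, 2]
plain = greedy_decode(target_next, prompt, 20)
spec = greedy_speculative(target_next, draft_next, prompt, 20)
print(plain == spec)
```

For sampling modes, token-for-token equality is not the right check; compare output distributions over many runs with identical sampling settings instead.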

A production failure to design for

A tempting implementation commits draft KV before verification and tries to overwrite rejected positions. Under cancellation or branch mismatch, stale entries survive and poison later attention. Treat speculative state as transactional: reserve, verify, then atomically commit the accepted prefix.
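
The reserve, verify, commit discipline can be shown with a toy, single-threaded KV store. Everything here is illustrative: `SpeculativeKV` is a hypothetical class, and the entries are placeholder strings rather than real key/value tensors or block-table slots.

```python
class SpeculativeKV:
    """Toy KV store: committed entries plus a separate speculative reservation."""

    def __init__(self):
        self.committed = []    # entries visible to later attention
        self._reserved = None  # speculative entries, invisible until committed

    def reserve(self, draft_entries):
        if self._reserved is not None:
            raise RuntimeError("previous speculative block was not resolved")
        self._reserved = list(draft_entries)

    def commit_prefix(self, accepted_len):
        """Publish only the accepted prefix; discard the rest of the reservation."""
        if self._reserved is None:
            raise RuntimeError("nothing reserved")
        if not 0 <= accepted_len <= len(self._reserved):
            raise ValueError("accepted_len out of range")
        # Build the new list first, then rebind once, so no partial state is visible.
        self.committed = self.committed + self._reserved[:accepted_len]
        self._reserved = None

    def abort(self):
        """Cancellation path: drop speculative entries, leave committed untouched."""
        self._reserved = None


kv = SpeculativeKV()
kv.reserve(["kv_a", "kv_b", "kv_c"])
kv.commit_prefix(2)             # target accepted two of three drafted positions
kv.reserve(["kv_d", "kv_e"])
kv.abort()                      # request cancelled mid-verification
print(kv.committed)
```

Because rejected or cancelled entries never enter `committed`, there is nothing to overwrite later, which removes the stale-entry failure described above.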

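Treat speculation as a measured loop, not a one-time flag: acceptance shifts with workload, so keep measuring and re-tune or disable it when the gain disappears.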

The takeaway

Speculative decoding is not free parallelism. It is a carefully accounted bet that cheap guesses will let expensive verification advance more than one token at a time.
