2/20 - Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge
A large language model is often slow at decoding not because one token is impossibly expensive, but because tokens must normally be produced one after another. Speculative decoding asks a cheaper model to sketch several likely next tokens, then lets the target model verify the whole sketch in one pass.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
Think of a junior editor drafting the next sentence while a senior editor reviews several words at once. Accepted words move straight through. At the first rejected word, the senior editor corrects the draft and the process starts again.
How to read this diagram: Start with the Draft model, which proposes k tokens. The middle stage, the Target model, verifies them in one pass. The final stage, the Sampler, shows the observable result: it corrects the first rejection. The arrows describe dependency order, not necessarily separate services.
What actually happens
The draft model proposes a short token sequence. The target model evaluates those proposed positions in parallel. A rejection-sampling rule accepts a prefix and, when needed, samples a correction from the residual distribution. The original algorithm preserves the target model output distribution rather than approximating it.
Speed depends on the acceptance length: how many proposed tokens survive per target-model verification. A draft that is fast but consistently wrong adds overhead. A draft that is almost as expensive as the target saves little. The useful point lies between those extremes.
Serving adds complications. Draft and target tokenizers must align; sampling settings must be handled consistently; KV state for accepted and rejected branches must be committed or discarded correctly; and batching must accommodate requests with different acceptance lengths.
A worked example
Let the draft propose four tokens. If the target accepts three on average, one expensive verification advances generation by roughly three tokens instead of one. But if the draft consumes 40 percent of target latency and only 1.2 tokens are accepted, total latency may increase. Acceptance alone is not the goal; accepted tokens per unit of combined draft-plus-verify time is.
The performance model
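A useful approximation: speedup is the baseline target time per token divided by the speculative time per token, where the speculative time per token is the draft time plus the verification time for one cycle, divided by the tokens committed in that cycle. Memory bandwidth, batch size, target-model utilization, and draft placement all affect that denominator. Benchmark end-to-end time per output token (TPOT), not only acceptance percentage.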
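The approximation can be turned into a small calculation. This is a minimal sketch, assuming all costs are normalized so that one ordinary target decode step costs 1.0 and that the draft cost is given per cycle rather than per token; the function name and parameters are illustrative, not taken from any library.

```python
def speculative_speedup(tokens_per_cycle: float,
                        draft_cost: float,
                        verify_cost: float = 1.0) -> float:
    """Estimated speedup over ordinary decoding.

    Assumption: costs are in units of one ordinary target decode step.
    tokens_per_cycle: tokens committed per draft-plus-verify cycle.
    draft_cost: total draft time per cycle.
    verify_cost: time of the single target verification pass per cycle.
    """
    if tokens_per_cycle <= 0:
        raise ValueError("tokens_per_cycle must be positive")
    cycle_cost = draft_cost + verify_cost
    # Baseline spends 1.0 per token; speculation spends cycle_cost / tokens_per_cycle.
    return tokens_per_cycle / cycle_cost


# Illustrative inputs only, reusing the numbers from the worked example.
good = speculative_speedup(tokens_per_cycle=3.0, draft_cost=0.4)
poor = speculative_speedup(tokens_per_cycle=1.2, draft_cost=0.4)
```

Under these assumptions the first configuration comes out faster than baseline and the second slower, even though both pay the same draft cost. Real verification passes are usually somewhat more expensive than a single decode step, so `verify_cost` should come from measurement.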
How to read this diagram: The left panel asks how speculative decoding changes prompt processing and time to first token (TTFT); the right panel asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
Expert lens
Speculation can reduce single-request latency while hurting server throughput if verification batches become irregular or the draft competes for the same scarce GPU. Some systems colocate the draft, others reserve a smaller GPU, and self-speculative methods reuse early layers. The scheduler is part of the algorithm.
How to read this diagram: The left panel is the baseline, ordinary decoding, characterized by one target step per token and a simple scheduler. The right panel applies speculative decoding, which changes the cost profile: several candidates are checked per verification, and acceptance varies by prompt. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.
Where it wins
- Latency-sensitive generation with a well-matched draft
- Memory-bound target-model decode
- Domains where local token patterns are predictable
Where it disappoints
- Assuming a smaller draft is automatically a better draft
- Comparing quality with mismatched sampling parameters
- Ignoring draft GPU cost in throughput calculations
- Failing to roll back rejected KV branches
Production checklist
- Measure accepted tokens per verification by workload
- Include draft and target cost in the benchmark
- Validate distributional equivalence for sampling modes
- Stress mixed batches with divergent acceptance lengths
- Export draft latency, verify latency, and rejection position
What to measure
- Mean and percentile acceptance length
- Target forward passes per output token
- End-to-end TPOT and tokens per second
- Draft and target GPU utilization
- Rollback tokens and rejected KV bytes
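Several of these metrics can be derived from three numbers recorded per verification cycle. The following is a minimal sketch, assuming the serving loop can report how many tokens were drafted, accepted, and committed in each cycle; the class and field names are hypothetical.

```python
import math
from dataclasses import dataclass, field


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


@dataclass
class SpeculationStats:
    accepted_lengths: list = field(default_factory=list)
    committed_tokens: int = 0
    target_passes: int = 0
    rollback_tokens: int = 0

    def record_cycle(self, drafted: int, accepted: int, committed: int) -> None:
        """Record one draft-plus-verify cycle."""
        self.accepted_lengths.append(accepted)
        self.committed_tokens += committed
        self.target_passes += 1
        # Drafted tokens that did not survive verification must be rolled back.
        self.rollback_tokens += drafted - accepted

    def summary(self) -> dict:
        ordered = sorted(self.accepted_lengths)
        return {
            "mean_accepted": sum(ordered) / len(ordered),
            "p50_accepted": percentile(ordered, 50),
            "p95_accepted": percentile(ordered, 95),
            "target_passes_per_token": self.target_passes / self.committed_tokens,
            "rollback_tokens": self.rollback_tokens,
        }
```

Latency, GPU utilization, and rejected KV bytes need instrumentation from the runtime itself and are outside this sketch.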
From one GPU to a production service
A lab benchmark usually runs one prompt with a draft and target already resident. A production service handles changing batch sizes, temperatures, adapters, and domains. Eligibility should be explicit: some requests speculate, some use ordinary decoding, and the scheduler must combine them without letting draft work delay non-speculative traffic.
Draft selection can be a routing problem. Code, conversational prose, and structured extraction may have different acceptance profiles. A service can maintain acceptance and cost statistics by model pair and workload class, then disable speculation when the expected gain falls below a safety margin.
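One way to sketch that gating logic is a small tracker keyed by model pair and workload class. Everything here is an assumption for illustration: the class name, the exponential moving average, the default of speculating until enough samples exist, and the convention that cost is measured in ordinary target-step units.

```python
from dataclasses import dataclass


@dataclass
class PairStats:
    samples: int = 0
    tokens_per_cycle: float = 0.0  # exponential moving average
    cost_per_cycle: float = 0.0    # in ordinary target-step units (assumption)


class SpeculationGate:
    """Enables speculation only where the observed gain clears a safety margin."""

    def __init__(self, safety_margin=1.1, min_samples=20, smoothing=0.1):
        self.safety_margin = safety_margin
        self.min_samples = min_samples
        self.smoothing = smoothing
        self._stats = {}

    def record(self, key, tokens_committed: float, cycle_cost: float) -> None:
        s = self._stats.setdefault(key, PairStats())
        if s.samples == 0:
            s.tokens_per_cycle, s.cost_per_cycle = tokens_committed, cycle_cost
        else:
            a = self.smoothing
            s.tokens_per_cycle += a * (tokens_committed - s.tokens_per_cycle)
            s.cost_per_cycle += a * (cycle_cost - s.cost_per_cycle)
        s.samples += 1

    def should_speculate(self, key) -> bool:
        s = self._stats.get(key)
        if s is None or s.samples < self.min_samples:
            return True  # assumption: speculate until there is enough evidence
        if s.cost_per_cycle <= 0:
            return False
        # Estimated speedup: tokens committed per unit of draft-plus-verify cost.
        return s.tokens_per_cycle / s.cost_per_cycle >= self.safety_margin
```

A key might look like `("draft-a", "target-x", "code")`. A production version would also need hysteresis so that a workload class does not flap between modes.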
Roll out with shadow verification first: run the draft path, compare committed output and accounting against ordinary target decoding, but do not serve it. Then enable a small traffic slice with hard rollback, latency, and cost guards.
Design-review questions
- Is the output distribution exactly preserved for every supported sampler?
- Who owns draft capacity during target overload?
- How is speculative KV state isolated and reclaimed?
- What acceptance level is required to break even?
- Can the service disable speculation without reloading the target?
How it connects to the rest of the series
Parallel decoding removes serial steps using multiple heads or candidate trees; early exit can make the target model draft for itself; continuous batching must schedule the variable progress that speculation creates.
From equation to implementation
Let q be the draft distribution and p the target distribution. The verifier accepts a proposed token with probability min(1, p(x)/q(x)) in the classic sampling construction. On rejection it samples from a corrected residual distribution. That detail is why exact speculative sampling can preserve p rather than merely choosing whatever the draft suggested.
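The accept-or-resample rule fits in a few lines. This is a minimal sketch using plain Python lists as probability vectors; the function names are illustrative, and it omits the extra token that the classic construction samples from the target when every drafted token is accepted.

```python
import random


def sample_from(probs, rng):
    """Draw an index from a discrete distribution given as a list."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def residual(p, q):
    """Normalized max(0, p - q): the distribution sampled after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        return list(p)  # only when p equals q, where rejection should not occur
    return [d / total for d in diff]


def verify_block(draft_tokens, q_probs, p_probs, rng):
    """Return (committed tokens, number of drafted tokens accepted).

    q_probs[i] and p_probs[i] are the draft and target distributions
    at drafted position i.
    """
    committed = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # q[x] > 0 because x was sampled from q.
        if rng.random() < min(1.0, p[x] / q[x]):
            committed.append(x)
            continue
        # First rejection: replace x with a residual sample and stop.
        committed.append(sample_from(residual(p, q), rng))
        return committed, i
    return committed, len(draft_tokens)
```

A token is emitted with probability min(p, q) through acceptance plus max(0, p - q) through the residual, which sums to p. That identity is what the equivalence tests in the benchmarking checklist should confirm empirically.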
Expected progress depends on correlated acceptance across the proposed block. If alpha is a rough per-token acceptance probability and k tokens are drafted, the expected accepted prefix is a truncated geometric sum. Increasing k has diminishing returns when alpha is modest, while verification work and temporary state continue to grow.
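The diminishing returns are easy to see numerically. The sketch below assumes each drafted token is accepted independently with probability alpha, which is a simplification given that real acceptance is correlated across the block; the function name is illustrative.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected accepted prefix length: sum of alpha**i for i = 1..k.

    Assumes independent per-token acceptance with probability alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return sum(alpha ** i for i in range(1, k + 1))


# Expected accepted length for a modest alpha at several block sizes.
by_block_size = {k: expected_accepted(0.6, k) for k in (2, 4, 8, 16)}
```

For alpha = 0.6 the sum can never exceed alpha / (1 - alpha) = 1.5, however large k grows, so most of the benefit arrives within the first few drafted tokens while draft and verification work keep scaling with k.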
Implementation sketch
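    while not finished:
        # The draft proposes k tokens plus its distribution q at each position.
        draft_tokens, q_probs = draft.propose(prefix, k)
        # One target pass scores every proposed position.
        p_probs = target.verify(prefix, draft_tokens)
        accepted = accept_prefix(p_probs, q_probs, rng)
        commit(accepted.tokens)
        if accepted.rejected:
            # Replace the first rejected token with a residual sample.
            commit(sample_residual(p_probs, q_probs, rng))
        # Discard KV entries for drafted positions that were not committed.
        rollback_uncommitted_kv()
Capacity planning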
Draft placement is part of capacity planning. A draft on the target GPU consumes memory and scheduling slots; a remote draft adds network latency; CPU drafting may be too slow. Calculate speedup using shared end-to-end resources, not an isolated draft benchmark.
Benchmarking without fooling yourself
- Use identical random seeds and sampling settings for equivalence tests.
- Report acceptance by prompt domain, temperature, and output position.
- Measure target passes per committed token and combined GPU-seconds.
- Include mixed traffic where only some models or requests speculate.
A production failure to design for
A tempting implementation commits draft KV before verification and tries to overwrite rejected positions. Under cancellation or branch mismatch, stale entries survive and poison later attention. Treat speculative state as transactional: reserve, verify, then atomically commit the accepted prefix.
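The reserve, verify, commit discipline can be illustrated with a toy store. This is a sketch only: the entries are opaque placeholders standing in for per-token KV blocks, the class is hypothetical, and a real runtime would manage device memory and concurrency rather than Python lists.

```python
class TransactionalKV:
    """Toy per-request KV store: entries become visible only on commit."""

    def __init__(self):
        self._committed = []    # visible to later attention
        self._provisional = []  # reserved for the current speculative block

    def reserve(self, entries):
        """Stage KV entries for a drafted block without exposing them."""
        if self._provisional:
            raise RuntimeError("previous speculative block was not resolved")
        self._provisional = list(entries)

    def commit(self, accepted_count: int):
        """Publish the accepted prefix and drop the rest in one step."""
        if not 0 <= accepted_count <= len(self._provisional):
            raise ValueError("accepted_count is outside the reserved block")
        # Build the new state first, then swap both fields together.
        new_committed = self._committed + self._provisional[:accepted_count]
        self._committed, self._provisional = new_committed, []

    def abort(self):
        """Cancellation or mismatch: discard provisional state only."""
        self._provisional = []

    def visible(self):
        return tuple(self._committed)
```

Because readers only ever see `visible()`, a cancelled or rejected block cannot leave stale entries behind for later attention to consume.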
How to read this diagram: The operating cycle moves from Pair to Verify, then Measure and Tune. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
Deeper engineering guide
The speedup comes from replacing several sequential target-model decode calls with one wider verification call. If the draft proposes k tokens and the target accepts a on average, useful progress per target invocation is approximately a + 1 including the correction token. That gain must exceed draft latency, verification overhead, and rollback cost. Acceptance rate alone is insufficient: two configurations with equal acceptance can have different accepted-run lengths and therefore different speedups.
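The last point is easiest to see with a toy comparison. The sketch below assumes position-wise agreement flags are available from a hypothetical offline measurement, where 1 means the target would have agreed with the draft at that position; the two configurations are invented for illustration.

```python
def accepted_prefix_length(matches):
    """Length of the leading run of agreeing positions in one drafted block."""
    length = 0
    for ok in matches:
        if not ok:
            break
        length += 1
    return length


def summarize(blocks):
    """Return (position-wise agreement rate, mean accepted run length)."""
    total_positions = sum(len(b) for b in blocks)
    match_rate = sum(sum(b) for b in blocks) / total_positions
    mean_run = sum(accepted_prefix_length(b) for b in blocks) / len(blocks)
    return match_rate, mean_run


# Same agreement rate, different placement of the disagreements.
clustered = [[1, 1, 1, 1], [0, 0, 0, 0]]
scattered = [[1, 0, 1, 0], [1, 0, 1, 0]]
clustered_summary = summarize(clustered)
scattered_summary = summarize(scattered)
```

Both configurations agree on half of the positions, but only the leading run counts: an early disagreement discards everything after it. The clustered case therefore advances further per verification.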
How to read this diagram: Follow the state from Draft through Verify and Commit to Recover. Each box is an ownership or computation boundary. Note in particular the correctness boundary: provisional state becomes visible only after target verification. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.
Draft and target tokenizers, vocabulary ordering, sampling transforms, and probability support must agree. Greedy verification is simpler but changes the applicable correctness argument. Sampling-preserving schemes compare target and draft distributions and sample from a corrected residual after rejection. Logit processors, banned-token lists, grammar masks, and temperature must be applied consistently at the correct stage.
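One way to keep those transforms consistent is to route both models' logits through a single function before any acceptance test. This is a minimal sketch, assuming a shared vocabulary and only two transforms (temperature and a banned-token mask); the function name is hypothetical, and real stacks may apply processors in a different order.

```python
import math


def processed_probs(logits, temperature, banned=frozenset()):
    """Apply a banned-token mask and temperature, then softmax.

    Calling this with identical arguments for draft and target logits keeps
    q and p on the same support, so a banned token can be neither drafted
    nor produced by a residual sample.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive; handle greedy separately")
    scaled = [
        -math.inf if i in banned else logit / temperature
        for i, logit in enumerate(logits)
    ]
    peak = max(scaled)
    if peak == -math.inf:
        raise ValueError("every token is banned")
    # Subtract the peak for numerical stability; exp(-inf) evaluates to 0.0.
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]
```

For example, `q = processed_probs(draft_logits, 0.7, banned)` and `p = processed_probs(target_logits, 0.7, banned)` would feed the acceptance rule with distributions built under the same settings.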
How to read this diagram: The bars compare Ordinary decode with Speculative on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain ("Fewer sequential target-model round trips") remains larger than the risk ("Draft work and rejected KV consume real capacity") under production traffic.
The scheduler decides whether the draft shares the target GPU, runs on a smaller device, or reuses early target layers. Colocation avoids network transfer but competes for memory bandwidth. A separate draft GPU isolates work but introduces transport and synchronization. Measure combined fleet tokens per second, not only target-model tokens per second.
How to read this diagram: State advances from Eligible to Provisional, Verified, and finally Committed. The labels below each state identify what becomes true at that boundary. The governing invariant is: Cancellation or mismatch may discard provisional state, never committed history. Retries and cancellation must preserve the same transition rules.
How to read this diagram: The four panels are independent review axes: Acceptance, Placement, Sampling, and Scheduler. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Optimize end-to-end goodput subject to identical output semantics.
How to read this diagram: This is a causal chain, not four unrelated symptoms. "Domain shifts" triggers "Verification grows", which leads to "Fleet slows". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
Primary references
- Fast Inference from Transformers via Speculative Decoding
- Accelerating Large Language Model Decoding with Speculative Sampling
- TensorRT-LLM Speculative Decoding
The takeaway
Speculative decoding is not free parallelism. It is a carefully accounted bet that cheap guesses will let expensive verification advance more than one token at a time.
