2/20 - Speculative Decoding: Let a Small Model Guess, Let a Large Model Judge

A large language model is often slow at decoding not because one token is impossibly expensive, but because tokens must normally be produced one after another. Speculative decoding asks a cheaper model to sketch several likely next tokens, then lets the target model verify the whole sketch in one pass.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Think of a junior editor drafting the next sentence while a senior editor reviews several words at once. Accepted words move straight through. At the first rejected word, the senior editor corrects the draft and the process starts again.

Figure (mechanism flow): Speculative decoding request path. 1. Draft model: propose k tokens (cheap serial guesses). 2. Target model: verify in one pass (accept valid prefix). 3. Sampler: correct first rejection (preserve distribution). Input → transform → outcome.
Follow the state and work from left to right.

How to read this diagram: Start with the draft model, which proposes k tokens. In the middle stage, the target model verifies them in one pass. The final stage, the sampler, shows the observable result: it corrects the first rejection. The arrows describe dependency order, not necessarily separate services.

What actually happens

The draft model proposes a short token sequence. The target model evaluates those proposed positions in parallel. A rejection-sampling rule accepts a prefix and, when needed, samples a correction from the residual distribution. The original algorithm preserves the target model output distribution rather than approximating it.

Speed depends on the acceptance length: how many proposed tokens survive per target-model verification. A draft that is fast but consistently wrong adds overhead. A draft that is almost as expensive as the target saves little. The useful point lies between those extremes.

Serving adds complications. Draft and target tokenizers must align; sampling settings must be handled consistently; KV state for accepted and rejected branches must be committed or discarded correctly; and batching must accommodate requests with different acceptance lengths.

A worked example

Let the draft propose four tokens. If the target accepts three on average, one expensive verification advances generation by roughly three tokens instead of one. But if the draft consumes 40 percent of target latency and only 1.2 tokens are accepted, total latency may increase. Acceptance alone is not the goal; accepted tokens per unit of combined draft-plus-verify time is.
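
The arithmetic can be made concrete with a small sketch. It assumes that one round costs the draft time plus one verification pass, that a verification pass costs about the same as one ordinary target step, and that each round emits the accepted draft tokens plus one correction token. The function name, the 5 percent draft cost in the first case, and the reading of "40 percent" as a cost per drafted token are assumptions made for illustration, not figures from a benchmark.

```python
def time_per_token(k, accepted, draft_cost_per_token, verify_cost=1.0):
    """Cost per emitted token for one speculative round.

    All costs are in units of one ordinary target decode step, so the
    baseline (no speculation) is 1.0 per token.
    Assumes the round emits `accepted` draft tokens plus one correction token.
    """
    round_cost = k * draft_cost_per_token + verify_cost
    tokens_emitted = accepted + 1
    return round_cost / tokens_emitted


# Well-matched, cheap draft (assumed 5% of a target step per drafted token).
good = time_per_token(k=4, accepted=3.0, draft_cost_per_token=0.05)

# Expensive draft with poor acceptance (40% of a target step per drafted token).
bad = time_per_token(k=4, accepted=1.2, draft_cost_per_token=0.40)
```

Under these assumed numbers the first configuration spends about 0.3 target-step-equivalents per token, while the second spends more than 1.0, which means it is slower than ordinary decoding.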

The performance model

A useful approximation: speedup is the baseline target time for the tokens a round advances, divided by the draft time plus verification time that round costs. Memory bandwidth, batch size, target-model utilization, and draft placement all affect that denominator. Benchmark end-to-end TPOT (time per output token), not only acceptance percentage.
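
One way to write this approximation down, using symbols that are assumptions of this sketch rather than notation from the article: let T_t be the time of one ordinary target decode step, T_d the time to draft one token, T_v the time of one verification pass over k drafted tokens, and \bar{a} the mean number of accepted draft tokens per round (each round also emits one correction token).

```latex
\text{speedup} \;\approx\; \frac{(\bar{a} + 1)\, T_t}{k\, T_d + T_v},
\qquad
\text{speedup} > 1 \iff k\, T_d + T_v < (\bar{a} + 1)\, T_t
```

The second form is the break-even condition: a round is worthwhile only if drafting plus verifying costs less than the target steps it replaces.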

Figure (phase fit): Where speculative decoding changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; usually little direct prefill benefit. Decode: one new token per iteration; weight and KV bandwidth pressure; fewer sequential target-model passes. Prove it with: accepted tokens per pass and TPOT. Deployment decision: enable only for positive net goodput.
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how speculative decoding changes prompt processing and time to first token (TTFT); the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Speculation can reduce single-request latency while hurting server throughput if verification batches become irregular or the draft competes for the same scarce GPU. Some systems colocate the draft, others reserve a smaller GPU, and self-speculative methods reuse early layers. The scheduler is part of the algorithm.

Figure (trade-off map): Speculative decoding, the tradeoff. Baseline (ordinary decoding): one target step per token; simple scheduler; predictable KV growth; no draft overhead. Optimized (speculative decoding): several candidates per verify; acceptance varies by prompt; extra draft compute and state; potentially lower TPOT. Measure both sides under the same workload.
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, ordinary decoding, characterized by one target step per token and a simple scheduler. The right panel applies speculative decoding, which changes the cost profile to several candidates per verification, with acceptance that varies by prompt. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Latency-sensitive generation with a well-matched draft
  • Memory-bound target-model decode
  • Domains where local token patterns are predictable

Where it disappoints

  • Assuming a smaller draft is automatically a better draft
  • Comparing quality with mismatched sampling parameters
  • Ignoring draft GPU cost in throughput calculations
  • Failing to roll back rejected KV branches

Production checklist

  • Measure accepted tokens per verification by workload
  • Include draft and target cost in the benchmark
  • Validate distributional equivalence for sampling modes
  • Stress mixed batches with divergent acceptance lengths
  • Export draft latency, verify latency, and rejection position

What to measure

  • Mean and percentile acceptance length
  • Target forward passes per output token
  • End-to-end TPOT and tokens per second
  • Draft and target GPU utilization
  • Rollback tokens and rejected KV bytes
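
Several of these metrics can be derived from a per-round log. The sketch below assumes a hypothetical `RoundLog` record with three counters; the field names are illustrative and do not correspond to any particular serving framework. It reports the median as one example percentile.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RoundLog:
    drafted: int   # tokens proposed by the draft in this round
    accepted: int  # length of the accepted prefix
    emitted: int   # tokens committed (accepted prefix plus correction or bonus)


def summarize(rounds):
    """Aggregate per-round logs into the acceptance and progress metrics."""
    lengths = [r.accepted for r in rounds]
    total_emitted = sum(r.emitted for r in rounds)
    total_drafted = sum(r.drafted for r in rounds)
    return {
        "mean_accept_len": statistics.fmean(lengths),
        "median_accept_len": statistics.median(lengths),
        # One target verification pass per round.
        "target_passes_per_token": len(rounds) / total_emitted,
        # Drafted tokens that were rejected and must be rolled back.
        "rollback_tokens": total_drafted - sum(lengths),
    }


rounds = [RoundLog(4, 3, 4), RoundLog(4, 0, 1), RoundLog(4, 4, 5)]
summary = summarize(rounds)
```

Segmenting the same summary by prompt domain or temperature is a matter of grouping the logs before calling `summarize`.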

From one GPU to a production service

A lab benchmark usually runs one prompt with a draft and target already resident. A production service handles changing batch sizes, temperatures, adapters, and domains. Eligibility should be explicit: some requests speculate, some use ordinary decoding, and the scheduler must combine them without letting draft work delay non-speculative traffic.

Draft selection can be a routing problem. Code, conversational prose, and structured extraction may have different acceptance profiles. A service can maintain acceptance and cost statistics by model pair and workload class, then disable speculation when the expected gain falls below a safety margin.
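
A minimal sketch of such a controller follows. The class name, the cohort key, and the thresholds are all hypothetical. It assumes each round reports the tokens it emitted and its combined draft-plus-verify cost in units of one ordinary target decode step, so that tokens divided by cost estimates the speedup over ordinary decoding.

```python
from collections import defaultdict


class SpeculationGate:
    """Per-cohort eligibility based on measured progress per unit cost."""

    def __init__(self, min_speedup=1.1, min_rounds=50):
        self.min_speedup = min_speedup  # safety margin above break-even (1.0)
        self.min_rounds = min_rounds    # evidence required before disabling
        self.stats = defaultdict(lambda: {"rounds": 0, "tokens": 0, "cost": 0.0})

    def record(self, cohort, tokens_emitted, round_cost):
        s = self.stats[cohort]
        s["rounds"] += 1
        s["tokens"] += tokens_emitted
        s["cost"] += round_cost

    def should_speculate(self, cohort):
        s = self.stats[cohort]
        if s["rounds"] < self.min_rounds or s["cost"] <= 0:
            return True  # not enough evidence yet, keep measuring
        return s["tokens"] / s["cost"] >= self.min_speedup


gate = SpeculationGate(min_speedup=1.1, min_rounds=2)
for _ in range(2):
    gate.record(("pair-a", "code"), tokens_emitted=4, round_cost=1.2)
    gate.record(("pair-a", "chat"), tokens_emitted=1, round_cost=1.4)
```

With these illustrative numbers the "code" cohort stays eligible and the "chat" cohort falls back to ordinary decoding. A production controller would likely also decay old statistics so that it can re-enable speculation after a domain shift reverses.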

Roll out with shadow verification first: run the draft path, compare committed output and accounting against ordinary target decoding, but do not serve it. Then enable a small traffic slice with hard rollback, latency, and cost guards.

Design-review questions

  • Is the output distribution exactly preserved for every supported sampler?
  • Who owns draft capacity during target overload?
  • How is speculative KV state isolated and reclaimed?
  • What acceptance level is required to break even?
  • Can the service disable speculation without reloading the target?

How it connects to the rest of the series

Parallel decoding removes serial steps using multiple heads or candidate trees; early exit can make the target model draft for itself; continuous batching must schedule the variable progress that speculation creates.

From equation to implementation

Let q be the draft distribution and p the target distribution. The verifier accepts a proposed token with probability min(1, p(x)/q(x)) in the classic sampling construction. On rejection it samples from a corrected residual distribution. That detail is why exact speculative sampling can preserve p rather than merely choosing whatever the draft suggested.
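
The acceptance rule and the residual correction can be sketched with the standard library alone. This is a toy over small probability lists, not a serving implementation: the function names are illustrative, distributions are plain Python lists indexed by token id, and the bonus token sampled when every drafted token is accepted comes from the commonly described form of the algorithm rather than from the text above.

```python
import random


def sample(probs, rng):
    """Draw a token id from a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def residual(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]


def verify_block(draft_tokens, q_probs, p_probs, rng):
    """Return the tokens to commit for one speculative round.

    q_probs[i] and p_probs[i] are the draft and target distributions at
    drafted position i. p_probs carries one extra entry for the position
    after the last drafted token (used for the bonus token).
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept with probability min(1, p(x) / q(x)); q[x] > 0 because
        # x was sampled from q.
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
        else:
            # First rejection: emit a correction and stop the round.
            out.append(sample(residual(p, q), rng))
            return out
    # Every drafted token was accepted: sample one more from the target.
    out.append(sample(p_probs[len(draft_tokens)], rng))
    return out
```

A quick way to build confidence in the rule is to draft one token from q many times, run `verify_block`, and check that the empirical distribution of the committed token approaches p rather than q.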

Expected progress depends on correlated acceptance across the proposed block. If alpha is a rough per-token acceptance probability and k tokens are drafted, the expected accepted prefix is a truncated geometric sum. Increasing k has diminishing returns when alpha is modest, while verification work and temporary state continue to grow.
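
Under the simplifying assumption that each drafted token is accepted independently with the same probability \alpha (real acceptance is correlated, as noted above), the truncated geometric sum can be written out explicitly. The second expression adds the one correction or bonus token emitted per round.

```latex
\mathbb{E}[\text{accepted}] \;=\; \sum_{i=1}^{k} \alpha^{i} \;=\; \frac{\alpha\,(1 - \alpha^{k})}{1 - \alpha},
\qquad
\mathbb{E}[\text{tokens per round}] \;=\; \frac{1 - \alpha^{k+1}}{1 - \alpha}
```

For example, with \alpha = 0.6, doubling k from 4 to 8 raises the expected tokens per round only from about 2.31 to about 2.47, while the drafting and verification work roughly doubles.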

Implementation sketch

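while not finished:
    # Draft k tokens and keep the draft probabilities q for each position.
    draft_tokens, q_probs = draft.propose(prefix, k)
    # One target pass scores every drafted position in parallel.
    p_probs = target.verify(prefix, draft_tokens)
    # Accept the longest valid prefix under the min(1, p/q) rule.
    accepted = accept_prefix(p_probs, q_probs, rng)
    commit(accepted.tokens)
    if accepted.rejected:
        # First rejection: sample the correction from the residual distribution.
        commit(sample_residual(p_probs, q_probs, rng))
    # Discard KV for drafted positions that were not committed.
    rollback_uncommitted_kv()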

Capacity planning

Draft placement is part of capacity planning. A draft on the target GPU consumes memory and scheduling slots; a remote draft adds network latency; CPU drafting may be too slow. Calculate speedup using shared end-to-end resources, not an isolated draft benchmark.

Benchmarking without fooling yourself

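  • Use identical sampling settings for equivalence tests. Compare greedy outputs token for token; compare sampled outputs statistically, because speculative and ordinary decoding consume random numbers differently even with the same seed.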
  • Report acceptance by prompt domain, temperature, and output position.
  • Measure target passes per committed token and combined GPU-seconds.
  • Include mixed traffic where only some models or requests speculate.

A production failure to design for

A tempting implementation commits draft KV before verification and tries to overwrite rejected positions. Under cancellation or branch mismatch, stale entries survive and poison later attention. Treat speculative state as transactional: reserve, verify, then atomically commit the accepted prefix.
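
The reserve, verify, commit pattern can be illustrated with a toy state holder. This is a sketch under strong simplifications: KV entries are opaque placeholders in Python lists, whereas a real server manages paged GPU blocks, and the class and method names are hypothetical.

```python
class SpeculativeKV:
    """Committed history plus a private provisional branch."""

    def __init__(self):
        self.committed = []    # entries visible to later attention
        self.provisional = []  # branch entries, never visible to others

    def reserve(self, entries):
        """Stage entries for the drafted positions without publishing them."""
        self.provisional = list(entries)

    def commit(self, accepted_len, correction=None):
        """Publish the accepted prefix in one step and drop the rest."""
        new = self.provisional[:accepted_len]
        if correction is not None:
            new.append(correction)
        # Build the new history first, then rebind once, so no reader
        # observes a partially updated list.
        self.committed = self.committed + new
        self.provisional = []

    def cancel(self):
        """Discard provisional state; committed history is untouched."""
        self.provisional = []


kv = SpeculativeKV()
kv.reserve(["a", "b", "c", "d"])
kv.commit(accepted_len=2, correction="x")  # "c" and "d" were rejected
kv.reserve(["e", "f"])
kv.cancel()                                # request cancelled mid-round
```

The design choice is that rejected or cancelled entries are never written into `committed` in the first place, so there is nothing stale to overwrite later.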

Figure (operating loop): 1. Pair: choose draft model; match tokenizer. 2. Verify: check distribution; test rollback. 3. Measure: accepted tokens per pass; combined GPU cost. 4. Tune: draft length and placement; traffic eligibility. Measure → learn → repeat.
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Pair to Verify, then Measure and Tune. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

The speedup comes from replacing several sequential target-model decode calls with one wider verification call. If the draft proposes k tokens and the target accepts a on average, useful progress per target invocation is approximately a + 1 including the correction token. That gain must exceed draft latency, verification overhead, and rollback cost. Acceptance rate alone is insufficient: two configurations with equal acceptance can have different accepted-run lengths and therefore different speedups.

Figure: A speculative decode transaction. Draft: propose k tokens; keep provisional KV; do not publish. Verify: one target pass; score every position; find first rejection. Commit: accept valid prefix; append target KV; emit safe tokens. Recover: sample correction; discard stale branch; start next round. Correctness boundary: provisional state becomes visible only after target verification.
Speculation is a transaction over tokens and KV state, not a loose race between models.

How to read this diagram: Follow the state from Draft through Verify and Commit to Recover. Each box is an ownership or computation boundary. The correctness boundary is that provisional state becomes visible only after target verification. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Draft and target tokenizers, vocabulary ordering, sampling transforms, and probability support must agree. Greedy verification is simpler but changes the applicable correctness argument. Sampling-preserving schemes compare target and draft distributions and sample from a corrected residual after rejection. Logit processors, banned-token lists, grammar masks, and temperature must be applied consistently at the correct stage.
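
One simple way to keep the two distributions consistent is to route both models' logits through the same processing function before computing p and q. The sketch below is illustrative: the function name, the logit values, and the order of operations (mask, then temperature, then softmax) are assumptions, and real stacks may order or implement processors differently. What matters for exactness is that q is the distribution the draft actually sampled from and p is the target's final sampling distribution.

```python
import math


def to_probs(logits, temperature, banned):
    """Mask banned token ids, apply temperature, then softmax.

    Assumes temperature > 0 and that at least one token is not banned.
    """
    scaled = [
        -math.inf if i in banned else logit / temperature
        for i, logit in enumerate(logits)
    ]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a 4-token vocabulary shared by both models.
banned = {3}
p = to_probs([2.0, 1.0, 0.5, 4.0], temperature=0.8, banned=banned)  # target
q = to_probs([1.5, 1.2, 0.1, 3.0], temperature=0.8, banned=banned)  # draft
```

Because the banned token has zero probability under both p and q, the draft never proposes it and the residual distribution never samples it.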

Figure: Accepted progress must repay speculation overhead. Ordinary decode: 1 token per target pass. Speculative: 4.2 tokens per target pass. Hidden cost: draft work and rejected KV consume real capacity. Useful gain: fewer sequential target-model round trips.
Illustrative progress per target invocation; production results depend on acceptance and placement.

How to read this diagram: The bars compare ordinary decode with speculative decode on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (fewer sequential target-model round trips) remains larger than the risk (draft work and rejected KV consuming real capacity) under production traffic.

The scheduler decides whether the draft shares the target GPU, runs on a smaller device, or reuses early target layers. Colocation avoids network transfer but competes for memory bandwidth. A separate draft GPU isolates work but introduces transport and synchronization. Measure combined fleet tokens per second, not only target-model tokens per second.

Figure: State ownership across one speculative round. Eligible (policy allows draft) → Provisional (branch KV is private) → Verified (target fixes boundary) → Committed (tokens become visible). Cancellation or mismatch may discard provisional state, never committed history.
Explicit states prevent rejected branches from leaking into later attention.

How to read this diagram: State advances from Eligible to Provisional, Verified, and finally Committed. The labels below each state identify what becomes true at that boundary. The governing invariant is: Cancellation or mismatch may discard provisional state, never committed history. Retries and cancellation must preserve the same transition rules.

Figure: Four controls determine real speedup. Acceptance: measure run-length distribution; segment by prompt domain. Placement: account for draft resources; include transfer latency. Sampling: preserve target distribution; mirror masks and processors. Scheduler: batch verification shapes; protect ordinary traffic. Optimize end-to-end goodput subject to identical output semantics.
Speculative decoding is a model-pair, sampling, and scheduling problem at the same time.

How to read this diagram: The four panels are independent review axes: Acceptance, Placement, Sampling, and Scheduler. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Optimize end-to-end goodput subject to identical output semantics.

Figure: How a promising draft becomes negative speedup. Domain shifts: draft predicts poorly; accepted runs shrink. Verification grows: wide passes do little; branches roll back. Fleet slows: draft steals capacity; tail latency rises. Control: disable by cohort; fall back instantly. Eligibility should react to measured accepted progress, not a global feature flag.
A safe controller removes speculation when its marginal value turns negative.

How to read this diagram: This is a causal chain, not four unrelated symptoms. A domain shift makes verification do more work for less progress, which in turn slows the fleet. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Speculative decoding is not free parallelism. It is a carefully accounted bet that cheap guesses will let expensive verification advance more than one token at a time.