Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production

Speculative decoding and quantization both promise faster inference. They do it in completely different ways.

Speculative decoding tries to reduce the number of expensive target-model decode steps. Quantization tries to make each step cheaper by using smaller numerical formats and moving less memory.

One changes the algorithmic schedule. The other changes representation and kernels.

The best production systems often use both, but not because “more optimization” is automatically better. They use both when the workload, model, quality target, and serving runtime agree.

What speculative decoding optimizes

Autoregressive decoding is sequential. Generate a token, append it, generate the next token. Speculative decoding adds a draft path:

  1. A cheaper draft model or draft head predicts several tokens.
  2. The target model verifies those tokens.
  3. Accepted tokens are emitted.
  4. Rejected tokens fall back safely.

The original speculative decoding work showed that, under the algorithm’s assumptions, this can preserve the target distribution while reducing target-model passes. That is the magic: speed without changing the answer distribution. Production still has to measure it carefully because implementation details, sampling, model mismatch, and workload shape matter.
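The four steps above can be sketched in a few lines. This is a toy, greedy-acceptance version with hypothetical `draft_next`/`target_next` callables; the published algorithm verifies all drafts in one batched target pass and accepts probabilistically against token probabilities, which is what preserves the target distribution under sampling.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One speculative decoding step (greedy-acceptance toy version).

    draft_next / target_next are hypothetical callables mapping a token
    sequence to the next token. The real algorithm verifies all k drafts
    in a single batched target pass and accepts probabilistically.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2-4. The target verifies: accept while it agrees, fall back to
    # the target's own token at the first disagreement.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))
            break
    else:
        # All drafts accepted: the verification pass yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted

# Toy models: the draft agrees with the target most of the time.
target = lambda ctx: (len(ctx) * 7) % 5
draft = lambda ctx: target(ctx) if len(ctx) % 3 else 0
```

Because acceptance here is exact-match greedy, the emitted stream is identical to plain greedy decoding with the target model; only the scheduling of target work changes.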

Speculative decoding helps most when:

  • decode dominates latency
  • output is long enough
  • draft model is cheap
  • acceptance rate is high
  • runtime verifies efficiently
  • memory overhead is acceptable

It helps least when:

  • prefill dominates
  • output is short
  • draft tokens are often rejected
  • draft model consumes scarce GPU memory
  • tool latency dominates the workflow

What quantization optimizes

Quantization changes numeric precision. Instead of FP16/BF16 everywhere, you may use FP8, INT8, or INT4, produced by methods such as AWQ, GPTQ, or SmoothQuant-style approaches, or runtime-specific formats.

The goals:

  • reduce model memory footprint
  • reduce memory bandwidth pressure
  • increase effective throughput
  • fit larger models or longer contexts
  • lower cost per token

The tradeoff is quality and compatibility. Some models tolerate lower precision well. Some tasks expose degradation quickly. Some kernels are fast only for certain shapes, GPUs, and batch sizes. NVIDIA’s TensorRT-LLM, vLLM, and SGLang all support quantization paths, but the practical win depends on the exact model, hardware, and runtime.
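As a concrete illustration of the representation change, here is a minimal symmetric per-tensor INT8 scheme. Production methods (AWQ, GPTQ, SmoothQuant) use per-channel or group-wise scales and calibration data; this sketch shows only the core idea:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization, minimal sketch.

    Floats become a small integer grid plus one scale factor. Real
    schemes use per-channel/group scales and calibrated activations.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-128, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.51, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# `recovered` is close to `weights` but not equal; that rounding gap,
# accumulated across billions of parameters, is the quality risk.
```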

[Figure: Speculative decoding versus quantization. Two speedups, two different levers. Speculative decoding (Draft → Verify): win = accepted draft tokens; risk = low acceptance or extra memory overhead. Quantization (FP16 → INT4): win = cheaper model step; risk = quality loss or kernel mismatch.]
Speculation changes how many target passes you need. Quantization changes the cost and quality of each pass.

They fail differently

Speculative decoding failure looks like this:

draft tokens proposed: 5 per step
accepted tokens: 1.1 per step
extra draft compute: non-trivial
result: complexity without much speedup

Quantization failure looks like this:

latency improved
quality on benchmark looked okay
tool-use JSON got worse
customer escalations increased
result: cheaper wrong answers

The first is a performance disappointment. The second can be a product incident.
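The speculative failure case is easy to sanity-check with a back-of-envelope cost model. This is a simplification (one target verification pass plus k draft passes per step, draft at some fraction of target cost), not a benchmark:

```python
def speculative_speedup(tokens_per_verify, k, draft_cost_ratio):
    """Rough speedup estimate: each step costs one target pass plus
    k draft passes and yields tokens_per_verify tokens, versus one
    token per plain target pass. Ignores batching and kernel effects."""
    step_cost = 1.0 + k * draft_cost_ratio
    return tokens_per_verify / step_cost

# The failure numbers above: 5 drafts per step, 1.1 tokens kept,
# draft model at 10% of target cost.
bad = speculative_speedup(1.1, k=5, draft_cost_ratio=0.10)   # < 1.0: a net slowdown
good = speculative_speedup(3.5, k=5, draft_cost_ratio=0.10)  # > 2x with high acceptance
```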

What to measure

For speculative decoding:

  • draft tokens proposed
  • accepted tokens per verification
  • rejection rate by task type
  • target forward passes saved
  • TTFT (time to first token) and TPOT (time per output token)
  • memory overhead
  • quality parity
  • interaction with batching

For quantization:

  • quality on task-specific evals
  • exact match / pass rate / tool-call validity
  • calibration set coverage
  • throughput and latency by batch size
  • memory footprint
  • numerical stability
  • output drift
  • cost per useful token
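Output drift, for instance, can be measured directly by replaying a fixed prompt set through both models. A sketch, assuming hypothetical `generate_base`/`generate_quant` callables that return greedy token lists:

```python
def output_drift(prompts, generate_base, generate_quant):
    """Fraction of prompts whose greedy output diverges after quantization,
    plus the position of the first divergence. generate_base and
    generate_quant are assumed callables returning token lists."""
    diverged, first_diffs = 0, []
    for p in prompts:
        a, b = generate_base(p), generate_quant(p)
        if a != b:
            diverged += 1
            first_diffs.append(next(
                (i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                min(len(a), len(b))))  # one output is a prefix of the other
    return diverged / len(prompts), first_diffs
```

Early divergence on tool-call prompts is exactly the signal that a benchmark score can hide.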

For both:

  • p95 and p99 latency
  • cancellation behavior
  • retry rate
  • user correction rate
  • cost per accepted answer
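"Cost per accepted answer" needs a concrete definition before it can gate a rollout. One hypothetical way to operationalize it:

```python
def cost_per_accepted_answer(total_cost, requests, retry_rate, correction_rate):
    """One hypothetical definition: spend divided by answers users actually
    kept (neither retried nor corrected). The exact definition is a product
    decision; the point is to divide by useful output, not raw tokens."""
    kept = requests * (1 - retry_rate) * (1 - correction_rate)
    return total_cost / kept

# A config that is cheaper per request can still lose on this metric
# if it raises retries and corrections (illustrative numbers).
baseline = cost_per_accepted_answer(100.0, 1000, retry_rate=0.02, correction_rate=0.05)
faster = cost_per_accepted_answer(80.0, 1000, retry_rate=0.15, correction_rate=0.15)
# baseline ≈ $0.107, faster ≈ $0.111 per accepted answer
```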

Which one should you try first?

If the model barely fits in memory, try quantization first.

If decode is your bottleneck and quality is already acceptable, try speculative decoding.

If long prompts dominate, neither may be the first move. You may need prefix caching, chunked prefill, RAG compression, or KV-aware routing before either optimization pays back.

If output quality is fragile, be conservative with quantization. Speculative decoding can preserve the target distribution under the right algorithm, while quantization changes numerical behavior. That does not make speculation risk-free; it just moves the risk to implementation and acceptance rate.

[Figure: Decision tree for speculative decoding and quantization. Pick the lever that matches the bottleneck; there is no universal winner, the workload decides. Memory fit → try quantization first. Decode speed → try speculation. Long prefill → fix caching/routing. Quality risk → build evals first. Measure useful-token cost, not just benchmark tokens/sec.]
Optimization starts with bottleneck diagnosis. Otherwise you are turning knobs for entertainment.

Can you combine them?

Yes, but test the combined system, not each trick in isolation.

Common combinations:

  • quantized target model plus speculative decoding
  • quantized draft model plus full-precision target
  • FP8 target with draft head
  • speculative decoding plus prefix caching
  • quantized KV cache plus quantized weights

The combined risk is interaction. A lower-precision target may change acceptance behavior. A quantized draft may draft worse tokens. Batching may behave differently. Memory saved by quantization may be partly consumed by draft machinery.

The rollout plan should compare:

  1. baseline
  2. quantization only
  3. speculation only
  4. both together

Measure all four on the same traffic sample.
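A minimal harness for that four-way comparison might look like this, assuming each variant exposes a hypothetical `serve(request)` callable returning `(latency_seconds, passed_quality_check)`:

```python
def compare_variants(traffic_sample, variants):
    """Rollout comparison sketch: baseline, quantization only, speculation
    only, and both together, all replaying the exact same traffic sample.
    `variants` maps a name to a hypothetical serve(request) callable."""
    report = {}
    for name, serve in variants.items():
        results = [serve(req) for req in traffic_sample]
        latencies = sorted(lat for lat, _ in results)
        report[name] = {
            # nearest-rank p95; a production harness would also track p99
            "p95_latency": latencies[int(0.95 * (len(latencies) - 1))],
            "pass_rate": sum(ok for _, ok in results) / len(results),
        }
    return report
```

Judge the combined configuration on its own row of this report, not on the assumption that two individually winning tricks compose.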

The production recommendation

Use speculative decoding when you can maintain high acceptance and your workload is decode-heavy.

Use quantization when memory footprint, bandwidth, or cost per token is the limiting factor and quality evals stay green.

Use both only when the combined eval shows real improvement in:

  • cost per accepted answer
  • latency inside SLO
  • quality on product-specific tasks
  • operational complexity you can actually support

The last line matters. A 20% speedup that creates a system nobody can debug at 2 AM is not a speedup. It is debt wearing running shoes.

Sources worth reading