Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production
Speculative decoding and quantization both promise faster inference. They do it in completely different ways.
Speculative decoding tries to reduce the number of expensive target-model decode steps. Quantization tries to make each step cheaper by using smaller numerical formats and moving less memory.
One changes the algorithmic schedule. The other changes representation and kernels.
The best production systems often use both, but not because “more optimization” is automatically better. They use both when the workload, model, quality target, and serving runtime agree.
What speculative decoding optimizes
Autoregressive decoding is sequential. Generate a token, append it, generate the next token. Speculative decoding adds a draft path:
- A cheaper draft model or draft head predicts several tokens.
- The target model verifies those tokens.
- Accepted tokens are emitted.
- Rejected tokens fall back safely.
The original speculative decoding work showed that, under the algorithm’s assumptions, this can preserve the target distribution while reducing target-model passes. That is the magic: speed without changing the answer distribution. Production still has to measure it carefully because implementation details, sampling, model mismatch, and workload shape matter.
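To make the loop concrete, here is a toy sketch of the verification step, assuming you already have per-position probability vectors from both models; `verify_draft`, `p_draft`, and `p_target` are illustrative names, not a real runtime API:

```python
import numpy as np

def verify_draft(draft_tokens, p_draft, p_target, rng=None):
    """Toy verification step from speculative decoding.

    draft_tokens: token ids proposed by the draft model.
    p_draft[i], p_target[i]: full-vocabulary probability vectors at draft
        position i from the draft and target models (each sums to 1).
    Returns the tokens emitted by this verification: the accepted prefix,
    plus one token resampled from the residual distribution on the first
    rejection. (The "bonus" target token emitted when every draft token
    is accepted is omitted to keep the sketch short.)
    """
    rng = rng or np.random.default_rng()
    emitted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i][tok] / max(p_draft[i][tok], 1e-12)):
            emitted.append(int(tok))
            continue
        # On rejection, resample from max(0, p_target - p_draft), renormalized.
        # This correction is what keeps the overall output distributed the
        # same as target-only sampling.
        residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
        residual /= residual.sum()
        emitted.append(int(rng.choice(len(residual), p=residual)))
        break
    return emitted
```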
Speculative decoding helps most when:
- decode dominates latency
- output is long enough
- draft model is cheap
- acceptance rate is high
- runtime verifies efficiently
- memory overhead is acceptable
It helps least when:
- prefill dominates
- output is short
- draft tokens are often rejected
- draft model consumes scarce GPU memory
- tool latency dominates the workflow
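A back-of-envelope way to see why acceptance rate dominates: if each drafted token is accepted independently with probability alpha and the draft proposes k tokens, the expected number of tokens per target verification is (1 - alpha^(k+1)) / (1 - alpha), the form used in the original paper's analysis. A rough sketch under that simplifying assumption:

```python
def expected_tokens_per_verification(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha
    (plus one bonus token when everything is accepted)."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def rough_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Very rough speedup estimate: tokens per verification divided by the
    relative cost of one target pass plus k draft steps.
    draft_cost_ratio is the per-step cost of the draft relative to the
    target (e.g. 0.05 for a much smaller draft model)."""
    return expected_tokens_per_verification(alpha, k) / (1.0 + k * draft_cost_ratio)

# Example: alpha=0.8, k=5, draft at 5% of a target step
# -> (1 - 0.8**6) / 0.2 ≈ 3.69 tokens per verification,
#    speedup ≈ 3.69 / 1.25 ≈ 2.95x.
# At alpha=0.3 the same setup gives ≈ 1.14x: complexity for almost nothing.
```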
What quantization optimizes
Quantization changes numeric precision. Instead of FP16/BF16 everywhere, you may run FP8, INT8, or INT4 weights, produced by methods such as AWQ, GPTQ, or SmoothQuant, or by runtime-specific quantization paths.
The goals:
- reduce model memory footprint
- reduce memory bandwidth pressure
- increase effective throughput
- fit larger models or longer contexts
- lower cost per token
The tradeoff is quality and compatibility. Some models tolerate lower precision well. Some tasks expose degradation quickly. Some kernels are fast only for certain shapes, GPUs, and batch sizes. NVIDIA’s TensorRT-LLM, vLLM, and SGLang all support quantization paths, but the practical win depends on the exact model, hardware, and runtime.
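To make the representation change concrete, here is a minimal sketch of one of the simplest schemes: symmetric per-channel INT8 weight quantization with absmax scaling. Methods like AWQ, GPTQ, and SmoothQuant add calibration and error compensation on top of ideas like this; the sketch is illustrative, not any library's implementation:

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization with absmax scaling.
    w: floating-point weight matrix of shape (out_features, in_features).
    Returns int8 weights plus per-channel FP32 scales for dequantization."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per output row
    scales = np.where(scales == 0, 1.0, scales)              # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover a floating-point approximation of the original weights."""
    return q.astype(np.float32) * scales

# The gap between w and dequantize(*quantize_int8_per_channel(w)) is the
# quantization error that task-level evals need to catch.
```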
They fail differently
Speculative decoding failure looks like this:
- draft tokens proposed: 5 per step
- accepted tokens: 1.1 per step
- extra draft compute: non-trivial
- result: complexity without much speedup
Quantization failure looks like this:
- latency improved
- quality on benchmark looked okay
- tool-use JSON got worse
- customer escalations increased
- result: cheaper wrong answers
The first is a performance disappointment. The second can be a product incident.
What to measure
For speculative decoding:
- draft tokens proposed
- accepted tokens per verification
- rejection rate by task type
- target forward passes saved
- TTFT and TPOT (time to first token, time per output token)
- memory overhead
- quality parity
- interaction with batching
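Most of that list falls out of three counters the serving layer can keep per request. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class SpecDecodeCounters:
    """Per-request speculative decoding counters (hypothetical names)."""
    draft_tokens_proposed: int = 0
    tokens_accepted: int = 0
    target_forward_passes: int = 0

    def acceptance_rate(self) -> float:
        # Fraction of drafted tokens the target actually kept.
        return self.tokens_accepted / max(self.draft_tokens_proposed, 1)

    def tokens_per_verification(self) -> float:
        # Accepted tokens per expensive target pass: the headline number.
        return self.tokens_accepted / max(self.target_forward_passes, 1)
```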
For quantization:
- quality on task-specific evals
- exact match / pass rate / tool-call validity
- calibration set coverage
- throughput and latency by batch size
- memory footprint
- numerical stability
- output drift
- cost per useful token
For both:
- p95 and p99 latency
- cancellation behavior
- retry rate
- user correction rate
- cost per accepted answer
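For the shared metrics, here is a sketch of how tail latency and cost per accepted answer can be computed from per-request records; the record fields are assumptions, not any runtime's schema:

```python
import numpy as np

def tail_latency(latencies_ms, q):
    """Empirical p95/p99 from a list of request latencies in milliseconds."""
    return float(np.percentile(np.asarray(latencies_ms), q))

def cost_per_accepted_answer(records):
    """records: iterable of dicts with hypothetical fields
    'cost_usd' (serving cost of the request) and 'accepted'
    (True if the user kept the answer: no retry, no correction)."""
    total_cost = sum(r["cost_usd"] for r in records)
    accepted = sum(1 for r in records if r["accepted"])
    return total_cost / max(accepted, 1)

# Example:
# records = [{"cost_usd": 0.004, "accepted": True},
#            {"cost_usd": 0.004, "accepted": False}]
# cost_per_accepted_answer(records) -> 0.008
```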
Which one should you try first?
If the model barely fits in memory, try quantization first.
If decode is your bottleneck and quality is already acceptable, try speculative decoding.
If long prompts dominate, neither may be the first move. You may need prefix caching, chunked prefill, RAG compression, or KV-aware routing before either optimization pays back.
If output quality is fragile, be conservative with quantization. Speculative decoding can preserve the target distribution under the right algorithm, while quantization changes numerical behavior. That does not make speculation risk-free; it just moves the risk to implementation and acceptance rate.
Can you combine them?
Yes, but test the combined system, not each trick in isolation.
Common combinations:
- quantized target model plus speculative decoding
- quantized draft model plus full-precision target
- FP8 target with draft head
- speculative decoding plus prefix caching
- quantized KV cache plus quantized weights
The combined risk is interaction. A lower-precision target may change acceptance behavior. A quantized draft may draft worse tokens. Batching may behave differently. Memory saved by quantization may be partly consumed by draft machinery.
The rollout plan should compare:
- baseline
- quantization only
- speculation only
- both together
Measure all four on the same traffic sample.
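One way to keep the four arms honest is to replay the same traffic sample through each configuration and report identical metrics side by side. A sketch, where `run_inference` and `evaluate` are placeholders for whatever your serving stack and eval harness expose:

```python
import numpy as np

# Hypothetical four-arm comparison over one replayed traffic sample.
ARMS = ["baseline", "quant_only", "spec_only", "quant_plus_spec"]

def compare_arms(prompts, run_inference, evaluate):
    """run_inference(arm, prompt) -> dict with 'output', 'latency_ms', 'cost_usd';
    evaluate(arm, prompt, output) -> bool for a task-specific quality check.
    Both callables are placeholders for your own stack."""
    results = {}
    for arm in ARMS:
        rows = [run_inference(arm, p) for p in prompts]
        results[arm] = {
            "p95_latency_ms": float(np.percentile([r["latency_ms"] for r in rows], 95)),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
            "quality_pass_rate": float(np.mean(
                [evaluate(arm, p, r["output"]) for p, r in zip(prompts, rows)])),
        }
    return results
```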
The production recommendation
Use speculative decoding when you can maintain high acceptance and your workload is decode-heavy.
Use quantization when memory footprint, bandwidth, or cost per token is the limiting factor and quality evals stay green.
Use both only when the combined eval shows real improvement in:
- cost per accepted answer
- latency inside SLO
- quality on product-specific tasks
- operational complexity you can actually support
The last line matters. A 20% speedup that creates a system nobody can debug at 2 AM is not a speedup. It is debt wearing running shoes.
Sources worth reading
- Speculative decoding paper for the distribution-preserving algorithmic idea.
- vLLM speculative decoding docs.
- TensorRT-LLM speculative decoding docs.
- SGLang speculative decoding docs.
- TensorRT-LLM quantization docs.
- AWQ paper, GPTQ paper, and SmoothQuant paper for quantization background.
