Draft Tokens or Smaller Numbers? Speculative Decoding vs Quantization in Production
Speculative decoding and quantization both promise faster inference. They do it in completely different ways.
Speculative decoding tries to reduce the number of expensive target-model decode steps. Quantization tries to make each step cheaper by using smaller numerical formats and moving less memory.
One changes the algorithmic schedule. The other changes representation and kernels.
The best production systems often use both, but not because “more optimization” is automatically better. They use both when the workload, model, quality target, and serving runtime agree.
What speculative decoding optimizes
Autoregressive decoding is sequential. Generate a token, append it, generate the next token. Speculative decoding adds a draft path:
- A cheaper draft model or draft head predicts several tokens.
- The target model verifies those tokens.
- Accepted tokens are emitted.
- Rejected tokens fall back safely.
The original speculative decoding work showed that, under the algorithm’s assumptions, this can preserve the target distribution while reducing target-model passes. That is the magic: speed without changing the answer distribution. Production still has to measure it carefully because implementation details, sampling, model mismatch, and workload shape matter.
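To make the loop concrete, here is a toy sketch of the verification step, assuming you already have per-position probability vectors from both models; `verify_draft`, `p_draft`, and `p_target` are illustrative names, not a real runtime API:

```python
import numpy as np

def verify_draft(draft_tokens, p_draft, p_target, rng=None):
    """Toy verification step from speculative decoding.

    draft_tokens: token ids proposed by the draft model.
    p_draft[i], p_target[i]: full-vocabulary probability vectors at draft
        position i from the draft and target models (each sums to 1).
    Returns the tokens emitted by this verification: the accepted prefix,
    plus one token resampled from the residual distribution on the first
    rejection. (The "bonus" target token emitted when every draft token
    is accepted is omitted to keep the sketch short.)
    """
    rng = rng or np.random.default_rng()
    emitted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the drafted token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i][tok] / max(p_draft[i][tok], 1e-12)):
            emitted.append(int(tok))
            continue
        # On rejection, resample from max(0, p_target - p_draft), renormalized.
        # This correction is what keeps the overall output distributed the
        # same as target-only sampling.
        residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
        residual /= residual.sum()
        emitted.append(int(rng.choice(len(residual), p=residual)))
        break
    return emitted
```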
Speculative decoding helps most when:
- decode dominates latency
- output is long enough
- draft model is cheap
- acceptance rate is high
- runtime verifies efficiently
- memory overhead is acceptable
It helps least when:
- prefill dominates
- output is short
- draft tokens are often rejected
- draft model consumes scarce GPU memory
- tool latency dominates the workflow
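A back-of-envelope way to see why acceptance rate dominates: if each drafted token is accepted independently with probability alpha and the draft proposes k tokens, the expected number of tokens per target verification is (1 - alpha^(k+1)) / (1 - alpha), the form used in the original paper's analysis. A rough sketch under that simplifying assumption:

```python
def expected_tokens_per_verification(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha
    (plus one bonus token when everything is accepted)."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def rough_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Very rough speedup estimate: tokens per verification divided by the
    relative cost of one target pass plus k draft steps.
    draft_cost_ratio is the per-step cost of the draft relative to the
    target (e.g. 0.05 for a much smaller draft model)."""
    return expected_tokens_per_verification(alpha, k) / (1.0 + k * draft_cost_ratio)

# Example: alpha=0.8, k=5, draft at 5% of a target step
# -> (1 - 0.8**6) / 0.2 ≈ 3.69 tokens per verification,
#    speedup ≈ 3.69 / 1.25 ≈ 2.95x.
# At alpha=0.3 the same setup gives ≈ 1.14x: complexity for almost nothing.
```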
What quantization optimizes
Quantization changes numeric precision. Instead of FP16/BF16 everywhere, you may run FP8, INT8, or INT4 weights, produced by methods such as AWQ, GPTQ, or SmoothQuant, or by runtime-specific quantization paths.
The goals:
- reduce model memory footprint
- reduce memory bandwidth pressure
- increase effective throughput
- fit larger models or longer contexts
- lower cost per token
The tradeoff is quality and compatibility. Some models tolerate lower precision well. Some tasks expose degradation quickly. Some kernels are fast only for certain shapes, GPUs, and batch sizes. NVIDIA’s TensorRT-LLM, vLLM, and SGLang all support quantization paths, but the practical win depends on the exact model, hardware, and runtime.
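To make the representation change concrete, here is a minimal sketch of one of the simplest schemes: symmetric per-channel INT8 weight quantization with absmax scaling. Methods like AWQ, GPTQ, and SmoothQuant add calibration and error compensation on top of ideas like this; the sketch is illustrative, not any library's implementation:

```python
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization with absmax scaling.
    w: floating-point weight matrix of shape (out_features, in_features).
    Returns int8 weights plus per-channel FP32 scales for dequantization."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per output row
    scales = np.where(scales == 0, 1.0, scales)              # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover a floating-point approximation of the original weights."""
    return q.astype(np.float32) * scales

# The gap between w and dequantize(*quantize_int8_per_channel(w)) is the
# quantization error that task-level evals need to catch.
```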
They fail differently
Speculative decoding failure looks like this:
- draft tokens proposed: 5 per step
- accepted tokens: 1.1 per step
- extra draft compute: non-trivial
- result: complexity without much speedup
Quantization failure looks like this:
- latency improved
- quality on benchmark looked okay
- tool-use JSON got worse
- customer escalations increased
- result: cheaper wrong answers
The first is a performance disappointment. The second can be a product incident.
What to measure
For speculative decoding:
- draft tokens proposed
- accepted tokens per verification
- rejection rate by task type
- target forward passes saved
- TTFT and TPOT (time to first token, time per output token)
- memory overhead
- quality parity
- interaction with batching
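Most of that list falls out of three counters the serving layer can keep per request. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class SpecDecodeCounters:
    """Per-request speculative decoding counters (hypothetical names)."""
    draft_tokens_proposed: int = 0
    tokens_accepted: int = 0
    target_forward_passes: int = 0

    def acceptance_rate(self) -> float:
        # Fraction of drafted tokens the target actually kept.
        return self.tokens_accepted / max(self.draft_tokens_proposed, 1)

    def tokens_per_verification(self) -> float:
        # Accepted tokens per expensive target pass: the headline number.
        return self.tokens_accepted / max(self.target_forward_passes, 1)
```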
For quantization:
- quality on task-specific evals
- exact match / pass rate / tool-call validity
- calibration set coverage
- throughput and latency by batch size
- memory footprint
- numerical stability
- output drift
- cost per useful token
For both:
- p95 and p99 latency
- cancellation behavior
- retry rate
- user correction rate
- cost per accepted answer
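For the shared metrics, here is a sketch of how tail latency and cost per accepted answer can be computed from per-request records; the record fields are assumptions, not any runtime's schema:

```python
import numpy as np

def tail_latency(latencies_ms, q):
    """Empirical p95/p99 from a list of request latencies in milliseconds."""
    return float(np.percentile(np.asarray(latencies_ms), q))

def cost_per_accepted_answer(records):
    """records: iterable of dicts with hypothetical fields
    'cost_usd' (serving cost of the request) and 'accepted'
    (True if the user kept the answer: no retry, no correction)."""
    total_cost = sum(r["cost_usd"] for r in records)
    accepted = sum(1 for r in records if r["accepted"])
    return total_cost / max(accepted, 1)

# Example:
# records = [{"cost_usd": 0.004, "accepted": True},
#            {"cost_usd": 0.004, "accepted": False}]
# cost_per_accepted_answer(records) -> 0.008
```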
Which one should you try first?
If the model barely fits in memory, try quantization first.
If decode is your bottleneck and quality is already acceptable, try speculative decoding.
If long prompts dominate, neither may be the first move. You may need prefix caching, chunked prefill, RAG compression, or KV-aware routing before either optimization pays back.
If output quality is fragile, be conservative with quantization. Speculative decoding can preserve the target distribution under the right algorithm, while quantization changes numerical behavior. That does not make speculation risk-free; it just moves the risk to implementation and acceptance rate.
Can you combine them?
Yes, but test the combined system, not each trick in isolation.
Common combinations:
- quantized target model plus speculative decoding
- quantized draft model plus full-precision target
- FP8 target with draft head
- speculative decoding plus prefix caching
- quantized KV cache plus quantized weights
The combined risk is interaction. A lower-precision target may change acceptance behavior. A quantized draft may draft worse tokens. Batching may behave differently. Memory saved by quantization may be partly consumed by draft machinery.
The rollout plan should compare:
- baseline
- quantization only
- speculation only
- both together
Measure all four on the same traffic sample.
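One way to keep the four arms honest is to replay the same traffic sample through each configuration and report identical metrics side by side. A sketch, where `run_inference` and `evaluate` are placeholders for whatever your serving stack and eval harness expose:

```python
import numpy as np

# Hypothetical four-arm comparison over one replayed traffic sample.
ARMS = ["baseline", "quant_only", "spec_only", "quant_plus_spec"]

def compare_arms(prompts, run_inference, evaluate):
    """run_inference(arm, prompt) -> dict with 'output', 'latency_ms', 'cost_usd';
    evaluate(arm, prompt, output) -> bool for a task-specific quality check.
    Both callables are placeholders for your own stack."""
    results = {}
    for arm in ARMS:
        rows = [run_inference(arm, p) for p in prompts]
        results[arm] = {
            "p95_latency_ms": float(np.percentile([r["latency_ms"] for r in rows], 95)),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
            "quality_pass_rate": float(np.mean(
                [evaluate(arm, p, r["output"]) for p, r in zip(prompts, rows)])),
        }
    return results
```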
The production recommendation
Use speculative decoding when you can maintain high acceptance and your workload is decode-heavy.
Use quantization when memory footprint, bandwidth, or cost per token is the limiting factor and quality evals stay green.
Use both only when the combined eval shows real improvement in:
- cost per accepted answer
- latency inside SLO
- quality on product-specific tasks
- operational complexity you can actually support
The last line matters. A 20% speedup that creates a system nobody can debug at 2 AM is not a speedup. It is debt wearing running shoes.
Sources worth reading
- Speculative decoding paper for the distribution-preserving algorithmic idea.
- vLLM speculative decoding docs.
- TensorRT-LLM speculative decoding docs.
- SGLang speculative decoding docs.
- TensorRT-LLM quantization docs.
- AWQ paper, GPTQ paper, and SmoothQuant paper for quantization background.
