Mixed Precision Inference: Spend Bits Where They Matter

#mixed-precision #fp8 #bf16 #inference #tensor-cores

Precision is not one switch for an entire model. Weights, activations, accumulators, normalization, logits, and KV cache have different numerical needs. Mixed precision is the discipline of spending bits where error is dangerous and saving them where bandwidth dominates.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A map does not use the same resolution everywhere. A highway overview can be coarse; a surgical route through city streets needs detail. Mixed precision assigns numerical resolution to each operation instead of shrinking everything blindly.

What actually happens

BF16 and FP16 both store 16 bits but split them differently. BF16 keeps FP32's 8-bit exponent and leaves 7 mantissa bits; FP16 has 10 mantissa bits for finer precision but only a 5-bit exponent, so its range is narrower (largest finite value 65504). FP8 introduces formats such as E4M3 and E5M2, plus per-tensor or block scaling.
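
The range-versus-precision trade can be seen in a few lines of NumPy. This is a minimal sketch: NumPy has no native BF16 type, so `to_bf16` is a hypothetical helper that emulates it by rounding the FP32 bit pattern to its top 16 bits, and it does not handle NaN or infinity.

```python
import numpy as np

def to_bf16(x):
    """Emulate BF16 by rounding FP32 to its top 16 bits (round-to-nearest-even).

    Illustrative helper, not a library API; NaN and infinity are not handled.
    """
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    # Bias that implements round-to-nearest-even on the 16 discarded bits.
    round_bias = ((bits >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    rounded = (bits + round_bias) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

# A value above FP16's largest finite number (65504) overflows to infinity.
with np.errstate(over="ignore"):
    fp16_large = np.float32(70000.0).astype(np.float16)
bf16_large = to_bf16(70000.0)[0]

# A small offset from 1.0 survives in FP16 but is rounded away in BF16.
fp16_fine = np.float32(1.001).astype(np.float16)
bf16_fine = to_bf16(1.001)[0]

print("70000 -> FP16:", fp16_large, " BF16:", bf16_large)
print("1.001 -> FP16:", fp16_fine, " BF16:", bf16_fine)
```

FP16 loses the large value entirely, while BF16 keeps it at coarse resolution; near 1.0 the roles reverse.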

Matrix inputs may use low precision while accumulation remains FP16 or FP32. Normalization, reductions, softmax, and final logits are often kept at higher precision. The optimal recipe depends on hardware instructions and the model’s tensor distributions.
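
Why the accumulator matters can be shown with a toy dot product. The sketch below assumes a simple sequential sum that rounds to the accumulator dtype after every step; real Tensor Core kernels accumulate in a different order, so treat the numbers as an illustration of the effect, not a model of any kernel.

```python
import numpy as np

def dot_with_accumulator(a, b, acc_dtype):
    """Dot product that rounds the running sum to acc_dtype after every step."""
    acc = acc_dtype(0.0)
    for x, y in zip(a, b):
        acc = acc_dtype(acc + acc_dtype(x) * acc_dtype(y))
    return float(acc)

# Both inputs are stored in FP16; only the accumulator dtype changes.
a = np.full(10_000, 0.1, dtype=np.float16)
b = np.full(10_000, 0.1, dtype=np.float16)

exact = float(np.dot(a.astype(np.float64), b.astype(np.float64)))
fp16_acc = dot_with_accumulator(a, b, np.float16)
fp32_acc = dot_with_accumulator(a, b, np.float32)

print(f"float64 reference: {exact:.3f}")
print(f"FP16 accumulator:  {fp16_acc:.3f}")
print(f"FP32 accumulator:  {fp32_acc:.3f}")
```

With identical FP16 inputs, the FP16 accumulator stalls once the spacing between representable sums grows larger than the terms being added, while the FP32 accumulator stays close to the reference.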

Low-precision storage helps only when kernels consume that format efficiently. Reformatting tensors around unsupported operations can erase gains. Engine builders therefore search for coherent precision regions and insert quantize-dequantize boundaries carefully.

A worked example

A 70-billion-parameter model needs about 140 GB for BF16 weights and about 70 GB at one byte per weight, before metadata and scales. On weight bytes alone, that difference may halve the GPU count, for example from two 80 GB GPUs to one, but activation and KV formats still determine runtime capacity.
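
The arithmetic is short enough to script. The helper names below are illustrative, and the GPU counts are lower bounds from weight bytes only; scales, activations, and KV cache are deliberately left out.

```python
import math

def weight_gb(n_params, bytes_per_param):
    """Weight storage in decimal gigabytes, excluding scales and metadata."""
    return n_params * bytes_per_param / 1e9

def min_gpus_for_weights(total_gb, gpu_gb):
    """Lower bound on GPU count from weight bytes alone."""
    return math.ceil(total_gb / gpu_gb)

N_PARAMS = 70e9
bf16_gb = weight_gb(N_PARAMS, 2)  # two bytes per weight
fp8_gb = weight_gb(N_PARAMS, 1)   # one byte per weight

for gpu_gb in (40, 80):
    print(
        f"{gpu_gb} GB GPUs: BF16 needs at least {min_gpus_for_weights(bf16_gb, gpu_gb)}, "
        f"one-byte weights need at least {min_gpus_for_weights(fp8_gb, gpu_gb)}"
    )
```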

The performance model

Lower precision reduces memory traffic and can unlock higher Tensor Core throughput. Speedup is limited by unsupported layers, conversion overhead, small matrix shapes, and non-matrix work. Compare complete request latency and quality, not theoretical peak TOPS or FLOPS.

Expert lens

FP8 is not merely an integer quantizer with a decimal point. Scaling policy, amax history, format choice, and accumulation determine stability. On newer hardware, microscaling formats apply scales to smaller blocks, changing both accuracy and kernel layout.
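
The effect of scaling granularity can be simulated without FP8 hardware. The sketch below makes several assumptions: `fake_e4m3` is a hypothetical quantize-dequantize helper that rounds to a 3-mantissa-bit grid and clips at 448 (the largest finite E4M3 value), scales are plain floats derived from the current amax, and the block size of 32 is chosen for illustration. Production recipes usually add amax history and constrained scale formats.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def fake_e4m3(x):
    """Simulate E4M3 rounding in float64: 3 mantissa bits, clip at +/-448."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(x)            # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, -5)       # below 2**-6 the grid is subnormal (fixed step)
    step = np.ldexp(1.0, exp - 4)   # spacing of representable values near x
    return np.round(x / step) * step

def quant_dequant(x, block=None):
    """Scale each block so its amax maps to E4M3_MAX, quantize, then rescale."""
    blocks = x.reshape(1, -1) if block is None else x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = E4M3_MAX / np.maximum(amax, 1e-12)
    return (fake_e4m3(blocks * scale) / scale).reshape(x.shape)

def rms_error(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.01, size=1024)
x[0] = 1000.0  # a single outlier dominates the per-tensor amax

tensor_err = rms_error(x, quant_dequant(x))            # one scale for the tensor
block_err = rms_error(x, quant_dequant(x, block=32))   # one scale per 32 values

print(f"per-tensor RMS error: {tensor_err:.2e}")
print(f"per-block RMS error:  {block_err:.2e}")
```

With one scale for the whole tensor, the outlier pushes ordinary values into the coarse subnormal range. Per-block scales confine that damage to the block containing the outlier.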

Mixed precision moves cost between compute, memory capacity, and bandwidth, and it adds conversion work at every precision boundary.

Where it wins

Bandwidth-bound transformer layers
Models that nearly exceed device memory
Hardware with native low-precision Tensor Cores

Where it disappoints

Using one dtype for every operation
Ignoring calibration and outlier tensors
Counting compressed bytes without kernel support
Validating only perplexity and not task behavior

Production checklist

Inventory formats for weights, activations, KV, and accumulators
Keep numerically sensitive reductions at safe precision
Inspect inserted casts and Q/DQ boundaries
Run task, safety, and long-context evaluations
Benchmark on the exact target accelerator

What to measure

Kernel time by precision
Cast and reformat overhead
Memory footprint by tensor class
Quality delta by task and sequence length
Fallback operations using higher precision

From one GPU to a production service

A single-model experiment can accept one global precision recipe. A model platform needs a compatibility matrix across architectures, GPUs, engine versions, and quality suites. The same label, such as FP8, can represent different scaling and accumulation behavior.

Build artifacts should record precision per layer and the reason for any fallback. Operators need to distinguish a deliberate BF16 normalization from an unsupported FP8 matrix that silently reduced expected throughput.
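
One way to make that distinction machine-checkable is a per-layer record with an explicit reason field. The schema and layer names below are hypothetical, not the output format of any particular engine builder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerPrecision:
    """Hypothetical per-layer entry in a build report."""
    layer: str
    requested: str
    actual: str
    reason: str = ""

def silent_fallbacks(report):
    """Layers whose actual dtype differs from the request with no recorded reason."""
    return [r.layer for r in report if r.actual != r.requested and not r.reason]

report = [
    LayerPrecision("attn.qkv_proj", "fp8", "fp8"),
    # Deliberate choice: requested and built at BF16.
    LayerPrecision("final_norm", "bf16", "bf16", "normalization kept at BF16"),
    # Fallback with an explanation recorded at build time.
    LayerPrecision("mlp.up_proj", "fp8", "bf16", "no FP8 kernel for this shape"),
    # Fallback with no explanation: this is the case operators need surfaced.
    LayerPrecision("mlp.down_proj", "fp8", "bf16"),
]

print("silent fallbacks:", silent_fallbacks(report))
```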

Quality gates should include deterministic numerical checks, task evaluations, safety behavior, long-context stress, and calibration drift. Precision is a rollout dimension like a new model, not a transparent storage change.

Design-review questions

Which operations remain at higher precision and why?
What scaling recipe and calibration corpus were used?
How many runtime reformats occur per request?
Does quality change with context length or language?
Can the engine fall back without violating capacity assumptions?

How it connects to the rest of the series

Quantized kernels determine whether low-bit formats are actually fast. Graph optimization can fuse precision boundaries. KV caching and offloading may use a different dtype from model weights.

From equation to implementation

Numerical error enters through rounding, clipping, underflow, and accumulated summation. Matrix multiplications often tolerate low-precision inputs because accumulators retain more precision. Softmax exponentials, normalization statistics, and small residual differences may require wider formats.
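
Softmax is a compact example of an operation that needs care. The sketch below contrasts a naive FP16 softmax with one that subtracts the maximum and computes in FP32 before casting the result back; the logit values are arbitrary and chosen only so that the exponential exceeds FP16's range.

```python
import numpy as np

def softmax_naive_fp16(logits):
    """Exponentiate and normalize entirely in FP16, with no max subtraction."""
    z = logits.astype(np.float16)
    with np.errstate(over="ignore", invalid="ignore"):
        e = np.exp(z)       # exp of values above about 11.09 overflows FP16
        return e / e.sum()  # inf / inf produces NaN

def softmax_fp32(logits):
    """Subtract the max, compute in FP32, and cast only the final result."""
    z = logits.astype(np.float32)
    e = np.exp(z - z.max())
    return (e / e.sum()).astype(np.float16)

logits = np.array([12.0, 10.0, 8.0])
naive = softmax_naive_fp16(logits)
stable = softmax_fp32(logits)

print("naive FP16:", naive)
print("FP32 math: ", stable)
```

Max subtraction removes the overflow, and the wider format keeps the exponentials and their sum accurate before the result is stored at 16 bits.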

Precision propagation should be observable in the built engine. Quantize and dequantize boundaries, inserted reformats, and fallback layers define the actual recipe. A configuration that says FP8 can still execute large regions in FP16.

Implementation sketch

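# Pseudocode: every helper name is a placeholder, not a specific engine API.
stats = {}
for tensor_group in calibration_set:
    # Record amax and outliers per tensor group; these drive the FP8 scales.
    stats[tensor_group.name] = collect_amax_and_outliers(tensor_group)
for op in graph:
    if op.is_low_precision_safe and kernel_supported(op.shape):
        # Low-precision inputs, wider accumulator.
        set_input_precision(op, FP8, stats)
        set_accumulator_precision(op, FP32)  # FP16 only where validated
    else:
        # Fall back and record why, so the build report can explain it.
        keep(op, BF16)
candidate = build_engine(graph)
# Gate on task metrics, not only on per-layer numerical error.
compare_outputs(reference, candidate, task_metrics)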

Capacity planning

Weight bytes are easy to calculate; runtime memory is not. Add scales, duplicate packed layouts, activation buffers, KV dtype, engine tactics, and graph-capture pools. Validate by measuring peak allocated memory at the largest batch size and sequence length the service will run concurrently.
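
A first-pass estimate can still be scripted before measuring. The sketch below uses the standard KV cache size formula (one key and one value vector per layer, per KV head, per token). The model shape, concurrency, and the lump-sum `OTHER_BYTES` for scales, activations, and pools are all assumed placeholder values, so the result is a planning number, not a substitute for measured peak memory.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem):
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

def peak_estimate_gb(weight_bytes, kv_bytes, other_bytes):
    """Rough peak in decimal gigabytes; other_bytes is a placeholder lump sum."""
    return (weight_bytes + kv_bytes + other_bytes) / 1e9

# Hypothetical model shape and load, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
TOKENS = 32 * 8192       # 32 concurrent sequences of 8192 tokens
WEIGHT_BYTES = 70e9 * 1  # one byte per weight
OTHER_BYTES = 8e9        # assumed: scales, activations, graph-capture pools

for name, kv_elem_bytes in (("16-bit KV", 2), ("8-bit KV", 1)):
    kv = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS, kv_elem_bytes)
    total = peak_estimate_gb(WEIGHT_BYTES, kv, OTHER_BYTES)
    print(f"{name}: KV cache {kv / 1e9:.1f} GB, estimated peak {total:.1f} GB")
```

Under these assumptions the KV cache is comparable in size to the weights, which is why its dtype belongs in the capacity plan.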

Benchmarking without fooling yourself

Use both layer-wise error and end-task evaluation.
Sweep prompt length because activation ranges can shift.
Inspect real kernel precision and reformat time.
Compare accuracy and latency on the same hardware and engine profile.
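
For the layer-wise half of that comparison, a signal-to-quantization-noise ratio is one common metric. The sketch below is an assumption-laden stand-in: it uses random matrices instead of real layer inputs, and it models a candidate layer as FP16-rounded inputs with FP32 accumulation. A real harness would capture the reference and candidate outputs from the built engine.

```python
import numpy as np

def sqnr_db(reference, candidate):
    """Signal-to-quantization-noise ratio in dB; higher means closer to reference."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    noise = np.sum((reference - candidate) ** 2)
    if noise == 0.0:
        return float("inf")
    return float(10.0 * np.log10(np.sum(reference ** 2) / noise))

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 256)).astype(np.float32)
w = rng.standard_normal((256, 64)).astype(np.float32)

reference = x @ w
# Candidate: inputs rounded to FP16, product accumulated in FP32.
x_lp = x.astype(np.float16).astype(np.float32)
w_lp = w.astype(np.float16).astype(np.float32)
candidate = x_lp @ w_lp

layer_sqnr = sqnr_db(reference, candidate)
print(f"layer SQNR: {layer_sqnr:.1f} dB")
```

A per-layer number like this localizes where error enters, but it does not replace end-task evaluation: a layer can pass a threshold and still shift task behavior.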

A production failure to design for

A calibration set contains short English prompts only. A multilingual long-context workload produces larger activation outliers, saturating FP8 ranges in one attention block. Quality degrades only for that tenant. Calibration coverage must reflect sequence length, language, and modality.
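
A cheap guard against this failure is to track how often live activations exceed the calibrated range. The sketch below uses synthetic Gaussian data as a stand-in for both workloads; the distributions and sample sizes are assumptions made for illustration.

```python
import numpy as np

def saturation_rate(activations, calibrated_amax):
    """Fraction of values that would clip at the calibrated range."""
    return float(np.mean(np.abs(activations) > calibrated_amax))

rng = np.random.default_rng(0)
calibration = rng.normal(0.0, 1.0, size=10_000)  # stand-in for short English prompts
production = rng.normal(0.0, 3.0, size=10_000)   # stand-in for a wider-range workload

calibrated_amax = float(np.abs(calibration).max())
calib_rate = saturation_rate(calibration, calibrated_amax)
prod_rate = saturation_rate(production, calibrated_amax)

print(f"calibrated amax: {calibrated_amax:.2f}")
print(f"saturation on calibration data: {calib_rate:.2%}")
print(f"saturation on production data:  {prod_rate:.2%}")
```

Reporting this rate per tenant, language, and sequence-length bucket would surface the failure above before it shows up as a quality regression.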

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Mixed precision is a numerical architecture. The goal is not the smallest dtype; it is the cheapest validated path through the model.
