8/20 - Mixed Precision Inference: Spend Bits Where They Matter

#mixed-precision #fp8 #bf16 #inference #tensor-cores

Precision is not one switch for an entire model. Weights, activations, accumulators, normalization, logits, and KV cache have different numerical needs. Mixed precision is the discipline of spending bits where error is dangerous and saving them where bandwidth dominates.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A map does not use the same resolution everywhere. A highway overview can be coarse; a surgical route through city streets needs detail. Mixed precision assigns numerical resolution to each operation instead of shrinking everything blindly.

Follow the state and work from left to right.

Description: Start with the high-precision model, where tensor ranges are profiled. The middle stage, quantized execution, scales the inputs and accumulates the results. The final stage, validated output, shows the observable result: a quality comparison against the reference. The arrows describe dependency order, not necessarily separate services.

What actually happens

BF16 and FP16 store 16 bits but trade dynamic range differently. BF16 keeps an FP32-like exponent with fewer mantissa bits; FP16 offers more mantissa precision but a narrower range. FP8 introduces formats such as E4M3 and E5M2, plus per-tensor or block scaling.
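
The range and resolution trade-off follows directly from the exponent and mantissa bit counts. The sketch below derives the largest finite value and the spacing just above 1.0 for each format. It assumes E4M3 follows the common convention in which the top exponent code still holds finite values and only one bit pattern per sign is NaN; the `FloatFormat` class and its field names are illustrative, not taken from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exponent_bits: int
    mantissa_bits: int
    ieee_like: bool = True  # False: top exponent code also holds finite values (E4M3)

    @property
    def bias(self) -> int:
        return 2 ** (self.exponent_bits - 1) - 1

    @property
    def max_finite(self) -> float:
        top_code = 2 ** self.exponent_bits - 1
        if self.ieee_like:
            # Top exponent code is reserved for inf/NaN.
            return (2 - 2.0 ** -self.mantissa_bits) * 2.0 ** (top_code - 1 - self.bias)
        # Assumed E4M3 rule: only the all-ones mantissa in the top code is NaN.
        return (2 - 2.0 ** (1 - self.mantissa_bits)) * 2.0 ** (top_code - self.bias)

    @property
    def epsilon(self) -> float:
        # Spacing between 1.0 and the next representable value.
        return 2.0 ** -self.mantissa_bits


FORMATS = [
    FloatFormat("FP16", 5, 10),
    FloatFormat("BF16", 8, 7),
    FloatFormat("FP8 E5M2", 5, 2),
    FloatFormat("FP8 E4M3", 4, 3, ieee_like=False),
]

for fmt in FORMATS:
    print(f"{fmt.name:9s} max={fmt.max_finite:.4g} eps={fmt.epsilon:.4g}")
```

BF16 reaches roughly the FP32 range with coarse steps, FP16 has finer steps but tops out at 65504, and the two FP8 formats repeat the same trade at a much smaller scale.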

Matrix inputs may use low precision while accumulation remains FP16 or FP32. Normalization, reductions, softmax, and final logits are often kept at higher precision. The optimal recipe depends on hardware instructions and the model’s tensor distributions.
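
A small experiment shows why the accumulator format matters even when every input is representable. This is a sketch using only the standard library: `struct` format `e` rounds to IEEE half precision, and a Python float stands in for a wider accumulator such as FP32. The helper names are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(values, narrow: bool) -> float:
    total = 0.0
    for v in values:
        total += to_fp16(v)          # inputs are always stored in FP16
        if narrow:
            total = to_fp16(total)   # FP16 accumulator: round after every add
    return total


values = [0.001] * 10_000
fp16_sum = accumulate(values, narrow=True)
wide_sum = accumulate(values, narrow=False)  # Python float stands in for a wider accumulator
print(f"FP16 accumulator: {fp16_sum:.3f}")
print(f"wide accumulator: {wide_sum:.3f}")
```

With these inputs the FP16 running total stops growing long before it reaches the true sum, because each addend eventually falls below half the spacing between adjacent FP16 values at the current total.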

Low-precision storage helps only when kernels consume that format efficiently. Reformatting tensors around unsupported operations can erase gains. Engine builders therefore search for coherent precision regions and insert quantize-dequantize boundaries carefully.

A worked example

A 70-billion-parameter model needs about 140 GB for BF16 weights and about 70 GB at one byte per weight, before metadata and scales. For the weights alone, that difference can change the deployment from four 40 GB GPUs to two, or from two 80 GB GPUs to one, but activation and KV formats still determine runtime capacity.
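
The arithmetic can be reproduced in a few lines. The scale overhead below assumes a hypothetical layout of one 2-byte scale per block of 128 weights; real recipes differ, so treat that term as a placeholder.

```python
def weight_gb(num_params: float, bytes_per_weight: float) -> float:
    """Weight storage in decimal gigabytes, excluding scales and metadata."""
    return num_params * bytes_per_weight / 1e9


def scale_overhead_gb(num_params: float, block_size: int, bytes_per_scale: float) -> float:
    """Extra storage for one scale per block of weights."""
    return (num_params / block_size) * bytes_per_scale / 1e9


params = 70e9
bf16 = weight_gb(params, 2)
fp8 = weight_gb(params, 1)
# Hypothetical scaling layout: one 2-byte scale per 128 weights.
scales = scale_overhead_gb(params, block_size=128, bytes_per_scale=2)
print(f"BF16 weights: {bf16:.0f} GB")
print(f"FP8 weights:  {fp8:.0f} GB + {scales:.2f} GB of scales")
```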

The performance model

Lower precision reduces memory traffic and can unlock higher Tensor Core throughput. Speedup is limited by unsupported layers, conversion overhead, small matrix shapes, and non-matrix work. Compare complete request latency and quality, not theoretical TOPS.
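
One way to reason about these limits is an Amdahl-style estimate, which is an assumption of this sketch rather than a model the article prescribes. Only the fraction of latency spent in accelerated kernels shrinks, and cast or reformat work is added back as a fraction of baseline latency. The function and parameter names are hypothetical.

```python
def end_to_end_speedup(accelerated_fraction: float,
                       kernel_speedup: float,
                       cast_overhead_fraction: float = 0.0) -> float:
    """Amdahl-style estimate; all fractions are relative to baseline latency."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    new_time = ((1.0 - accelerated_fraction)
                + accelerated_fraction / kernel_speedup
                + cast_overhead_fraction)
    return 1.0 / new_time


# Illustrative inputs, not measurements.
modest_casts = end_to_end_speedup(0.70, 2.0, 0.05)
heavy_casts = end_to_end_speedup(0.70, 2.0, 0.40)
print(f"modest cast overhead: {modest_casts:.2f}x")
print(f"heavy cast overhead:  {heavy_casts:.2f}x")
```

In the second case the conversion work outweighs the kernel gain and the estimate drops below 1x, which is the situation the paragraph above warns about.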

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how mixed precision changes prompt processing and time to first token (TTFT); the right panel asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

FP8 is not merely an integer quantizer with a decimal point. Scaling policy, amax history, format choice, and accumulation determine stability. On newer hardware, microscaling formats apply scales to smaller blocks, changing both accuracy and kernel layout.
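
To make the role of amax history concrete, here is a minimal sketch of a per-tensor scale derived from a rolling window of observed absolute maxima. The `DelayedScale` class is hypothetical and does not mirror any particular library's API; it assumes E4M3 with a maximum finite value of 448.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite value assumed for FP8 E4M3


class DelayedScale:
    """Derives a per-tensor scale from a rolling window of observed amax values."""

    def __init__(self, history_len: int = 16, fmt_max: float = E4M3_MAX):
        self.history = deque(maxlen=history_len)
        self.fmt_max = fmt_max

    def observe(self, tensor) -> None:
        # Record the absolute maximum of a non-empty tensor.
        self.history.append(max(abs(v) for v in tensor))

    def scale(self) -> float:
        amax = max(self.history, default=0.0)
        return self.fmt_max / amax if amax > 0 else 1.0

    def saturated(self, tensor) -> int:
        """Count values that would clip at the current scale."""
        s = self.scale()
        return sum(1 for v in tensor if abs(v) * s > self.fmt_max)


scaler = DelayedScale(history_len=4)
scaler.observe([0.5, -2.0, 1.0])
scaler.observe([0.1, 3.5, -0.7])
current_scale = scaler.scale()
clipped = scaler.saturated([1.0, -3.0, 7.0])
print(current_scale, clipped)
```

Because the scale comes from past observations, a new tensor with a larger range clips until the history catches up. The window length is therefore a stability choice as much as a performance one.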

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, uniform high precision, characterized by a simple numerical policy and higher bandwidth demand. The right panel applies mixed precision, which changes the cost profile: the format is chosen per tensor, and scaling and casts are required. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Bandwidth-bound transformer layers
Models that nearly exceed device memory
Hardware with native low-precision Tensor Cores

Where it disappoints

Using one dtype for every operation
Ignoring calibration and outlier tensors
Counting compressed bytes without kernel support
Validating only perplexity and not task behavior

Production checklist

Inventory formats for weights, activations, KV, and accumulators
Keep numerically sensitive reductions at safe precision
Inspect inserted casts and Q/DQ boundaries
Run task, safety, and long-context evaluations
Benchmark on the exact target accelerator

What to measure

Kernel time by precision
Cast and reformat overhead
Memory footprint by tensor class
Quality delta by task and sequence length
Fallback operations using higher precision

From one GPU to a production service

A single-model experiment can accept one global precision recipe. A model platform needs a compatibility matrix across architectures, GPUs, engine versions, and quality suites. The same label, such as FP8, can represent different scaling and accumulation behavior.

Build artifacts should record precision per layer and the reason for any fallback. Operators need to distinguish a deliberate BF16 normalization from an unsupported FP8 matrix that silently reduced expected throughput.
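
A sketch of what such a record might look like, with a check that separates documented decisions from silent fallbacks. The `LayerRecord` fields and the layer names are hypothetical; a real engine would populate them from its own build metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerRecord:
    name: str
    requested: str                 # dtype the recipe asked for
    effective: str                 # dtype the built engine actually runs
    reason: Optional[str] = None   # why a choice was made, if recorded


def unexplained_fallbacks(records):
    """Layers whose effective dtype differs from the request with no recorded reason."""
    return [r for r in records if r.requested != r.effective and not r.reason]


records = [
    LayerRecord("block0.attn.qkv", "fp8", "fp8"),
    LayerRecord("block0.norm", "bf16", "bf16", reason="deliberate: normalization stays BF16"),
    LayerRecord("block0.mlp.up", "fp8", "bf16"),  # silent fallback
    LayerRecord("block1.mlp.up", "fp8", "bf16", reason="no FP8 kernel for this shape"),
]

for r in unexplained_fallbacks(records):
    print(f"{r.name}: requested {r.requested}, ran {r.effective}, no reason recorded")
```

A build gate could fail whenever this list is non-empty, which turns a silent throughput regression into a visible review item.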

Quality gates should include deterministic numerical checks, task evaluations, safety behavior, long-context stress, and calibration drift. Precision is a rollout dimension like a new model, not a transparent storage change.

Design-review questions

Which operations remain at higher precision and why?
What scaling recipe and calibration corpus were used?
How many runtime reformats occur per request?
Does quality change with context length or language?
Can the engine fall back without violating capacity assumptions?

How it connects to the rest of the series

Quantized kernels determine whether low-bit formats are actually fast. Graph optimization can fuse precision boundaries. KV caching and offloading may use a different dtype from model weights.

From equation to implementation

Numerical error enters through rounding, clipping, underflow, and accumulated summation. Matrix multiplications often tolerate low-precision inputs because accumulators retain more precision. Softmax exponentials, normalization statistics, and small residual differences may require wider formats.
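
Rounding, clipping, and underflow can each be observed with a single cast. The sketch below uses FP16 because the standard library can round to it directly through the `struct` format `e`; the same failure modes apply to FP8 at much smaller thresholds. The `saturate` flag is an illustrative stand-in for a saturating cast.

```python
import math
import struct

FP16_MAX = 65504.0


def to_fp16(x: float, saturate: bool = False) -> float:
    """Round to IEEE half precision.

    With saturate=True, out-of-range values clip to +/-FP16_MAX;
    otherwise they overflow to +/-inf.
    """
    if saturate:
        x = max(-FP16_MAX, min(FP16_MAX, x))
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


rounded = to_fp16(1.0001)                  # finer than FP16 resolution near 1.0
clipped = to_fp16(70000.0, saturate=True)  # clamped to the largest finite value
overflowed = to_fp16(70000.0)              # no clamp: becomes infinity
underflowed = to_fp16(1e-8)                # smaller than the smallest FP16 subnormal

print(rounded, clipped, overflowed, underflowed)
```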

Precision propagation should be observable in the built engine. Quantize and dequantize boundaries, inserted reformats, and fallback layers define the actual recipe. A configuration that says FP8 can still execute large regions in FP16.

Implementation sketch

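# Pass 1: profile tensor ranges on representative calibration data.
for tensor_group in calibration_set:
    collect_amax_and_outliers(tensor_group)

# Pass 2: choose a precision for each operation.
for op in graph:
    if op.is_low_precision_safe and kernel_supported(op.shape):
        set_input_precision(op, FP8)
        # Accumulate in a wider format than the inputs.
        set_accumulator_precision(op, FP16_or_FP32)
    else:
        # Sensitive or unsupported operations stay in BF16.
        keep(op, BF16)

# Pass 3: build, then validate against the high-precision reference.
build_engine()
compare_outputs(reference, candidate, task_metrics)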
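
The sketch above leaves its helpers undefined. A runnable toy version of the per-operation decision might look like the following. Everything in it is an assumption made for illustration: the `Op` record, the set of sensitive operation kinds, and especially the rule that an FP8 kernel exists only when every matrix dimension is a multiple of 16.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    kind: str            # e.g. "matmul", "softmax", "layernorm"
    shape: tuple         # (M, N, K) for matmuls, else ()
    input_dtype: str = "bf16"
    accum_dtype: str = "fp32"
    note: str = ""


SENSITIVE_KINDS = {"softmax", "layernorm", "logits"}  # assumed policy


def kernel_supported(shape) -> bool:
    # Hypothetical rule: an FP8 kernel exists only when every dim is a multiple of 16.
    return len(shape) == 3 and all(d % 16 == 0 for d in shape)


def assign_precision(graph):
    for op in graph:
        if op.kind in SENSITIVE_KINDS:
            op.note = "kept BF16: numerically sensitive"
        elif op.kind == "matmul" and kernel_supported(op.shape):
            op.input_dtype = "fp8"      # accumulator stays FP32
        else:
            op.note = "kept BF16: no FP8 kernel for shape"
    return graph


graph = [
    Op("qkv_proj", "matmul", (4096, 12288, 4096)),
    Op("attn_softmax", "softmax", ()),
    Op("odd_proj", "matmul", (4096, 1000, 4090)),
    Op("final_norm", "layernorm", ()),
]
for op in assign_precision(graph):
    print(f"{op.name:13s} in={op.input_dtype:4s} acc={op.accum_dtype} {op.note}")
```

Recording a note for every operation that stays in BF16 keeps deliberate choices distinguishable from kernel gaps when the built engine is inspected later.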

Capacity planning

Weight bytes are easy to calculate; runtime memory is not. Add scales, duplicate packed layouts, activation buffers, KV dtype, engine tactics, and graph-capture pools. Validate with peak allocated memory under maximum concurrent shape.
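
The KV cache is often the largest of these runtime terms, and its size follows from a standard product of model shape, sequence length, batch size, and element width. The model shape and load below are hypothetical values chosen only to show the scale; they do not describe a specific model.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Key plus value cache size in decimal GB for one batch of sequences."""
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = keys and values
    return elems * bytes_per_elem / 1e9


# Hypothetical model shape and load, chosen only for illustration.
shape = dict(layers=80, kv_heads=8, head_dim=128, seq_len=32_768, batch=16)
kv_bf16 = kv_cache_gb(**shape, bytes_per_elem=2)
kv_fp8 = kv_cache_gb(**shape, bytes_per_elem=1)

print(f"KV cache at 2 bytes/elem: {kv_bf16:.1f} GB")
print(f"KV cache at 1 byte/elem:  {kv_fp8:.1f} GB")
```

Under these assumptions the cache alone exceeds the one-byte weight footprint from the worked example, which is why the KV dtype belongs in the capacity plan alongside the weight dtype.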

Benchmarking without fooling yourself

Use both layer-wise error and end-task evaluation.
Sweep prompt length because activation ranges can shift.
Inspect real kernel precision and reformat time.
Compare accuracy and latency on the same hardware and engine profile.

A production failure to design for

A calibration set contains short English prompts only. A multilingual long-context workload produces larger activation outliers, saturating FP8 ranges in one attention block. Quality degrades only for that tenant. Calibration coverage must reflect sequence length, language, and modality.
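
A per-group clip rate is one simple signal that would surface this failure. The sketch below counts how many observed activations exceed the amax that calibration assumed, grouped by a label such as tenant or workload type. The function name, labels, and sample values are all illustrative.

```python
from collections import defaultdict


def clip_rate_by_group(samples, calibrated_amax):
    """Fraction of activations per group that exceed the calibrated amax.

    samples: iterable of (group_label, activation_values)
    """
    total = defaultdict(int)
    clipped = defaultdict(int)
    for group, values in samples:
        for v in values:
            total[group] += 1
            if abs(v) > calibrated_amax:
                clipped[group] += 1
    return {g: clipped[g] / total[g] for g in total}


# Made-up activation samples; the calibrated amax of 4.0 is also hypothetical.
samples = [
    ("en-short", [0.5, -3.0, 2.0, 1.0]),
    ("multilingual-long", [0.5, 9.0, -12.0, 1.0]),
]
rates = clip_rate_by_group(samples, calibrated_amax=4.0)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of values beyond calibrated range")
```

An aggregate clip rate would hide the problem here, since the well-covered group dilutes it. Grouping by the dimensions calibration was supposed to cover makes the gap visible.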

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Calibrate to Build, then Validate and Monitor. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Mixed precision is a per-operation numerical contract, not a model-wide dtype switch. Weights, activations, accumulators, normalization statistics, logits, and KV state have different sensitivity and bandwidth costs. A robust plan assigns formats intentionally and inserts casts only at known boundaries.

Fast paths use narrow formats where error is bounded and wider formats where it compounds.

Description: Follow the state from Load weights through Matmul and Sensitive ops to Output. Each box is an ownership or computation boundary. In particular, the precision recipe must travel with the engine and calibration artifacts. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

FP16 has limited exponent range; BF16 keeps FP32-like range with fewer mantissa bits. FP8 formats trade range and precision differently, so scaling history and amax collection become runtime state. Accumulation commonly uses a wider format because thousands of small products can lose information or overflow even when each input is representable.

Actual speedup requires kernels that consume the chosen formats without cast overhead.

Description: The bars compare uniform FP16 with mixed FP8/BF16 on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (lower bandwidth and larger effective memory capacity) remains larger than the risk (bad scales cause saturation, underflow, or unstable logits) under production traffic.

Calibration must cover real prompt lengths, activation outliers, adapters, and decoding phases. Prefill and decode can have different distributions. Per-tensor scaling is cheap but coarse; per-channel or block scaling reduces error at metadata and kernel cost. Evaluate task quality and logit divergence, not only perplexity.
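
The difference between coarse and fine scaling granularity can be simulated without special hardware. The sketch below rounds values to an assumed E4M3 grid (3 mantissa bits, minimum normal exponent of -6, saturation at 448) and compares one scale for the whole tensor against one scale per block of four values. It is a numerical illustration with made-up data, not a model of any specific kernel.

```python
import math

E4M3_MAX = 448.0
E4M3_MIN_NORMAL_EXP = -6   # smallest normal exponent assumed for E4M3
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Simulate a saturating cast to FP8 E4M3 (round to nearest, ties to even)."""
    if x == 0.0:
        return 0.0
    ax = abs(x)
    exp = max(math.frexp(ax)[1] - 1, E4M3_MIN_NORMAL_EXP)  # floor(log2(ax)), clamped
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    q = min(round(ax / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def fake_quant(values, block_size):
    """Quantize and dequantize with one amax-derived scale per block."""
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        amax = max(abs(v) for v in block)
        scale = E4M3_MAX / amax if amax > 0 else 1.0
        out.extend(round_to_e4m3(v * scale) / scale for v in block)
    return out


def mean_rel_error(ref, approx):
    return sum(abs(a - r) / abs(r) for r, a in zip(ref, approx)) / len(ref)


# Small values share a tensor with large outliers.
values = [0.001, 0.002, 0.003, 0.004, 100.0, 200.0, 300.0, 400.0]
per_tensor = mean_rel_error(values, fake_quant(values, block_size=len(values)))
per_block = mean_rel_error(values, fake_quant(values, block_size=4))
print(f"per-tensor scale: mean relative error {per_tensor:.3f}")
print(f"per-block scale:  mean relative error {per_block:.3f}")
```

With a single scale, the outliers push the small values into the subnormal range where they lose most of their precision. The per-block version stores twice as many scales in exchange for a lower error on the small values.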

Precision metadata is versioned deployment state, not an offline suggestion.

Description: State advances from Calibrate to Compile, Validate, and finally Serve. The labels below each state identify what becomes true at that boundary. The governing invariant is: Any model, adapter, kernel, or hardware change invalidates part of the recipe. Retries and cancellation must preserve the same transition rules.

A mixed-precision engine is a graph of numerical choices.

Description: The four panels are independent review axes: Weights, Activations, Accumulators, and State. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Report the effective dtype of each major operator in engine metadata.

Mixed precision needs semantic canaries in addition to runtime error checks.

Description: This is a causal chain, not four unrelated symptoms. The "Traffic shifts" box leads to "Values clip", which leads to "Quality falls". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Mixed precision is a numerical architecture. The goal is not the smallest dtype; it is the cheapest validated path through the model.
