8/20 - Mixed Precision Inference: Spend Bits Where They Matter

Precision is not one switch for an entire model. Weights, activations, accumulators, normalization, logits, and KV cache have different numerical needs. Mixed precision is the discipline of spending bits where error is dangerous and saving them where bandwidth dominates.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A map does not use the same resolution everywhere. A highway overview can be coarse; a surgical route through city streets needs detail. Mixed precision assigns numerical resolution to each operation instead of shrinking everything blindly.

Figure: Mechanism flow for the mixed precision inference request path. Stage 01, High-precision model: profile tensor ranges, choose safe formats. Stage 02, Quantized execution: scale and accumulate, protect sensitive ops. Stage 03, Validated output: compare quality, measure real speedup. (Input → Transform → Outcome.)
Follow the state and work from left to right.

How to read this diagram: Start with the high-precision model, where tensor ranges are profiled and safe formats are chosen. The middle stage, quantized execution, scales and accumulates values while protecting sensitive operations. The final stage, validated output, shows the observable result: quality is compared against the reference and real speedup is measured. The arrows describe dependency order, not necessarily separate services.

What actually happens

BF16 and FP16 store 16 bits but trade dynamic range differently. BF16 keeps an FP32-like exponent with fewer mantissa bits; FP16 offers more mantissa precision but a narrower range. FP8 introduces formats such as E4M3 and E5M2, plus per-tensor or block scaling.
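To make the range-versus-precision trade concrete, the following minimal sketch derives the largest finite value and the relative spacing of each format from its exponent and mantissa bit counts. The `FloatFormat` class and its field names are illustrative only. The `finite_only` branch assumes the common E4M3 encoding that has no infinities and reserves a single mantissa pattern for NaN, which is why its maximum comes out as 448 rather than the 240 an IEEE-style layout would give.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exp_bits: int
    man_bits: int
    finite_only: bool = False  # assumption: E4M3-style encoding without infinities

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_finite(self) -> float:
        if self.finite_only:
            # The top exponent code holds normal values; only the
            # all-ones mantissa under it is reserved for NaN.
            return (2 - 2.0 ** (1 - self.man_bits)) * 2.0 ** (self.bias + 1)
        # IEEE-style: the top exponent code is reserved for inf/NaN.
        return (2 - 2.0 ** -self.man_bits) * 2.0 ** self.bias

    @property
    def epsilon(self) -> float:
        # Gap between 1.0 and the next representable value.
        return 2.0 ** -self.man_bits


FORMATS = [
    FloatFormat("FP16", 5, 10),
    FloatFormat("BF16", 8, 7),
    FloatFormat("FP8 E5M2", 5, 2),
    FloatFormat("FP8 E4M3", 4, 3, finite_only=True),
]

for fmt in FORMATS:
    print(f"{fmt.name:9s} max={fmt.max_finite:.4g} eps={fmt.epsilon:.4g}")
```

The output shows the trade directly: BF16 reaches roughly the FP32 range with a coarse step, FP16 has a fine step but tops out at 65504, and the two FP8 formats split the remaining bits between range (E5M2) and precision (E4M3).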

Matrix inputs may use low precision while accumulation remains FP16 or FP32. Normalization, reductions, softmax, and final logits are often kept at higher precision. The optimal recipe depends on hardware instructions and the model’s tensor distributions.
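A small experiment can show why the accumulator is often wider than the inputs. The sketch below rounds values to FP16 with the standard library's `struct` half-precision format and computes the same dot product twice: once rounding the running sum to FP16 after every addition, once keeping it in a Python float. The Python float is FP64 and only stands in for an FP32 accumulator; the helper names are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot(a, b, narrow_accumulator: bool) -> float:
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)  # inputs are FP16 in both variants
        if narrow_accumulator:
            acc = to_fp16(acc + product)   # running sum is rounded to FP16
        else:
            acc = acc + product            # running sum stays wide
    return acc


a = [1.0] * 4096
b = [1.0] * 4096
narrow = dot(a, b, narrow_accumulator=True)
wide = dot(a, b, narrow_accumulator=False)
print(narrow, wide)
```

Every input is exactly representable in FP16, yet the narrow accumulator stalls at 2048: at that magnitude FP16 values are spaced 2.0 apart, so adding 1.0 no longer changes the sum. The wide accumulator returns the exact 4096.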

Low-precision storage helps only when kernels consume that format efficiently. Reformatting tensors around unsupported operations can erase gains. Engine builders therefore search for coherent precision regions and insert quantize-dequantize boundaries carefully.

A worked example

A 70-billion-parameter model needs about 140 GB for BF16 weights and about 70 GB at one byte per weight, before metadata and scales. That difference may change the deployment from four 40 GB GPUs to two, but activation and KV formats still determine runtime capacity.
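The arithmetic above can be written down as a small helper, which also makes the "before metadata and scales" caveat measurable. This is a sketch: the function name is illustrative, GB means 10^9 bytes, and the block-scaling recipe (one 4-byte scale per 128 weights) is an assumed example rather than a property of any particular format.

```python
def weight_footprint_gb(num_params: int, bytes_per_weight: float,
                        block_size: int = 0, bytes_per_scale: int = 0) -> float:
    """Weight storage in GB (10^9 bytes), optionally with one scale per block."""
    total = num_params * bytes_per_weight
    if block_size:
        total += (num_params / block_size) * bytes_per_scale
    return total / 1e9


PARAMS = 70_000_000_000

bf16_gb = weight_footprint_gb(PARAMS, 2)
fp8_gb = weight_footprint_gb(PARAMS, 1)
# Assumed recipe: one 4-byte scale for every block of 128 weights.
fp8_block_gb = weight_footprint_gb(PARAMS, 1, block_size=128, bytes_per_scale=4)

print(f"BF16: {bf16_gb:.1f} GB, FP8: {fp8_gb:.1f} GB, "
      f"FP8 with block scales: {fp8_block_gb:.2f} GB")
```

Under these assumptions the scales add a little over 2 GB, which is small next to the weights but not negligible when the model only just fits.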

The performance model

Lower precision reduces memory traffic and can unlock higher Tensor Core throughput. Speedup is limited by unsupported layers, conversion overhead, small matrix shapes, and non-matrix work. Compare complete request latency and quality, not theoretical TOPS.

Figure: Phase fit, where mixed precision changes inference. Prefill: many prompt tokens in parallel, high arithmetic intensity; mixed precision lowers matrix traffic and compute cost. Decode: one new token per iteration, weight and KV bandwidth pressure; mixed precision reduces weight and KV bandwidth. Prove it with: TTFT, TPOT, saturation, quality. Deployment decision: use a phase-aware precision recipe.
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how mixed precision changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metrics that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
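For the decode phase, a back-of-the-envelope bound helps set expectations before any benchmark runs. The sketch below assumes decode is purely bandwidth-bound and that every weight and KV byte is read once per generated token. The bandwidth and KV figures are placeholder assumptions, not measurements, and the function name is illustrative.

```python
def decode_floor_ms(weight_bytes: float, kv_bytes: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per decoded token if every byte is read once."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s * 1e3


BANDWIDTH = 2e12   # assumed aggregate memory bandwidth, bytes/s
KV_BYTES = 10e9    # assumed KV bytes read per step, dtype left unchanged

bf16_ms = decode_floor_ms(140e9, KV_BYTES, BANDWIDTH)
fp8_ms = decode_floor_ms(70e9, KV_BYTES, BANDWIDTH)

print(f"BF16 weights: {bf16_ms:.1f} ms, FP8 weights: {fp8_ms:.1f} ms, "
      f"ratio {bf16_ms / fp8_ms:.2f}x")
```

Halving the weight bytes yields less than a 2x bound here because the KV traffic is untouched. Measured TPOT will be higher than either floor, which is why the bound is a sanity check and not a substitute for end-to-end latency.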

Expert lens

FP8 is not merely an integer quantizer with a decimal point. Scaling policy, amax history, format choice, and accumulation determine stability. On newer hardware, microscaling formats apply scales to smaller blocks, changing both accuracy and kernel layout.
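One way to picture an amax-history scaling policy is a rolling window of observed absolute maxima, with the scale chosen so the largest recent value maps to the top of the FP8 range. The class below is a hypothetical sketch of that idea, not the API of any library; the window length and the `margin` parameter are assumptions.

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite value in the E4M3 format


class AmaxHistoryScaler:
    """Derive a quantization scale from a rolling window of observed amax values."""

    def __init__(self, history_len: int = 16, margin: float = 1.0):
        self.history = deque(maxlen=history_len)
        self.margin = margin  # values above 1.0 leave headroom for new outliers

    def observe(self, values) -> None:
        self.history.append(max((abs(v) for v in values), default=0.0))

    def scale(self) -> float:
        amax = max(self.history, default=0.0)
        if amax == 0.0:
            return 1.0  # nothing observed yet; fall back to identity scaling
        return E4M3_MAX / (amax * self.margin)


scaler = AmaxHistoryScaler(history_len=2)
scaler.observe([0.5, -2.0])
scaler.observe([4.0, 1.0])
print(scaler.scale())   # window holds amax 2.0 and 4.0

scaler.observe([1.0])
scaler.observe([0.25])  # the 4.0 observation has now aged out of the window
print(scaler.scale())
```

The window length is a stability trade: a short history reacts quickly but lets the scale grow after an outlier ages out, so the next outlier clips; a long history is safer against clipping but wastes range on stale peaks.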

Figure: Trade-off map. Baseline, uniform high precision: simple numerical policy, higher bandwidth demand, larger model footprint, broad operator support. Optimized, mixed precision: format chosen per tensor, scaling and casts required, lower traffic and footprint, needs hardware-aware validation. Measure both sides under the same workload.
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, uniform high precision, characterized by a simple numerical policy and higher bandwidth demand. The right panel applies mixed precision, which changes the cost profile: formats are chosen per tensor, and scaling and casts become required. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Bandwidth-bound transformer layers
  • Models that nearly exceed device memory
  • Hardware with native low-precision Tensor Cores

Where it disappoints

  • Using one dtype for every operation
  • Ignoring calibration and outlier tensors
  • Counting compressed bytes without kernel support
  • Validating only perplexity and not task behavior

Production checklist

  • Inventory formats for weights, activations, KV, and accumulators
  • Keep numerically sensitive reductions at safe precision
  • Inspect inserted casts and Q/DQ boundaries
  • Run task, safety, and long-context evaluations
  • Benchmark on the exact target accelerator

What to measure

  • Kernel time by precision
  • Cast and reformat overhead
  • Memory footprint by tensor class
  • Quality delta by task and sequence length
  • Fallback operations using higher precision

From one GPU to a production service

A single-model experiment can accept one global precision recipe. A model platform needs a compatibility matrix across architectures, GPUs, engine versions, and quality suites. The same label, such as FP8, can represent different scaling and accumulation behavior.

Build artifacts should record precision per layer and the reason for any fallback. Operators need to distinguish a deliberate BF16 normalization from an unsupported FP8 matrix that silently reduced expected throughput.

Quality gates should include deterministic numerical checks, task evaluations, safety behavior, long-context stress, and calibration drift. Precision is a rollout dimension like a new model, not a transparent storage change.

Design-review questions

  • Which operations remain at higher precision and why?
  • What scaling recipe and calibration corpus were used?
  • How many runtime reformats occur per request?
  • Does quality change with context length or language?
  • Can the engine fall back without violating capacity assumptions?

How it connects to the rest of the series

Quantized kernels determine whether low-bit formats are actually fast. Graph optimization can fuse precision boundaries. KV caching and offloading may use a different dtype from model weights.

From equation to implementation

Numerical error enters through rounding, clipping, underflow, and accumulated summation. Matrix multiplications often tolerate low-precision inputs because accumulators retain more precision. Softmax exponentials, normalization statistics, and small residual differences may require wider formats.
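Three of these error sources can be reproduced with the standard library alone. The sketch below rounds Python floats to FP16 through `struct` and maps overflow to infinity so it can be inspected; the helper names are illustrative. It then compares a naive sum of exponentials with the max-subtracted form commonly used to keep softmax in range.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE half precision; overflow becomes +/-inf instead of raising."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def naive_exp_sum_fp16(logits) -> float:
    # Each exponential is stored in FP16 before summation.
    return sum(to_fp16(math.exp(v)) for v in logits)


def shifted_exp_sum_fp16(logits) -> float:
    # Subtracting the max keeps every exponential in (0, 1].
    m = max(logits)
    return sum(to_fp16(math.exp(v - m)) for v in logits)


print(to_fp16(0.1))             # rounding: 0.1 is not representable exactly
print(to_fp16(math.exp(12.0)))  # overflow: e^12 exceeds the FP16 maximum of 65504
print(to_fp16(1e-8))            # underflow: below the smallest FP16 subnormal

logits = [12.0, 3.0]
print(naive_exp_sum_fp16(logits), shifted_exp_sum_fp16(logits))
```

A logit of 12 is unremarkable, yet its exponential already overflows FP16, which is one reason softmax is typically computed with max subtraction, a wider type, or both.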

Precision propagation should be observable in the built engine. Quantize and dequantize boundaries, inserted reformats, and fallback layers define the actual recipe. A configuration that says FP8 can still execute large regions in FP16.

Implementation sketch

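# Pseudocode: helper names are illustrative, not a specific engine's API.
# 1. Calibrate on representative traffic.
stats = {}
for tensor_group in calibration_set:
    stats[tensor_group.name] = collect_amax_and_outliers(tensor_group)
# 2. Choose a precision per op; fall back when the op is sensitive or unsupported.
for op in graph:
    if op.is_low_precision_safe and kernel_supported(op.shape, FP8):
        set_input_precision(op, FP8, scales=stats)
        set_accumulator_precision(op, FP32)  # FP16 only where validated
    else:
        keep(op, BF16)
# 3. Build, then validate against the high-precision reference.
candidate = build_engine(graph)
compare_outputs(reference, candidate, task_metrics)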
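A runnable toy version of the assignment loop is sketched below. Everything in it is hypothetical: the `Op` record, the list of sensitive operator kinds, and especially the kernel-support rule (dimensions divisible by 16), which only stands in for whatever a real engine reports. The point is that each operator ends up with a precision and a recorded reason, so a deliberate BF16 choice can be told apart from an unsupported-kernel fallback.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    kind: str                 # e.g. "matmul", "softmax", "layernorm"
    shape: tuple
    precision: str = "BF16"
    accumulator: str = "FP32"
    reason: str = "default"


SENSITIVE_KINDS = {"softmax", "layernorm", "logits"}  # assumed policy


def kernel_supported(op: Op) -> bool:
    # Hypothetical rule: pretend FP8 kernels need dimensions divisible by 16.
    return all(dim % 16 == 0 for dim in op.shape)


def assign_precision(graph):
    for op in graph:
        if op.kind in SENSITIVE_KINDS:
            op.precision, op.reason = "BF16", "deliberate: numerically sensitive"
        elif not kernel_supported(op):
            op.precision, op.reason = "BF16", "fallback: no FP8 kernel for shape"
        else:
            op.precision, op.reason = "FP8", "low-precision safe"
    return graph


graph = assign_precision([
    Op("attn.qkv", "matmul", (4096, 12288)),
    Op("attn.softmax", "softmax", (4096, 4096)),
    Op("mlp.odd", "matmul", (4096, 11009)),
])

for op in graph:
    print(f"{op.name:13s} {op.precision:5s} acc={op.accumulator} ({op.reason})")
```

In a real build the same per-operator record would be read back from the engine after compilation, since the builder may override what was requested.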

Capacity planning

Weight bytes are easy to calculate; runtime memory is not. Add scales, duplicate packed layouts, activation buffers, KV dtype, engine tactics, and graph-capture pools. Validate with peak allocated memory under maximum concurrent shape.
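KV cache is usually the largest of these runtime terms, and its size follows from a short formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies it to an assumed configuration (80 layers, 8 KV heads, head dimension 128); those numbers and the batch and sequence length are illustrative, not taken from a specific deployment.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache for `batch` sequences of `seq_len` tokens each."""
    # The factor 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


CONFIG = dict(layers=80, kv_heads=8, head_dim=128)  # assumed model shape

fp16_kv = kv_cache_bytes(**CONFIG, seq_len=8192, batch=32, bytes_per_elem=2)
fp8_kv = kv_cache_bytes(**CONFIG, seq_len=8192, batch=32, bytes_per_elem=1)

print(f"16-bit KV: {fp16_kv / 1e9:.1f} GB, 8-bit KV: {fp8_kv / 1e9:.1f} GB")
```

At this concurrency the 16-bit KV cache is larger than the 8-bit weights of a 70-billion-parameter model, which is why the KV dtype belongs in the capacity plan alongside the weight format.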

Benchmarking without fooling yourself

  • Use both layer-wise error and end-task evaluation.
  • Sweep prompt length because activation ranges can shift.
  • Inspect real kernel precision and reformat time.
  • Compare accuracy and latency on the same hardware and engine profile.
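For the layer-wise half of that comparison, two simple metrics go a long way: relative L2 error between reference and candidate activations, and KL divergence between the output distributions implied by their logits. The following is a minimal pure-Python sketch with illustrative function names; a real harness would run on tensors and aggregate over many prompts.

```python
import math


def relative_l2_error(reference, candidate) -> float:
    """||reference - candidate|| / ||reference||, for one layer's output."""
    num = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, candidate)))
    den = math.sqrt(sum(r * r for r in reference))
    if den == 0.0:
        return 0.0 if num == 0.0 else math.inf
    return num / den


def softmax(logits):
    m = max(logits)  # max subtraction keeps the exponentials in range
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, cand_logits) -> float:
    """KL(reference || candidate) over next-token distributions, in nats."""
    p, q = softmax(ref_logits), softmax(cand_logits)
    # Terms with p == 0 contribute nothing; q is assumed not to underflow to 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


print(relative_l2_error([3.0, 4.0], [3.0, 4.5]))
print(kl_divergence([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))
print(kl_divergence([2.0, 1.0, 0.0], [1.8, 1.1, 0.2]))
```

Neither number is a quality verdict on its own. They localize where a recipe diverges, while task evaluations decide whether the divergence matters.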

A production failure to design for

A calibration set contains short English prompts only. A multilingual long-context workload produces larger activation outliers, saturating FP8 ranges in one attention block. Quality degrades only for that tenant. Calibration coverage must reflect sequence length, language, and modality.
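This failure can be caught with a saturation counter: the fraction of values that exceed the format's maximum once the calibrated scale is applied. The sketch below uses made-up activation values as stand-ins for the two traffic types, and the function name is illustrative.

```python
E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def saturation_rate(values, scale: float, fmt_max: float = E4M3_MAX) -> float:
    """Fraction of values that would clip after multiplying by `scale`."""
    if not values:
        return 0.0
    clipped = sum(1 for v in values if abs(v) * scale > fmt_max)
    return clipped / len(values)


# Stand-in for activations seen during calibration on short English prompts.
calibration = [0.5, -1.0, 2.0, -4.0]
scale = E4M3_MAX / max(abs(v) for v in calibration)

# Stand-in for a multilingual long-context tenant with larger outliers.
production = [0.5, -3.0, 9.0, -20.0, 1.5]

print(saturation_rate(calibration, scale))
print(saturation_rate(production, scale))
```

The calibration data never clips by construction, so a nonzero rate in production is direct evidence that the scale no longer covers the traffic. Tracking the rate per layer and per tenant makes the affected attention block and workload visible before quality complaints arrive.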

Figure: Operating loop. 1 Calibrate: ranges and outliers, representative traffic. 2 Build: precision per op, inspect reformats. 3 Validate: task and safety evals, long-context checks. 4 Monitor: fallbacks and drift, hardware-specific speed. Measure → Learn → Repeat.
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Calibrate to Build, then Validate and Monitor. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Mixed precision is a per-operation numerical contract, not a model-wide dtype switch. Weights, activations, accumulators, normalization statistics, logits, and KV state have different sensitivity and bandwidth costs. A robust plan assigns formats intentionally and inserts casts only at known boundaries.

Figure: A mixed-precision transformer path. Load weights: FP8 or FP16 storage, apply scale metadata, check finite range. Matmul: low-precision inputs, wider accumulation, fused epilogue. Sensitive ops: norm and softmax, use safer precision, avoid overflow. Output: cast hidden state, protect logits, record recipe. Note: the precision recipe must travel with the engine and calibration artifacts.
Fast paths use narrow formats where error is bounded and wider formats where it compounds.

How to read this diagram: Follow the state from Load weights through Matmul and Sensitive ops to Output. Each box is an ownership or computation boundary. In particular, the precision recipe must travel with the engine and calibration artifacts. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

FP16 has limited exponent range; BF16 keeps FP32-like range with fewer mantissa bits. FP8 formats trade range and precision differently, so scaling history and amax collection become runtime state. Accumulation commonly uses a wider format because thousands of small products can lose information or overflow even when each input is representable.

Figure: Precision changes both bytes and numerical margin. Bars: Uniform FP16 (simple, 2 bytes) versus Mixed FP8/BF16 (less traffic). Numerical risk: bad scales cause saturation, underflow, or unstable logits. System gain: lower bandwidth and larger effective memory capacity.
Actual speedup requires kernels that consume the chosen formats without cast overhead.

How to read this diagram: The bars compare Uniform FP16 with Mixed FP8/BF16 on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (lower bandwidth and larger effective memory capacity) remains larger than the risk (bad scales causing saturation, underflow, or unstable logits) under production traffic.

Calibration must cover real prompt lengths, activation outliers, adapters, and decoding phases. Prefill and decode can have different distributions. Per-tensor scaling is cheap but coarse; per-channel or block scaling reduces error at metadata and kernel cost. Evaluate task quality and logit divergence, not only perplexity.
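The cost of coarse scaling shows up as soon as one outlier shares a scale with many small values. The sketch below compares one scale for the whole tensor against one scale per block of four values. Note the substitution: it quantizes onto a uniform 127-level grid, an integer-style stand-in, not a real FP8 format. FP8 has finer spacing near zero, so this exaggerates the effect, but the direction of the comparison is the same. All names and values are illustrative.

```python
def quantize_dequantize(values, levels: int = 127):
    """Symmetric round-to-nearest on a uniform grid scaled by the group's amax."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = levels / amax
    return [round(v * scale) / scale for v in values]


def blockwise(values, block_size: int):
    """Apply quantize_dequantize independently to each block of values."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size]))
    return out


def mean_abs_error(reference, candidate) -> float:
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


# Seven small activations and one outlier in the last block.
values = [0.01, -0.02, 0.03, 0.015, 0.02, -0.01, 0.025, 100.0]

per_tensor = mean_abs_error(values, quantize_dequantize(values))
per_block = mean_abs_error(values, blockwise(values, block_size=4))

print(f"per-tensor error: {per_tensor:.5f}, per-block error: {per_block:.5f}")
```

With a single scale, the outlier flattens every small value to zero. With block scales, only the block that contains the outlier pays that price, at the cost of storing one extra scale per block and requiring kernels that can apply it.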

Figure: Precision recipe lifecycle. Calibrate (collect ranges) → Compile (select kernels) → Validate (compare quality) → Serve (monitor drift). Invariant: any model, adapter, kernel, or hardware change invalidates part of the recipe.
Precision metadata is versioned deployment state, not an offline suggestion.

How to read this diagram: State advances from Calibrate to Compile, Validate, and finally Serve. The labels below each state identify what becomes true at that boundary. The governing invariant is: Any model, adapter, kernel, or hardware change invalidates part of the recipe. Retries and cancellation must preserve the same transition rules.

Figure: Where precision decisions differ. Weights: capacity and bandwidth, outlier-aware formats. Activations: dynamic range by layer, prefill versus decode. Accumulators: reduction stability, wider internal type. State: KV cache and logits, long-horizon error. Note: report the effective dtype of each major operator in engine metadata.
A mixed-precision engine is a graph of numerical choices.

How to read this diagram: The four panels are independent review axes: Weights, Activations, Accumulators, and State. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Report the effective dtype of each major operator in engine metadata.

Figure: Scale drift can fail without NaNs. Traffic shifts: new activation outliers, old scales persist. Values clip: layers stay finite, logits subtly move. Quality falls: answers degrade, health stays green. Control: shadow wider precision, roll back recipe. Note: monitor saturation and task quality because numerical failures may remain finite.
Mixed precision needs semantic canaries in addition to runtime error checks.

How to read this diagram: This is a causal chain, not four unrelated symptoms. A traffic shift causes values to clip, which in turn causes quality to fall. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Mixed precision is a numerical architecture. The goal is not the smallest dtype; it is the cheapest validated path through the model.