Quantized Kernels: Why a 4-Bit Model Is Not Automatically Fast
A checkpoint can be four bits on disk and still run slowly. Compression changes storage. A quantized kernel determines whether the GPU can multiply packed values, apply scales, accumulate safely, and avoid unpacking the advantage away.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
Vacuum-packing clothes makes a suitcase smaller. It does not make getting dressed faster if every shirt must be unpacked, ironed, and repacked before use. Quantized kernels are the machinery that uses compressed values efficiently.
What actually happens
Weight-only kernels keep activations in FP16 or BF16 and store weights as INT4 or INT8. They load packed weights, apply per-group scales and zero points, and multiply without materializing a full high-precision weight matrix in HBM.
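The per-group arithmetic can be sketched in plain Python. This is a minimal illustration, not a kernel: the function names are hypothetical, the asymmetric scheme with an integer zero point is one common choice among several, and a real implementation would operate on packed tiles in registers rather than Python lists.

```python
def quantize_group(weights, bits=4):
    """Asymmetric quantization of one group -> (codes, scale, zero_point)."""
    qmax = (1 << bits) - 1
    # Include 0.0 in the range so the zero point always fits in [0, qmax].
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0  # an all-zero group falls back to scale 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Reference reconstruction: w is approximately (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


def quantized_dot(activations, codes, scales, zero_points, group_size):
    """Dot product that applies scale and zero point per group on the fly,
    without ever building the full high-precision weight vector."""
    acc = 0.0
    for g, (scale, zp) in enumerate(zip(scales, zero_points)):
        start = g * group_size
        partial = 0.0
        for i in range(start, start + group_size):
            partial += activations[i] * (codes[i] - zp)
        acc += partial * scale  # one scale multiply per group, not per weight
    return acc
```

Applying the scale once per group after the inner accumulation mirrors what a fused kernel tries to do: the expensive part stays on small integers and the metadata is touched once per group.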
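Weight-and-activation schemes such as W8A8 can use native integer matrix instructions, but activation outliers make calibration harder. SmoothQuant migrates that difficulty from activations into weights with a mathematically equivalent per-channel scaling: each activation channel is divided by a factor and the matching weight row is multiplied by the same factor, so the product is unchanged.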
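The scaling identity behind SmoothQuant is easy to check numerically. The sketch below uses plain lists and hypothetical helper names; the migration strength alpha = 0.5 is an assumed default, and every channel is assumed to have a nonzero maximum in both the activations and the weights.

```python
def smooth(X, W, alpha=0.5):
    """X is tokens x channels, W is channels x outputs.
    Returns (X / s, s * W) with one factor s[j] per input channel."""
    n_channels = len(W)
    s = []
    for j in range(n_channels):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        # Assumes act_max > 0 and w_max > 0 for every channel.
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_channels)] for row in X]
    W_s = [[v * s[j] for v in W[j]] for j in range(n_channels)]
    return X_s, W_s


def matmul(A, B):
    """Small reference matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Channel 1 carries an activation outlier.
X = [[0.5, 40.0], [-1.0, -60.0]]
W = [[0.2, -0.3], [0.1, 0.4]]
X_s, W_s = smooth(X, W)
```

In this toy case `matmul(X_s, W_s)` matches `matmul(X, W)` up to floating-point rounding, while the largest activation magnitude drops from 60 to roughly 5. The weights absorb the range instead, which is the trade the technique makes.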
Group size controls metadata and error. Smaller groups adapt scales more precisely but require more scale loads and may reduce kernel efficiency. Layout, packing order, tile size, and supported matrix shapes determine whether the GPU reaches useful throughput.
A worked example
A BF16 matrix stores two bytes per weight; an INT4 matrix stores half a byte per weight plus per-group scales. A bandwidth-bound decode GEMM can therefore read roughly one quarter of the weight bytes. But if the runtime expands INT4 to BF16 in a separate kernel, it writes and rereads a full-size intermediate and may lose most of the benefit.
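The arithmetic can be written out for one layer. The 4096 x 4096 shape, the group size of 128, and the FP16 scales are assumptions chosen for illustration, and the unfused case is modeled simply as one read of the packed weights, one write of the BF16 intermediate, and one read of it by the GEMM.

```python
def weight_bytes(rows, cols, bits_per_weight, group_size=None, scale_bits=16):
    """Bytes needed to read one weight matrix, including per-group scales."""
    n = rows * cols
    total_bits = n * bits_per_weight
    if group_size:
        total_bits += (n // group_size) * scale_bits
    return total_bits / 8


rows = cols = 4096  # hypothetical layer shape

bf16_traffic = weight_bytes(rows, cols, 16)
int4_fused_traffic = weight_bytes(rows, cols, 4, group_size=128)

# Separate dequantization kernel: read INT4, write BF16, then read BF16 again.
int4_unfused_traffic = int4_fused_traffic + 2 * bf16_traffic

print(f"BF16:          {bf16_traffic / 2**20:6.1f} MiB")
print(f"INT4 fused:    {int4_fused_traffic / 2**20:6.1f} MiB")
print(f"INT4 unfused:  {int4_unfused_traffic / 2**20:6.1f} MiB")
```

Under these assumptions the fused path moves about 26% of the BF16 bytes, while the unfused path moves more than twice the BF16 bytes, which is why a separate expansion kernel can erase the gain.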
The performance model
Quantization helps most when weight movement dominates, especially low-batch decode. At large prefill batches, the operation may become compute-bound and the relative gain can shrink. Kernel choice must be evaluated separately for prefill and decode.
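A simple roofline estimate shows how the gain shrinks as the batch grows. Everything numeric here is an assumption: the peak throughput and memory bandwidth are round hypothetical figures, the layer is 4096 x 4096, activations are BF16, and the INT4 kernel is assumed to be fused and to compute at the same rate as the BF16 one.

```python
def gemm_time(m, k, n, weight_bytes_per_elem, peak_flops, mem_bw):
    """Roofline time for an (m x k) @ (k x n) GEMM: the slower of compute and memory."""
    flops = 2 * m * k * n
    activation_bytes = 2 * (m * k + m * n)  # BF16 input and output
    bytes_moved = k * n * weight_bytes_per_elem + activation_bytes
    return max(flops / peak_flops, bytes_moved / mem_bw)


def int4_speedup(m, k=4096, n=4096, peak_flops=300e12, mem_bw=2e12):
    """Estimated BF16 time divided by INT4 time for m tokens in flight."""
    bf16 = gemm_time(m, k, n, 2.0, peak_flops, mem_bw)
    # 0.5 byte per weight plus one 2-byte scale per group of 128 weights.
    int4 = gemm_time(m, k, n, 0.5 + 2 / 128, peak_flops, mem_bw)
    return bf16 / int4


for m in (1, 16, 64, 256, 4096):
    print(f"tokens={m:5d}  estimated speedup={int4_speedup(m):.2f}x")
```

With these inputs the estimate is close to 4x at one token and falls to 1x once both variants are compute-bound. Real kernels add dequantization cost that this model ignores, so measured curves should be expected to sit below it.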
Expert lens
Quality and speed are coupled through layout. An accurate quantizer whose format lacks a production kernel may be less useful than a slightly different scheme with fused, hardware-tuned support. Deployment format is part of model selection.
Where it wins
- Low-batch memory-bound decode
- Models limited by weight capacity
- Hardware and shapes with mature low-bit kernels
Where it disappoints
- Equating model size reduction with latency reduction
- Ignoring group-scale and zero-point overhead
- Using calibration data unlike production traffic
- Falling back silently to unfused dequantization
Production checklist
- Confirm the exact kernel selected at runtime
- Match quantization layout to serving engine
- Evaluate prefill and decode separately
- Validate long-context and domain-specific quality
- Profile dequantization and cast kernels
What to measure
- Effective weight bandwidth
- Fused versus fallback kernel count
- Tokens per second by batch size
- Quantization metadata bytes
- Quality delta by task
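The first metric in this list, effective weight bandwidth, can be estimated from two numbers most serving stacks already have. The sketch assumes a dense model in which every packed weight byte is read once per decode step; the function names and the example figures are hypothetical.

```python
def effective_weight_bandwidth(packed_weight_bytes, step_seconds):
    """Bytes of resident packed weights streamed per second of decode time."""
    return packed_weight_bytes / step_seconds


def bandwidth_utilization(packed_weight_bytes, step_seconds, peak_bytes_per_second):
    """Fraction of the device's peak memory bandwidth spent on weights."""
    return effective_weight_bandwidth(packed_weight_bytes, step_seconds) / peak_bytes_per_second


# Hypothetical: 3.6 GB of packed weights, 2.4 ms per decode step, 2 TB/s peak.
utilization = bandwidth_utilization(3.6e9, 2.4e-3, 2e12)
print(f"weight bandwidth utilization: {utilization:.0%}")
```

A low value at small batch sizes is a hint that the step is dominated by something other than weight streaming, such as a fallback dequantization kernel or launch overhead.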
From one GPU to a production service
A local benchmark often loads one quantized format into its native runtime. A production platform receives checkpoints from many quantizers. Standardize a small set of serving formats or convert offline into an engine-owned canonical layout.
Model onboarding should fail fast when a required kernel, group size, or shape is unsupported. Silent fallback is dangerous because the compressed checkpoint may no longer fit once expanded into a temporary or resident high-precision layout.
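A fail-fast onboarding check might look like the sketch below. The kernel table, the tile size, and all names are hypothetical; a real engine would build the table from the kernels it actually ships for the target GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantSpec:
    fmt: str         # e.g. "awq" or "gptq"
    bits: int
    group_size: int


# Hypothetical table: (format, bits) -> group sizes with a fused kernel.
SUPPORTED_KERNELS = {
    ("awq", 4): {64, 128},
    ("gptq", 4): {128},
}


class UnsupportedQuantization(ValueError):
    """Raised at onboarding so the model never reaches a silent fallback."""


def validate_checkpoint(spec, layer_shapes, tile_k=64):
    """layer_shapes maps layer name -> (in_features, out_features)."""
    group_sizes = SUPPORTED_KERNELS.get((spec.fmt, spec.bits))
    if group_sizes is None:
        raise UnsupportedQuantization(f"no fused kernel for {spec.fmt} {spec.bits}-bit")
    if spec.group_size not in group_sizes:
        raise UnsupportedQuantization(f"group size {spec.group_size} is not supported")
    for name, (k, _n) in layer_shapes.items():
        if k % spec.group_size or k % tile_k:
            raise UnsupportedQuantization(f"{name}: in_features={k} would need a fallback")
```

Raising an exception rather than logging a warning is the point of the design: the checkpoint is rejected before any memory is allocated for an expanded layout.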
Multi-GPU serving adds shard alignment. Quantization groups should divide cleanly across tensor-parallel partitions, and collective outputs must use a precision that preserves the intended error budget.
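The alignment condition is a short divisibility check. The sketch assumes quantization groups run along the input dimension and that the layer is split along that same dimension across tensor-parallel ranks; the function name and the example dimension are illustrative.

```python
def groups_align_with_shards(in_features, tp_degree, group_size):
    """True when every tensor-parallel shard holds a whole number of groups."""
    if in_features % tp_degree:
        return False
    return (in_features // tp_degree) % group_size == 0


# Illustrative input dimension of 11008 with groups of 128 weights.
for tp in (1, 2, 4, 8):
    ok = groups_align_with_shards(11008, tp, 128)
    print(f"TP={tp}: {'aligned' if ok else 'groups straddle a shard boundary'}")
```

In this example the layout is aligned at TP=1 and TP=2 but not at TP=4 or TP=8, because 11008 / 4 = 2752 is not a multiple of 128. A smaller group size such as 64 restores alignment at TP=4, at the cost of more scale metadata.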
Design-review questions
- Is dequantization fused into the matrix kernel?
- What packed layout is resident in HBM?
- Do quantization groups align with TP shards?
- Which shapes trigger a fallback implementation?
- Is speedup measured at the same quality and batch distribution?
How it connects to the rest of the series
Mixed precision defines the larger numerical recipe. Graph optimization can fuse Q/DQ boundaries. Tensor parallelism changes quantized shard shapes and communication volume.
From equation to implementation
A weight-only INT4 GEMM typically loads packed nibbles and a scale per group, converts fragments in registers or uses a native low-bit path, and accumulates into FP16 or FP32. Arithmetic intensity rises because weight bytes fall while the FLOP count and activation bytes stay the same.
Quantization layout is kernel ABI. Group order, interleaving, signed representation, scale dtype, and tile arrangement must match the runtime. Converting between two 4-bit layouts at startup may be acceptable; converting per request is not.
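Even the simplest packing decision, which nibble of a byte holds which weight, is part of that ABI. The sketch below uses one assumed convention (even-indexed code in the low nibble, unsigned codes); it is not the layout of any particular runtime.

```python
def pack_nibbles(codes):
    """Pack unsigned 4-bit codes two per byte, even index in the low nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes((hi << 4) | lo for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles under the same low-nibble-first convention."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


codes = [1, 2, 3, 15]
packed = pack_nibbles(codes)
print(packed.hex())            # two bytes holding four weights
print(unpack_nibbles(packed))  # round-trips to the original codes
```

A reader that assumes high-nibble-first order would decode the same two bytes as `[2, 1, 15, 3]`: no error is raised, the weights are simply wrong. Production layouts add interleaving and tile permutations on top of this, which is why conversion belongs offline or at startup.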
Implementation sketch
offline:
    scales = calibrate_per_group(weights, activations)
    packed = quantize_and_pack(weights, scales, kernel_layout)
runtime:
    for tile in packed:
        load_packed_weight_fragment(tile)
        load_group_scales(tile)
        dequantize_in_registers()
        mma_accumulate(activation_fragment)
    write_high_precision_output()
Capacity planning
Include scale overhead: INT4 with one FP16 scale per group of 128 weights adds about 0.125 bits per weight, before zero points and alignment. Small groups improve fidelity but increase metadata and scale traffic.
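The overhead is easy to tabulate per group size. The helper names below are hypothetical, the 7-billion-parameter count is an illustrative assumption, and alignment padding is ignored.

```python
def bits_per_weight(weight_bits=4, group_size=128, scale_bits=16, zero_point_bits=0):
    """Effective storage per weight once per-group metadata is amortized."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def model_bytes(num_params, **kwargs):
    """Resident weight bytes for num_params quantized weights."""
    return num_params * bits_per_weight(**kwargs) / 8


for group_size in (32, 64, 128):
    bpw = bits_per_weight(group_size=group_size)
    gib = model_bytes(7e9, group_size=group_size) / 2**30
    print(f"group={group_size:3d}  {bpw:.3f} bits/weight  {gib:.2f} GiB for 7e9 params")
```

With FP16 scales and no zero points this gives 4.125 bits per weight at a group size of 128 and 4.5 bits per weight at 32. Adding a 4-bit zero point per group of 128 raises the figure to 4.15625.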
Benchmarking without fooling yourself
- Measure cold startup conversion separately from steady state.
- Sweep batch and sequence shapes because bottlenecks move.
- Profile packed-weight bandwidth and fallback kernels.
- Compare quality on calibration-adjacent and out-of-domain tasks.
A production failure to design for
A deployment accepts an AWQ checkpoint but the selected GPU lacks the expected fused kernel. The engine silently dequantizes into a temporary BF16 buffer. Memory spikes and latency exceeds the original BF16 model. Fail validation when the required kernel path is unavailable.
The takeaway
Quantization is only an acceleration technique when the data format, kernel, hardware, and workload agree.
