Quantized Kernels: Why a 4-Bit Model Is Not Automatically Fast

#quantization #kernels #awq #int4 #llm-inference

A checkpoint can be four bits on disk and still run slowly. Compression changes storage. A quantized kernel determines whether the GPU can multiply packed values, apply scales, accumulate safely, and avoid unpacking the advantage away.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Vacuum-packing clothes makes a suitcase smaller. It does not make getting dressed faster if every shirt must be unpacked, ironed, and repacked before use. Quantized kernels are the machinery that uses compressed values efficiently.

Follow the state and work from left to right.

What actually happens

Weight-only kernels keep activations in FP16 or BF16 and store weights as INT4 or INT8. They load packed weights, apply per-group scales and zero points, and multiply without materializing a full high-precision weight matrix in HBM.

Weight-and-activation schemes such as W8A8 can use native integer matrix instructions, but activation outliers make calibration harder. SmoothQuant moves difficulty from activations into weights through an equivalent scaling transformation.

Group size controls metadata and error. Smaller groups adapt scales more precisely but require more scale loads and may reduce kernel efficiency. Layout, packing order, tile size, and supported matrix shapes determine whether the GPU reaches useful throughput.

A worked example

A BF16 matrix contains two bytes per weight; INT4 contains half a byte plus scales. A bandwidth-bound decode GEMM can read roughly one quarter the weight bytes. But if the runtime expands INT4 to BF16 in a separate kernel, it writes and rereads a large intermediate and may lose most of the benefit.

The performance model

Quantization helps most when weight movement dominates, especially low-batch decode. At large prefill batches, the operation may become compute-bound and the relative gain can shrink. Kernel choice must be evaluated separately for prefill and decode.

Expert lens

Quality and speed are coupled through layout. An accurate quantizer whose format lacks a production kernel may be less useful than a slightly different scheme with fused, hardware-tuned support. Deployment format is part of model selection.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

Low-batch memory-bound decode
Models limited by weight capacity
Hardware and shapes with mature low-bit kernels

Where it disappoints

Equating model size reduction with latency reduction
Ignoring group-scale and zero-point overhead
Using calibration data unlike production traffic
Falling back silently to unfused dequantization

Production checklist

Confirm the exact kernel selected at runtime
Match quantization layout to serving engine
Evaluate prefill and decode separately
Validate long-context and domain-specific quality
Profile dequantization and cast kernels

What to measure

Effective weight bandwidth
Fused versus fallback kernel count
Tokens per second by batch size
Quantization metadata bytes
Quality delta by task

From one GPU to a production service

A local benchmark often loads one quantized format into its native runtime. A production platform receives checkpoints from many quantizers. Standardize a small set of serving formats or convert offline into an engine-owned canonical layout.

Model onboarding should fail fast when a required kernel, group size, or shape is unsupported. Silent fallback is dangerous because the compressed checkpoint may no longer fit once expanded into a temporary or resident high-precision layout.

Multi-GPU serving adds shard alignment. Quantization groups should divide cleanly across tensor-parallel partitions, and collective outputs must use a precision that preserves the intended error budget.

Design-review questions

Is dequantization fused into the matrix kernel?
What packed layout is resident in HBM?
Do quantization groups align with TP shards?
Which shapes trigger a fallback implementation?
Is speedup measured at the same quality and batch distribution?

How it connects to the rest of the series

Mixed precision defines the larger numerical recipe. Graph optimization can fuse Q/DQ boundaries. Tensor parallelism changes quantized shard shapes and communication volume.

From equation to implementation

A weight-only INT4 GEMM typically loads packed nibbles and a scale per group, converts fragments in registers or uses a native low-bit path, and accumulates into FP16 or FP32. The arithmetic intensity changes because weight bytes fall while activation bytes remain.

Quantization layout is kernel ABI. Group order, interleaving, signed representation, scale dtype, and tile arrangement must match the runtime. Converting between two 4-bit layouts at startup may be acceptable; converting per request is not.

Implementation sketch

offline:
    scales = calibrate_per_group(weights, activations)
    packed = quantize_and_pack(weights, scales, kernel_layout)
runtime:
    for tile in packed:
        load_packed_weight_fragment(tile)
        load_group_scales(tile)
        dequantize_in_registers()
        mma_accumulate(activation_fragment)
    write_high_precision_output()

Capacity planning

Include scale overhead: INT4 with one FP16 scale per group of 128 weights adds about 0.125 bits per weight, before zero points and alignment. Small groups improve fidelity but increase metadata and scale traffic.

Benchmarking without fooling yourself

Measure cold startup conversion separately from steady state.
Sweep batch and sequence shapes because bottlenecks move.
Profile packed-weight bandwidth and fallback kernels.
Compare quality on calibration-adjacent and out-of-domain tasks.

A production failure to design for

A deployment accepts an AWQ checkpoint but the selected GPU lacks the expected fused kernel. The engine silently dequantizes into a temporary BF16 buffer. Memory spikes and latency exceeds the original BF16 model. Fail validation when the required kernel path is unavailable.

Treat optimization as a measured loop, not a one-time flag.

Primary references

The takeaway

Quantization is only an acceleration technique when the data format, kernel, hardware, and workload agree.

Mixed Precision Inference: Spend Bits Where They Matter Tensor Parallelism: Splitting One Layer Across Many GPUs