Tensor Parallelism: Splitting One Layer Across Many GPUs

#tensor-parallelism #multi-gpu #megatron #nccl #llm-inference

When one transformer layer is too large for one GPU, tensor parallelism cuts the layer itself into pieces. Every token then crosses multiple GPUs during the same layer, turning fast matrix multiplication into a choreography of compute and collective communication.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

Imagine several chefs preparing one enormous sandwich. Each owns part of every layer, so they must exchange ingredients before the next layer can begin. More chefs add capacity, but the handoffs can become the meal.

What actually happens

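In a transformer MLP, a column-parallel linear layer splits the weight matrix by output features, so each rank computes one slice of the output from the full input. The following row-parallel layer splits its weight matrix by input features, so each rank consumes the slice it already holds and produces a partial output of full width. An all-reduce sums those partial outputs across ranks. Attention projections use a related pattern, typically assigning whole heads to ranks.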

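Tensor parallelism reduces per-GPU weight storage and the activations inside sharded regions, but it introduces communication inside nearly every transformer layer. Activations outside those regions stay replicated on every rank unless sequence parallelism is added. Because of that communication, tensor parallelism prefers high-bandwidth, low-latency links such as NVLink or NVSwitch and carefully formed process groups.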

During autoregressive decode, each step processes only one new token per sequence, so the activation matrices have few rows and the local GEMMs are short. Collective latency becomes visible because it repeats for every layer and every token. A tensor-parallel degree chosen only to make the model fit may not be the degree that maximizes tokens per second.

A worked example

Split a 16,384-wide projection across four GPUs. Each rank stores and computes one quarter of the columns, 4,096 each. Before a dependent operation needs the complete result, ranks gather the slices or reduce partial activations. That saves weight memory per GPU but adds a collective on every token path.
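
The arithmetic behind this example can be sketched in a few lines. The article gives only the 16,384 width and the four GPUs; the square weight shape and the 2-byte parameter size are assumptions made for illustration, and column_shard_plan is a hypothetical helper name.

```python
def column_shard_plan(out_features, in_features, tp_degree, bytes_per_param=2):
    """Per-rank column count and weight size for a column-parallel split."""
    if out_features % tp_degree != 0:
        raise ValueError("out_features must divide evenly across ranks")
    full_bytes = out_features * in_features * bytes_per_param
    return {
        "cols_per_rank": out_features // tp_degree,
        "full_weight_mib": full_bytes / 2**20,
        "per_rank_weight_mib": full_bytes / tp_degree / 2**20,
    }


# Assumed shape: a square 16,384 x 16,384 weight stored in 2-byte parameters.
plan = column_shard_plan(out_features=16_384, in_features=16_384, tp_degree=4)
print(plan)
```

Under those assumptions the full weight is 512 MiB and each rank holds 128 MiB. The saving is per GPU only; the total across the group is unchanged.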

The performance model

A simplified layer time is the maximum of local compute time and collective time when the two fully overlap, and their sum when they run back to back. Collective time is a fixed latency term plus bytes divided by effective link bandwidth. As local shards shrink, the fixed latency occupies a larger fraction of the total.
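
A minimal sketch of this model follows. The function names and the latency and bandwidth figures are placeholders chosen for illustration, not measurements of any real link.

```python
def collective_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fixed latency term plus bytes divided by effective link bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def layer_time(compute_s, collective_s, overlapped):
    """max() when communication hides behind compute, sum when serialized."""
    if overlapped:
        return max(compute_s, collective_s)
    return compute_s + collective_s


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of the collective time spent on the fixed latency term."""
    total = collective_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Placeholder link: 10 microseconds of latency, 100 GB/s effective bandwidth.
for message_bytes in (1_000_000, 100_000, 10_000):
    frac = latency_fraction(message_bytes, 10e-6, 100e9)
    print(f"{message_bytes:>9} bytes: latency is {frac:.0%} of collective time")
```

Shrinking the message leaves the latency term untouched, which is why the fraction climbs as shards get smaller.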

Expert lens

Topology-aware placement matters more than the TP number alone. TP across GPUs behind one NVSwitch is different from TP crossing PCIe hosts or a slower fabric. Keep latency-sensitive TP groups local and scale replicas across nodes when possible.

Tensor parallelism does not remove cost. It moves cost from per-GPU memory and compute into interconnect bandwidth and time spent waiting on collectives.

Where it wins

Layers or models that do not fit one accelerator
Large hidden dimensions with fast interconnect
Deployments where fewer larger replicas meet demand

Where it disappoints

Stretching TP across slow network links
Assuming twice the GPUs means twice the speed
Ignoring small-batch collective latency
Using shard shapes that disable optimized kernels

Production checklist

Map TP groups to the fastest local topology
Benchmark TP degrees with production batch sizes
Inspect collective overlap and NCCL traces
Confirm divisible hidden and head dimensions
Compare scale-up TP with scale-out replicas
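
The divisibility item in this checklist is easy to automate before a deployment starts. The sketch below is one hypothetical way to do it; the parameter names are illustrative and not tied to any framework's configuration schema, and some frameworks handle uneven splits by padding or replication rather than rejecting them.

```python
def tp_divisibility_problems(tp_degree, hidden_size, num_attention_heads,
                             ffn_size=None):
    """Return human-readable problems; an empty list means even splits."""
    checks = {
        "hidden_size": hidden_size,
        "num_attention_heads": num_attention_heads,
    }
    if ffn_size is not None:
        checks["ffn_size"] = ffn_size
    return [
        f"{name}={value} is not divisible by tp_degree={tp_degree}"
        for name, value in checks.items()
        if value % tp_degree != 0
    ]


# Illustrative shapes only.
print(tp_divisibility_problems(8, hidden_size=8192, num_attention_heads=64,
                               ffn_size=28672))
print(tp_divisibility_problems(3, hidden_size=8192, num_attention_heads=64))
```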

What to measure

Collective time per layer and token
Link bandwidth and congestion
Per-rank compute imbalance
Tokens per second per GPU
TTFT and TPOT by TP degree

From one GPU to a production service

A workstation test sees one fixed set of GPUs. Kubernetes or a fleet scheduler must preserve the topology every time a replica starts. Device selection, rank order, NCCL configuration, shared memory, and health checks become part of the model deployment specification.

Replica sizing should compare scale-up and scale-out. TP8 may fit the model but deliver fewer tokens per second per GPU than two TP4 replicas on the same eight GPUs. The right answer depends on memory, queueing, batch efficiency, and whether one replica can serve the largest request.
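
The comparison reduces to normalizing measured throughput by GPU count. The sketch below uses made-up throughput numbers purely to show the bookkeeping; real values must come from benchmarks on the target hardware, and summarize is a hypothetical helper.

```python
def summarize(name, tp_degree, replicas, tokens_per_s_per_replica):
    """Aggregate and per-GPU throughput for one deployment option."""
    gpus = tp_degree * replicas
    aggregate = replicas * tokens_per_s_per_replica
    return {
        "name": name,
        "gpus": gpus,
        "aggregate_tokens_per_s": aggregate,
        "tokens_per_s_per_gpu": aggregate / gpus,
    }


# Placeholder measurements for two ways of using the same eight GPUs.
options = [
    summarize("1 x TP8", tp_degree=8, replicas=1, tokens_per_s_per_replica=2400.0),
    summarize("2 x TP4", tp_degree=4, replicas=2, tokens_per_s_per_replica=1500.0),
]
best = max(options, key=lambda o: o["tokens_per_s_per_gpu"])
for o in options:
    print(o)
print("best per GPU:", best["name"])
```

Per-GPU throughput is only one axis. A single wider replica can still be the right choice when the largest request does not fit in the narrower one.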

Failure domains are wider than one pod. If any rank fails, the TP replica usually fails. Readiness should be collective, draining should stop new admission across all ranks, and restart policy should avoid partial zombie groups.

Design-review questions

What is the smallest TP degree that fits the worst request?
Are all ranks inside the intended fabric domain?
How does throughput per GPU change with TP degree?
What happens to in-flight requests when one rank fails?
Would more replicas beat a wider TP group?

How it connects to the rest of the series

Pipeline parallelism splits depth instead of tensors. Sequence parallelism reduces replicated activations around TP. Expert parallelism distributes MoE experts and adds all-to-all traffic.

From equation to implementation

Column parallelism computes Y_i = X A_i for column shards A_i, leaving output features sharded. Row parallelism consumes input shards X_i and computes partial X_i B_i, then sums across ranks. Alternating these patterns reduces unnecessary gathers inside a transformer block.
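
The equations can be checked numerically with tiny matrices. This is a single-process sketch in pure Python: the "ranks" are list entries, and all_reduce_sum is a stand-in for a real collective, not a call into any communication library. The nonlinearity between the two layers is omitted to keep the check exact.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of per-rank matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


X = [[1, 2], [3, 4]]                  # 2 tokens, hidden size 2
A = [[1, 0, 2, 1], [0, 1, 1, 2]]      # 2 x 4, sharded by columns
B = [[1, 2], [0, 1], [2, 0], [1, 1]]  # 4 x 2, sharded by rows
tp = 2

A_shards = split_columns(A, tp)
B_shards = split_rows(B, tp)

# Column parallel: Y_i = X A_i, output features stay sharded (no gather).
Y_shards = [matmul(X, A_i) for A_i in A_shards]
# Row parallel: each rank multiplies its own slice, then partials are summed.
partials = [matmul(Y_i, B_i) for Y_i, B_i in zip(Y_shards, B_shards)]
Z = all_reduce_sum(partials)

assert Z == matmul(matmul(X, A), B)
```

No communication is needed between the two layers because the column shards of the first output line up with the row shards of the second weight. The only collective is the final sum.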

Collective cost follows topology and message size. In a ring all-reduce over N ranks, each rank sends roughly 2(N-1)/N times the tensor bytes, spread over 2(N-1) steps that each pay a latency cost. Tree or NVSwitch algorithms differ. The critical point is that decode repeats these collectives for every layer and token.
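
The ring formula and the repetition count are simple enough to compute directly. In the sketch below, the two all-reduces per layer follow the attention and MLP pattern used in this article; the hidden size, layer count, and 2-byte activations are illustrative assumptions.

```python
def ring_all_reduce_bytes_per_rank(tensor_bytes, n_ranks):
    """Approximate bytes each rank sends in a ring all-reduce."""
    return 2 * (n_ranks - 1) / n_ranks * tensor_bytes


def decode_collectives_per_token(num_layers, all_reduces_per_layer=2):
    """All-reduces on the critical path of one decode step."""
    return num_layers * all_reduces_per_layer


# Assumed workload: batch of 1, hidden size 8192, 2-byte activations, 80 layers.
tensor_bytes = 1 * 8192 * 2
sent = ring_all_reduce_bytes_per_rank(tensor_bytes, n_ranks=4)
count = decode_collectives_per_token(num_layers=80)
print(f"{sent:.0f} bytes sent per rank per all-reduce, {count} all-reduces per token")
```

With messages this small the byte count is rarely the problem. The number of latency-bound collectives per token is.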

Implementation sketch

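# Pseudocode: one forward pass; layer norms and residual adds omitted
tp_group = initialize_tp_group(ranks_on_same_nvlink_domain)
for layer in transformer_layers:
    # Attention: projections are sharded, each rank attends over its own shard
    qkv_shards = column_parallel_linear(hidden)
    attention_shard = local_attention(qkv_shards)
    hidden = row_parallel_linear_and_all_reduce(attention_shard, tp_group)
    # MLP: the activation is elementwise, so it runs on each shard locally
    mlp_shards = activation(column_parallel_linear(hidden))
    hidden = row_parallel_linear_and_all_reduce(mlp_shards, tp_group)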

Capacity planning

Choose TP primarily to fit weights, KV shards, and workspace, then test the next smaller and larger degrees. More TP reduces bytes per rank but may shrink GEMMs below efficient tile sizes and add ranks to every collective.
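
A first-pass fit check can be sketched as below. It assumes weights and KV cache shard evenly across ranks while workspace is a fixed per-GPU cost; both are simplifications, and every number is a placeholder rather than a figure for any real model or GPU.

```python
def per_gpu_gib(weight_gib, kv_cache_gib, workspace_gib, tp_degree):
    """Estimated memory per GPU under an even-sharding assumption."""
    return (weight_gib + kv_cache_gib) / tp_degree + workspace_gib


def smallest_fitting_tp(weight_gib, kv_cache_gib, workspace_gib, gpu_gib,
                        candidates=(1, 2, 4, 8)):
    """Smallest candidate TP degree whose estimate fits, or None."""
    for tp in sorted(candidates):
        if per_gpu_gib(weight_gib, kv_cache_gib, workspace_gib, tp) <= gpu_gib:
            return tp
    return None


# Placeholder sizing: 140 GiB weights, 40 GiB KV cache, 6 GiB workspace, 80 GiB GPUs.
tp = smallest_fitting_tp(140, 40, 6, gpu_gib=80)
print("smallest fitting TP degree:", tp)
```

The result is a starting point for benchmarking, not a decision. The neighboring degrees still need to be measured at production batch sizes.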

Benchmarking without fooling yourself

Pin rank placement and record the exact topology.
Test batch-one decode and realistic continuous batches.
Capture NCCL traces and per-rank kernel gaps.
Report tokens per second per GPU, not aggregate alone.

A production failure to design for

A Kubernetes reschedule places half of a TP group across a slower inter-node link. The replica still passes its health checks, but TPOT triples and collective time dominates. Enforce topology constraints and expose TP-group locality in readiness checks.
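
One way to express such a readiness check is sketched below. It treats "local" as "all ranks on the same node", which is an assumption; a fabric domain can be defined differently. How each rank reports its health and node name is left out, and tp_replica_ready is a hypothetical helper.

```python
def tp_replica_ready(rank_healthy, rank_to_node, expected_ranks):
    """Return (ready, reason) for a whole TP replica.

    rank_healthy: dict mapping rank -> bool, as reported by each rank.
    rank_to_node: dict mapping rank -> node name; must cover reporting ranks.
    """
    missing = [r for r in range(expected_ranks) if r not in rank_healthy]
    if missing:
        return False, f"ranks not reporting: {missing}"
    unhealthy = sorted(r for r, ok in rank_healthy.items() if not ok)
    if unhealthy:
        return False, f"unhealthy ranks: {unhealthy}"
    nodes = {rank_to_node[r] for r in range(expected_ranks)}
    if len(nodes) > 1:
        return False, f"TP group spans nodes: {sorted(nodes)}"
    return True, "ok"


# Hypothetical placement after a bad reschedule: ranks 2 and 3 moved to node-b.
health = {0: True, 1: True, 2: True, 3: True}
placement = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"}
print(tp_replica_ready(health, placement, expected_ranks=4))
```

Every rank is individually healthy in this example, yet the replica reports not ready. That is the behavior the failure scenario calls for.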

Treat the TP degree as a measured choice, re-benchmarked whenever the model, batch mix, or topology changes, not a one-time flag.

The takeaway

Tensor parallelism buys per-GPU memory and compute capacity and pays for them in communication. The winning configuration is the smallest TP degree that both fits and performs well on the actual topology.
