Tensor Parallelism: Splitting One Layer Across Many GPUs
When one transformer layer is too large for one GPU, tensor parallelism cuts the layer itself into pieces. Every token then crosses multiple GPUs during the same layer, turning fast matrix multiplication into a choreography of compute and collective communication.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
Imagine several chefs preparing one enormous sandwich. Each owns part of every layer, so they must exchange ingredients before the next layer can begin. More chefs add capacity, but the handoffs can become the meal.
What actually happens
In a transformer MLP, a column-parallel linear layer splits output features across ranks. Each rank computes a slice. A following row-parallel layer consumes corresponding input slices and combines partial outputs with an all-reduce. Attention projections use related partitioning patterns.
Tensor parallelism reduces per-GPU weight storage and the activations inside the sharded layers, but it introduces communication inside nearly every transformer layer. It therefore favors high-bandwidth, low-latency links such as NVLink or NVSwitch, with process groups laid out to match that topology.
During autoregressive decode, each step processes only one new token per sequence, so the activation matrices have few rows and the GEMMs are small. Collective latency becomes visible because it repeats for every layer and every generated token. A tensor-parallel degree chosen only to make the model fit may not be the degree that maximizes tokens per second.
A worked example
Split a 16,384-wide projection across four GPUs. Each rank stores and computes one quarter of the output columns, 4,096 each. Before a dependent operation needs the complete result, ranks exchange or reduce partial activations. That saves weight memory per GPU but adds a collective to every token's path through the layer.
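The arithmetic behind this example can be sketched in a few lines. The sketch assumes a square 16,384 x 16,384 weight stored at 2 bytes per parameter; the article fixes only the width and the four-way split, so the input dimension, the dtype, and the helper name `shard_projection` are illustrative assumptions.

```python
def shard_projection(in_features, out_features, tp_degree, bytes_per_param=2):
    """Return (columns per rank, weight bytes per rank) for a column split."""
    if out_features % tp_degree != 0:
        raise ValueError("out_features must be divisible by tp_degree")
    cols_per_rank = out_features // tp_degree
    shard_bytes = in_features * cols_per_rank * bytes_per_param
    return cols_per_rank, shard_bytes


# Assumed shape: square projection, 2-byte parameters (for example bf16).
cols_per_rank, shard_bytes = shard_projection(16384, 16384, tp_degree=4)
full_bytes = 16384 * 16384 * 2

print(f"columns per rank: {cols_per_rank}")
print(f"weight per rank:  {shard_bytes / 2**20:.0f} MiB "
      f"(full matrix: {full_bytes / 2**20:.0f} MiB)")
```

Under these assumptions each rank holds a quarter of the weight bytes. The activation that must be exchanged afterward does not appear in this calculation, which is why weight savings alone do not predict speed.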
The performance model
A simplified layer time is the sum of local compute and collective time when the two run back to back, and approaches the larger of the two when communication fully overlaps with compute. Collective time is a fixed latency term plus bytes divided by effective link bandwidth. As local shards shrink, the fixed latency occupies a larger fraction of the total.
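A minimal sketch of this model follows. The linear blend between the serial and fully overlapped cases is a modeling assumption made here for illustration, and the link parameters are placeholders rather than measurements of any real interconnect.

```python
def collective_time_s(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency term plus transfer term for one collective."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def layer_time_s(compute_s, collective_s, overlap):
    """Blend between serial (overlap=0.0) and fully overlapped (overlap=1.0)."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0.0 and 1.0")
    serial = compute_s + collective_s
    fully_overlapped = max(compute_s, collective_s)
    return (1.0 - overlap) * serial + overlap * fully_overlapped


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of one collective's time spent on fixed latency."""
    total = collective_time_s(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Placeholder link parameters, not measurements: 10 us latency, 100 GB/s.
LATENCY_S = 10e-6
BANDWIDTH = 100e9

for message_bytes in (10_000_000, 1_000_000, 100_000):
    share = latency_fraction(message_bytes, LATENCY_S, BANDWIDTH)
    print(f"{message_bytes:>10} bytes: latency is {share:.0%} of collective time")
```

Shrinking the message leaves the latency term untouched, so its share of the collective grows. That is the effect to watch for when a higher TP degree makes each shard smaller.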
Expert lens
Topology-aware placement matters more than the TP number alone. TP across GPUs behind one NVSwitch is different from TP crossing PCIe hosts or a slower fabric. Keep latency-sensitive TP groups local and scale replicas across nodes when possible.
Where it wins
- Layers or models that do not fit one accelerator
- Large hidden dimensions with fast interconnect
- Deployments where fewer larger replicas meet demand
Where it disappoints
- Stretching TP across slow network links
- Assuming twice the GPUs means twice the speed
- Ignoring small-batch collective latency
- Using shard shapes that disable optimized kernels
Production checklist
- Map TP groups to the fastest local topology
- Benchmark TP degrees with production batch sizes
- Inspect collective overlap and NCCL traces
- Confirm divisible hidden and head dimensions
- Compare scale-up TP with scale-out replicas
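The divisibility item in the checklist above is easy to automate before a deployment is scheduled. The following is a rough sketch: the function name and the example model dimensions are hypothetical, and real frameworks may apply additional rules (for instance around key/value heads) that this check does not cover.

```python
def check_tp_divisibility(hidden_size, num_attention_heads, ffn_size, tp_degree):
    """Return a list of human-readable problems; empty means the split is even."""
    problems = []
    dimensions = (
        ("hidden_size", hidden_size),
        ("num_attention_heads", num_attention_heads),
        ("ffn_size", ffn_size),
    )
    for name, value in dimensions:
        if value % tp_degree != 0:
            problems.append(
                f"{name}={value} is not divisible by tp_degree={tp_degree}"
            )
    return problems


# Hypothetical model dimensions used only to exercise the check.
ok = check_tp_divisibility(8192, 64, 28672, tp_degree=8)
bad = check_tp_divisibility(8192, 64, 28672, tp_degree=6)

print("tp=8:", ok or "all dimensions divide evenly")
for problem in bad:
    print("tp=6:", problem)
```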
What to measure
- Collective time per layer and token
- Link bandwidth and congestion
- Per-rank compute imbalance
- Tokens per second per GPU
- TTFT and TPOT by TP degree
From one GPU to a production service
A workstation test sees one fixed set of GPUs. Kubernetes or a fleet scheduler must preserve the topology every time a replica starts. Device selection, rank order, NCCL configuration, shared memory, and health checks become part of the model deployment specification.
Replica sizing should compare scale-up and scale-out. TP8 may fit the model but deliver fewer tokens per second per GPU than two TP4 replicas. The right answer depends on memory, queueing, batch efficiency, and whether a single replica can serve the largest request.
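One way to frame the comparison is to normalize every layout to the same GPU budget. In the sketch below the throughput figures are placeholders for illustration only, not benchmark results, and the function names are hypothetical; substitute numbers measured on the target hardware.

```python
def tokens_per_s_per_gpu(replica_tokens_per_s, tp_degree):
    """Throughput of one replica divided by the GPUs it occupies."""
    return replica_tokens_per_s / tp_degree


def compare_layouts(total_gpus, measured):
    """measured maps tp_degree -> tokens/s of ONE replica at that degree."""
    rows = []
    for tp_degree, replica_tps in sorted(measured.items()):
        replicas = total_gpus // tp_degree
        rows.append({
            "tp_degree": tp_degree,
            "replicas": replicas,
            "aggregate_tokens_per_s": replicas * replica_tps,
            "tokens_per_s_per_gpu": tokens_per_s_per_gpu(replica_tps, tp_degree),
        })
    return rows


# Placeholder throughputs for illustration only; use your own benchmarks.
layouts = compare_layouts(total_gpus=8, measured={4: 1400.0, 8: 2400.0})
for row in layouts:
    print(row)
```

With these placeholder inputs the two TP4 replicas come out ahead per GPU, but the comparison says nothing about whether one TP4 replica can hold the largest request. That constraint has to be checked separately.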
Failure domains are wider than one pod. If any rank fails, the TP replica usually fails. Readiness should be collective, draining should stop new admission across all ranks, and restart policy should avoid partial zombie groups.
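Collective readiness can be expressed as a single predicate over all ranks. This is a sketch under assumptions: the `RankStatus` fields and the way statuses are gathered are hypothetical, and a real deployment would wire the result into whatever readiness probe its scheduler provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankStatus:
    """Hypothetical per-rank report; field names are illustrative."""
    rank: int
    healthy: bool
    weights_loaded: bool


def replica_ready(statuses, tp_degree):
    """Ready only if every expected rank reported and every rank is usable."""
    seen_ranks = {status.rank for status in statuses}
    if seen_ranks != set(range(tp_degree)):
        return False  # a missing or unexpected rank means a partial group
    return all(s.healthy and s.weights_loaded for s in statuses)


full_group = [RankStatus(r, True, True) for r in range(4)]
one_rank_down = full_group[:3] + [RankStatus(3, False, True)]
missing_rank = full_group[:3]

print(replica_ready(full_group, tp_degree=4))
print(replica_ready(one_rank_down, tp_degree=4))
print(replica_ready(missing_rank, tp_degree=4))
```

Treating a missing rank the same as an unhealthy one is the design choice that prevents a partial group from accepting traffic.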
Design-review questions
- What is the smallest TP degree that fits the worst request?
- Are all ranks inside the intended fabric domain?
- How does throughput per GPU change with TP degree?
- What happens to in-flight requests when one rank fails?
- Would more replicas beat a wider TP group?
How it connects to the rest of the series
Pipeline parallelism splits depth instead of tensors. Sequence parallelism reduces replicated activations around TP. Expert parallelism distributes MoE experts and adds all-to-all traffic.
From equation to implementation
Column parallelism computes Y_i = X A_i for column shards A_i, leaving output features sharded. Row parallelism consumes input shards X_i, computes partial products X_i B_i, and sums them across ranks with an all-reduce. Alternating the two patterns lets the sharded output of the first layer feed the second directly, which avoids a gather in between.
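The identity behind this pairing can be checked numerically on a toy problem. The sketch below uses plain Python lists and small integer matrices so the comparison is exact. The "ranks" are simulated in one process and the sum stands in for the all-reduce; no real collective library is involved, and the matrix values are arbitrary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Column shards A_i: each part keeps all rows and a slice of columns."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Row shards B_i: each part keeps a contiguous block of rows."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def add(a, b):
    """Elementwise sum, standing in for one step of an all-reduce."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


X = [[1, 2], [3, 4]]                      # 2 tokens x 2 hidden features
A = [[1, 0, 2, 1], [0, 1, 1, 3]]          # 2 x 4, split by columns
B = [[1, 2], [0, 1], [2, 0], [1, 1]]      # 4 x 2, split by rows

reference = matmul(matmul(X, A), B)       # unsharded result

tp_degree = 2
partials = [
    matmul(matmul(X, A_i), B_i)           # each simulated rank works locally
    for A_i, B_i in zip(split_columns(A, tp_degree), split_rows(B, tp_degree))
]

all_reduced = partials[0]
for partial in partials[1:]:
    all_reduced = add(all_reduced, partial)

print(all_reduced == reference)
```

Each simulated rank never needs the other rank's slice of the intermediate Y, so the only cross-rank step is the final sum.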
Collective cost follows topology and message size. Ring all-reduce roughly transfers 2(N-1)/N times the tensor bytes per rank, plus latency across steps. Tree or NVSwitch algorithms differ. The critical point is that decode repeats these collectives for every layer and token.
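The ring formula can be turned into a quick per-token estimate. This sketch assumes a batch-one decode step with a 16,384-wide hidden state at 2 bytes per element, two all-reduces per layer (one after attention, one after the MLP), and an 80-layer model; the layer count and dtype are illustrative assumptions, not figures from the article.

```python
def ring_allreduce_bytes_per_rank(tensor_bytes, num_ranks):
    """Approximate bytes each rank sends in a ring all-reduce."""
    return 2 * (num_ranks - 1) / num_ranks * tensor_bytes


def ring_allreduce_steps(num_ranks):
    """Sequential steps: reduce-scatter plus all-gather phases."""
    return 2 * (num_ranks - 1)


# Assumed decode step: batch 1, hidden 16,384, 2 bytes per element.
tensor_bytes = 1 * 16384 * 2
num_ranks = 4

bytes_per_collective = ring_allreduce_bytes_per_rank(tensor_bytes, num_ranks)
steps_per_collective = ring_allreduce_steps(num_ranks)

# Assumed model shape: 80 layers, 2 all-reduces per layer.
collectives_per_token = 80 * 2
print(f"bytes sent per rank per collective: {bytes_per_collective:.0f}")
print(f"latency-bound steps per token:      "
      f"{collectives_per_token * steps_per_collective}")
```

At this message size the byte count is tiny, so the step count multiplied by per-step latency is the quantity to watch during decode.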
Implementation sketch
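    # Pseudocode: the helper names below are illustrative, not a library API.
    initialize_tp_group(ranks_on_same_nvlink_domain)
    for layer in transformer_layers:
        # Attention: column-parallel QKV, local attention, row-parallel output.
        qkv_shards = column_parallel_linear(hidden)
        attention_shard = local_attention(qkv_shards)
        hidden = row_parallel_linear_and_all_reduce(attention_shard)
        # MLP: column-parallel up projection, row-parallel down projection.
        mlp_shards = column_parallel_linear(hidden)
        hidden = row_parallel_linear_and_all_reduce(mlp_shards)
Capacity planning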
Choose TP primarily to fit weights, KV shards, and workspace, then test the next smaller and larger degrees. More TP reduces bytes per rank but may shrink GEMMs below efficient tile sizes and add ranks to every collective.
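A first-pass fit check might look like the sketch below. It assumes weights and KV cache shard evenly across ranks while workspace is a fixed per-rank cost, and it reserves 10% of device memory as headroom. Those assumptions, the byte figures, and the function names are all illustrative.

```python
GB = 10**9


def per_rank_bytes(weight_bytes, kv_bytes, workspace_bytes, tp_degree):
    """Assumes weights and KV shard evenly; workspace is paid on every rank."""
    return weight_bytes / tp_degree + kv_bytes / tp_degree + workspace_bytes


def smallest_fitting_tp(weight_bytes, kv_bytes, workspace_bytes, gpu_bytes,
                        candidates=(1, 2, 4, 8), headroom=0.9):
    """Return the smallest candidate degree that fits, or None if none do."""
    budget = gpu_bytes * headroom
    for tp_degree in sorted(candidates):
        needed = per_rank_bytes(weight_bytes, kv_bytes, workspace_bytes, tp_degree)
        if needed <= budget:
            return tp_degree
    return None


# Illustrative inputs: 140 GB weights, 40 GB KV cache, 6 GB workspace, 80 GB GPUs.
fit = smallest_fitting_tp(140 * GB, 40 * GB, 6 * GB, 80 * GB)
print(f"smallest fitting TP degree: {fit}")
```

The result is only a starting point. Neighboring degrees that also fit still need to be benchmarked, because the fit check knows nothing about GEMM tile efficiency or collective cost.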
Benchmarking without fooling yourself
- Pin rank placement and record the exact topology.
- Test batch-one decode and realistic continuous batches.
- Capture NCCL traces and per-rank kernel gaps.
- Report tokens per second per GPU, not aggregate alone.
A production failure to design for
A Kubernetes reschedule places half of a TP group across a slower inter-node link. The replica still passes its health checks, but TPOT triples and collective time dominates. Enforce topology constraints and expose TP-group locality in readiness checks.
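A locality check for this failure can be as small as comparing the fabric domain each rank reports. In this sketch the domain labels (here, node names) and how each rank discovers its label are assumptions; the functions only show the comparison itself.

```python
def tp_group_domains(rank_to_domain):
    """Group rank ids by the fabric domain label each rank reports."""
    domains = {}
    for rank, domain in rank_to_domain.items():
        domains.setdefault(domain, []).append(rank)
    return domains


def tp_group_is_local(rank_to_domain):
    """True only when every rank of the group sits in one fabric domain."""
    return len(tp_group_domains(rank_to_domain)) == 1


# Hypothetical placements: labels would come from the scheduler or node metadata.
intended = {0: "node-a", 1: "node-a", 2: "node-a", 3: "node-a"}
rescheduled = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b"}

print(tp_group_is_local(intended))
print(tp_group_is_local(rescheduled))
print(tp_group_domains(rescheduled))
```

Failing readiness when this predicate is false turns a silent latency regression into a visible scheduling error.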
The takeaway
Tensor parallelism buys memory and compute with communication. The winning configuration is the smallest TP degree that fits and performs well on the actual topology.
