Pipeline Parallelism: Turning Model Depth into an Assembly Line

#pipeline-parallelism #multi-gpu #inference #model-parallelism #gpu

Tensor parallelism divides every layer. Pipeline parallelism assigns whole groups of layers to different devices. The model becomes an assembly line: activations move forward through stages, and utilization depends on keeping every stage busy.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A factory with four stations can build several products at once, each at a different station. But the first product waits for all four stations, and a slow station makes every other station idle behind it.

What actually happens

The model’s ordered layers are partitioned into stages. Stage 0 receives token activations, executes its layers, and sends activations to stage 1. Each stage stores only its assigned weights, reducing per-device model memory.

Microbatches fill the pipeline so different requests or batch slices occupy different stages concurrently. Fill and drain periods create a pipeline bubble. More microbatches amortize that bubble but can increase queueing and memory.

In autoregressive inference, every generated token traverses all stages, so the point-to-point transfers and any stage imbalance are paid again for each token. Pipeline parallelism can make a model fit, but it does not automatically reduce single-request latency.

A worked example

A 48-layer model is split into four 12-layer stages, numbered 0 to 3. With one request, only one stage works at a time while the other three wait. With eight microbatches, numbered 0 to 7, stage 0 can process microbatch 3 while stage 1 handles 2, stage 2 handles 1, and stage 3 handles 0. Throughput rises once the pipe fills.
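The schedule in this example can be tabulated in a few lines of Python. This is a minimal sketch under idealized assumptions: every stage takes exactly one clock tick per microbatch and transfers are free. The name `forward_schedule` is illustrative, not part of any library, and both stages and microbatches are numbered from zero here.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return grid[clock][stage] = microbatch index, or None if the stage is idle."""
    num_clocks = num_microbatches + num_stages - 1
    grid = []
    for clock in range(num_clocks):
        row = []
        for stage in range(num_stages):
            mb = clock - stage  # microbatch mb reaches stage s at clock mb + s
            row.append(mb if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


if __name__ == "__main__":
    # Four stages, eight microbatches, as in the worked example.
    for clock, row in enumerate(forward_schedule(4, 8)):
        cells = ["--" if mb is None else f"m{mb}" for mb in row]
        print(f"clock {clock:2d}: " + "  ".join(cells))
```

The idle cells in the first and last few rows are the fill and drain periods. With a single microbatch, every row has exactly one busy stage, which is the sequential behavior described for one request.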

The performance model

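For balanced stages and a simple forward pipeline, ideal utilization is M / (M + P - 1), where M is the number of microbatches and P the number of stages. With four stages and eight microbatches that is 8/11, or about 73 percent, and it approaches 100 percent as M grows. Real systems add transfer time, unequal layer cost, and decode synchronization, so measured utilization is lower.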
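To see how unequal stage cost erodes the ideal figure, the balanced formula can be generalized slightly. The sketch below assumes a forward-only pipeline in which every microbatch costs the same fixed time on a given stage, transfers are free, and inter-stage buffers never block; under those assumptions the first microbatch takes the sum of all stage times and each later one completes one slowest-stage time after the previous. The function name and the sample timings are hypothetical.

```python
def pipeline_utilization(stage_times, num_microbatches):
    """Fraction of total stage-time spent computing in a forward-only pipeline.

    Assumes each microbatch costs stage_times[i] on stage i, transfers are
    free, and inter-stage buffers never block.
    """
    if not stage_times or num_microbatches < 1:
        raise ValueError("need at least one stage and one microbatch")
    total = sum(stage_times)
    slowest = max(stage_times)
    # The first microbatch takes `total`; each later one finishes
    # `slowest` after its predecessor because the slowest stage gates the line.
    makespan = total + (num_microbatches - 1) * slowest
    busy = num_microbatches * total
    return busy / (len(stage_times) * makespan)


if __name__ == "__main__":
    balanced = pipeline_utilization([1.0, 1.0, 1.0, 1.0], 8)
    skewed = pipeline_utilization([1.0, 1.0, 1.0, 1.35], 8)  # hypothetical slow last stage
    print(f"balanced: {balanced:.3f}  skewed: {skewed:.3f}")
```

With equal stage times the expression reduces to microbatches divided by microbatches plus stages minus one. Any slower stage lowers the result, because the other stages wait on it in steady state.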

Expert lens

Layer count is a poor balancing proxy when attention, MoE, embeddings, and final projection have different cost. Profile layer time and activation size, then place boundaries to balance compute and avoid expensive cross-node transfers.
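One way to turn profiled per-layer cost into stage boundaries is a small dynamic program that splits the layers into contiguous groups while minimizing the cost of the slowest group. This is a sketch of that one technique, not a description of how any particular engine partitions; it balances a single cost number per layer and ignores activation size and link topology, which a real plan would also weigh. The function name and sample costs are made up for illustration.

```python
from itertools import accumulate


def partition_layers(layer_costs, num_stages):
    """Split layers into contiguous stages, minimizing the slowest stage's cost.

    Returns a list of (start, end) half-open index ranges, one per stage.
    """
    n = len(layer_costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need 1 <= num_stages <= number of layers")
    prefix = [0.0] + list(accumulate(layer_costs))
    inf = float("inf")
    # best[k][i]: minimal bottleneck cost when the first i layers form k stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):  # the last stage covers layers j..i-1
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i], cut[k][i] = cost, j
    # Walk the recorded cut points backwards to recover the boundaries.
    bounds, end = [], n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        bounds.append((start, end))
        end = start
    return bounds[::-1]


if __name__ == "__main__":
    # Hypothetical profile: six equal layers plus one expensive final projection.
    costs = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0]
    print(partition_layers(costs, 2))
```

In the hypothetical profile, an even split by layer count would put four layers in one stage and three in the other, leaving the second stage with the expensive projection and a higher total cost. The cost-based split gives the second stage fewer layers instead.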

Pipeline parallelism does not remove work. It trades lower per-device weight memory for activation transfers between stages and idle time in the bubble.

Where it wins

Very deep models that exceed one GPU
Throughput workloads with enough microbatches
Topologies with fast adjacent-stage links

Where it disappoints

Splitting stages by layer count without profiling
Expecting lower latency for a single request
Crossing slow nodes at every stage boundary
Ignoring the first and last stage imbalance

Production checklist

Profile per-layer time and activation bytes
Choose boundaries from measured cost
Use enough microbatches to amortize bubbles
Keep adjacent stages on suitable links
Measure prefill and decode separately

What to measure

Stage utilization and idle bubble time
Per-stage latency and memory
Activation transfer duration
End-to-end token pipeline latency
Throughput by microbatch count

From one GPU to a production service

A local pipeline can assume ranks start together. A production scheduler must create the entire gang, place adjacent stages on suitable links, load only each stage’s weights, and expose one logical endpoint. Partial readiness is not readiness.

Traffic classes affect fill. Online requests may not supply enough microbatches to keep many stages busy, while offline jobs can. A shared deployment can reserve a low-latency lane or use separate replicas rather than forcing one schedule to satisfy both.

Model changes can move the bottleneck. Adding MoE layers, a larger vocabulary head, or a vision encoder changes stage balance. Store profiling data with the partition plan and rebuild boundaries when architecture changes.

Design-review questions

Can the scheduler place all stages atomically?
Which stage limits steady throughput and peak memory?
How many microbatches are needed to reach useful utilization?
Does online latency justify a separate topology?
How is a failed stage drained and reconstructed?

How it connects to the rest of the series

Tensor parallelism splits within a stage and is often combined with pipeline parallelism. Sequence parallelism reduces activation replication. Expert parallelism adds a second routing topology inside MoE stages.

From equation to implementation

For P pipeline stages and M microbatches, a simple forward-only schedule needs M + P - 1 time slots, and each stage is busy for only M of them, so the ideal bubble fraction is (P - 1)/(M + P - 1). This is only a mental model: unequal stages and activation transfer widen the real bubble.
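The bubble formula can be inverted to answer how many microbatches a given stage count needs. The helper below is a small sketch with illustrative names; it uses the ideal forward-only formula only, so the real requirement will be higher once transfer time and imbalance are included.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Ideal idle fraction (P - 1) / (M + P - 1) for a forward-only schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def min_microbatches(num_stages, max_bubble):
    """Smallest M whose ideal bubble fraction is at most max_bubble."""
    if max_bubble <= 0:
        raise ValueError("max_bubble must be positive")
    m = 1
    # The bubble shrinks monotonically as M grows, so a linear search suffices.
    while bubble_fraction(num_stages, m) > max_bubble:
        m += 1
    return m


if __name__ == "__main__":
    print(bubble_fraction(4, 8))       # four stages, eight microbatches
    print(min_microbatches(4, 0.25))   # microbatches needed for a 25 percent bubble
```

Because the bubble depends on P - 1, doubling the number of stages roughly doubles the microbatch count needed to hold the same bubble fraction.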

Inference has two schedules. Prefill can pipeline prompt microbatches efficiently. Decode sends small activations through every stage once per token, making per-stage launch and transfer latency much more visible. Some engines batch many sequences to restore stage efficiency.
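A rough additive model shows why decode is sensitive to fixed per-stage costs. The sketch below assumes a single sequence, no overlap between compute and transfer, one launch overhead per stage, and one transfer per stage boundary. The function names and the millisecond figures are hypothetical, chosen only to illustrate the arithmetic.

```python
def decode_step_latency_ms(stage_compute_ms, launch_overhead_ms, transfer_ms):
    """Rough latency of one decode step for a single sequence.

    The token visits every stage in order, so stage costs add up instead of
    overlapping; there is one hop between each pair of adjacent stages.
    """
    num_stages = len(stage_compute_ms)
    compute = sum(stage_compute_ms)
    overhead = num_stages * launch_overhead_ms
    hops = (num_stages - 1) * transfer_ms
    return compute + overhead + hops


def overhead_share(stage_compute_ms, launch_overhead_ms, transfer_ms):
    """Fraction of the step spent on launch and transfer rather than compute."""
    total = decode_step_latency_ms(stage_compute_ms, launch_overhead_ms, transfer_ms)
    return 1 - sum(stage_compute_ms) / total


if __name__ == "__main__":
    # Hypothetical numbers: 2 ms compute per stage, 0.5 ms launch, 1 ms per hop.
    print(decode_step_latency_ms([2.0, 2.0, 2.0, 2.0], 0.5, 1.0))
    print(overhead_share([2.0, 2.0, 2.0, 2.0], 0.5, 1.0))
```

Batching more sequences per step raises the compute term while the fixed launch and hop terms stay roughly constant, which is why batching restores stage efficiency.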

Implementation sketch

profile_each_layer()
partition_layers_to_balance_time_and_memory(P)
for microbatch in input_batch:
    stage0.enqueue(microbatch)
for clock in pipeline_schedule:
    for stage in stages:
        receive_activation_if_ready(stage)
        run_assigned_layers(stage)
        send_activation_to_next_stage(stage)
collect_logits_in_original_request_order()
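The sketch above can be made concrete as a single-process simulation. In the version below each stage is a plain Python callable standing in for a group of layers, and a deque stands in for the link between stages; a real engine would run stages on separate devices and move tensors with point-to-point communication. The names are illustrative and the profiling and partitioning steps are left out.

```python
from collections import deque


def run_pipeline(stages, microbatches):
    """Push microbatches through `stages` one hop per clock.

    Returns (outputs in original input order, number of clocks used).
    """
    if not stages:
        raise ValueError("need at least one stage")
    inboxes = [deque() for _ in stages]
    for index, mb in enumerate(microbatches):
        inboxes[0].append((index, mb))  # tag with position to restore order later
    outputs = {}
    clock = 0
    while len(outputs) < len(microbatches):
        # Visit the last stage first so work sent this clock is consumed next clock.
        for s in reversed(range(len(stages))):
            if not inboxes[s]:
                continue  # idle slot: part of the pipeline bubble
            index, activation = inboxes[s].popleft()
            result = stages[s](activation)
            if s + 1 < len(stages):
                inboxes[s + 1].append((index, result))
            else:
                outputs[index] = result
        clock += 1
    return [outputs[i] for i in range(len(microbatches))], clock


if __name__ == "__main__":
    # Toy stages: each lambda stands in for a block of layers.
    toy_stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
    results, clocks = run_pipeline(toy_stages, list(range(8)))
    print(results, clocks)
```

Iterating stages from last to first is a simulation detail: it prevents an activation from passing through several stages within one clock, so the clock count matches the fill-and-drain behavior of a real pipeline.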

Capacity planning

Each stage needs its weights, activation buffers for in-flight microbatches, KV state for its layers, and communication workspace. The first and last stages may also own embeddings or vocabulary projection, so equal layer counts rarely mean equal memory.
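The memory terms above can be added up per stage for a first-pass estimate. This is a back-of-the-envelope sketch: all parameter names are hypothetical, the inputs are assumed to come from profiling or the model configuration, and allocator fragmentation and framework overhead are ignored.

```python
def stage_memory_bytes(
    weight_bytes,
    activation_bytes_per_microbatch,
    in_flight_microbatches,
    kv_bytes_per_token_per_layer,
    num_layers,
    cached_tokens,
    workspace_bytes=0,
    extra_weight_bytes=0,
):
    """Estimate one stage's memory: weights, activations, KV state, workspace.

    extra_weight_bytes covers embeddings or the vocabulary projection when
    the stage owns them (typically the first or last stage).
    """
    weights = weight_bytes + extra_weight_bytes
    activations = activation_bytes_per_microbatch * in_flight_microbatches
    kv_cache = kv_bytes_per_token_per_layer * num_layers * cached_tokens
    return weights + activations + kv_cache + workspace_bytes


if __name__ == "__main__":
    gib = 1024 ** 3
    # Hypothetical figures for a middle stage and a last stage with the output head.
    middle = stage_memory_bytes(10 * gib, 64 * 1024**2, 8, 4096, 12, 100_000)
    last = stage_memory_bytes(10 * gib, 64 * 1024**2, 8, 4096, 12, 100_000,
                              extra_weight_bytes=2 * gib)
    print(middle / gib, last / gib)
```

Running the estimate for every stage, rather than one representative stage, is what exposes the first-and-last-stage asymmetry.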

Benchmarking without fooling yourself

Plot stage utilization on a shared timeline.
Sweep microbatch count until gains flatten or memory rises sharply.
Measure first-result latency separately from steady throughput.
Inject a deliberately slow stage to verify backpressure and queue bounds.

A production failure to design for

One stage contains the final vocabulary projection and takes 35 percent longer than the others. Upstream activation queues grow until memory pressure causes retries. Rebalance using measured stage time and bound inter-stage buffers so imbalance fails visibly.
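Bounding the inter-stage buffer can be sketched with the standard library's `queue.Queue`, which raises `queue.Full` when a non-blocking put exceeds `maxsize`. The wrapper class and its `rejected` counter are illustrative stand-ins for whatever link and metrics a real deployment uses; the point is only that a full buffer produces an explicit signal instead of silent memory growth.

```python
import queue


class BoundedStageLink:
    """Inter-stage buffer that refuses work instead of growing without limit."""

    def __init__(self, capacity):
        if capacity < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the purpose.
            raise ValueError("capacity must be at least 1")
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0  # sends refused because the buffer was full

    def try_send(self, activation):
        """Return True if enqueued, False if the downstream stage is backed up."""
        try:
            self._queue.put_nowait(activation)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def receive(self):
        """Return the next activation, or None if nothing is waiting."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None

    def depth(self):
        """Current number of buffered activations."""
        return self._queue.qsize()


if __name__ == "__main__":
    link = BoundedStageLink(capacity=2)
    for mb in ["mb0", "mb1", "mb2"]:
        if not link.try_send(mb):
            print(f"backpressure: {mb} not accepted, depth={link.depth()}")
```

An upstream stage that sees `try_send` return False can stall or shed load, and a rising rejection count points directly at the slow stage.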

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Pipeline parallelism is a capacity and throughput tool. It shines when the assembly line stays full and every station takes roughly the same time.