11/20 - Pipeline Parallelism: Turning Model Depth into an Assembly Line

#pipeline-parallelism #multi-gpu #inference #model-parallelism #gpu

Tensor parallelism divides every layer. Pipeline parallelism assigns whole groups of layers to different devices. The model becomes an assembly line: activations move forward through stages, and utilization depends on keeping every stage busy.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A factory with four stations can build several products at once, each at a different station. But every product still has to pass through all four stations, and a slow station leaves the others idle while they wait on it.

Follow the state and work from left to right.

Description: Start with Microbatch, which enters stage 1. The middle step, Pipeline handoff, sends activations to the next stage. The last step, Final stage, shows the observable result: it runs the last layers. The arrows describe dependency order, not necessarily separate services.

What actually happens

The model’s ordered layers are partitioned into stages. The first stage receives the input tokens, executes its layers, and sends activations to the next stage, which does the same until the last stage produces the output. Each stage stores only its assigned weights, reducing per-device model memory.

Microbatches fill the pipeline so different requests or batch slices occupy different stages concurrently. Fill and drain periods create a pipeline bubble. More microbatches amortize that bubble but can increase queueing and memory.

In autoregressive inference, every generated token traverses all stages. Point-to-point transfers and stage imbalance repeat on the token path. Pipeline parallelism can make a model fit, but it does not automatically reduce single-request latency.

A worked example

A 48-layer model is split into four 12-layer stages. With one request, only one stage is busy at a time while the other three wait. With eight microbatches, by the fourth step stage 1 can process microbatch 4 while stage 2 handles 3, stage 3 handles 2, and stage 4 handles 1. Throughput rises after the pipe fills.
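The schedule in this example can be printed as a small grid. This is a minimal sketch assuming perfectly balanced stages and a forward-only schedule in which each microbatch advances one stage per step; the function name `forward_schedule` is illustrative, not taken from any library.

```python
from typing import List, Optional


def forward_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """grid[t][s] is the 1-based microbatch on stage s+1 at step t+1, or None if idle."""
    steps = num_microbatches + num_stages - 1
    grid = []
    for t in range(steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # 0-based microbatch that reaches stage s at step t
            row.append(mb + 1 if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


grid = forward_schedule(num_stages=4, num_microbatches=8)
for step, row in enumerate(grid, start=1):
    cells = "".join("   ." if mb is None else f"{mb:4d}" for mb in row)
    print(f"step {step:2d}:{cells}")
```

Step 4 is the first row with no idle stage, holding microbatches 4, 3, 2, and 1 on stages 1 to 4. The first three rows are the fill and the last three are the drain.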

The performance model

For balanced stages, the ideal utilization of a simple forward pipeline is M / (M + P - 1), where M is the number of microbatches and P is the number of stages; it approaches 1 as M grows. With M = 8 and P = 4, that is 8/11, or about 73 percent. Real systems add transfer time, unequal layer cost, and decode synchronization.

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how pipeline parallelism changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Layer count is a poor balancing proxy when attention, MoE, embeddings, and final projection have different cost. Profile layer time and activation size, then place boundaries to balance compute and avoid expensive cross-node transfers.
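One way to turn measured layer costs into boundaries is a small dynamic program over contiguous splits. The sketch below minimizes the cost of the slowest stage; the function name and the per-layer costs are hypothetical, and a real plan would also weigh activation size and link placement, which this sketch ignores.

```python
def partition_layers(costs, num_stages):
    """Split layers into contiguous stages, minimizing the largest stage cost.

    Returns a list of (start, end) layer index ranges, end exclusive.
    Assumes len(costs) >= num_stages.
    """
    n = len(costs)
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    inf = float("inf")
    # best[k][i]: smallest achievable max-stage cost for the first i layers in k stages
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                candidate = max(best[k - 1][j], prefix[i] - prefix[j])
                if candidate < best[k][i]:
                    best[k][i] = candidate
                    cut[k][i] = j

    # Walk the recorded cuts backwards to recover the stage ranges.
    bounds = []
    i = n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return bounds


# Hypothetical profile: 12 layers, the last one also carries the vocabulary projection.
layer_costs = [1.0] * 11 + [4.0]
bounds = partition_layers(layer_costs, num_stages=4)
stage_costs = [sum(layer_costs[a:b]) for a, b in bounds]
print(bounds, stage_costs)
```

With these made-up costs, an equal split of three layers per stage would leave the last stage at 6.0 while the others sit at 3.0; the cost-based split brings the slowest stage down to 4.0.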

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, Single-device depth, characterized by all layers on one GPU and no stage transfers. The right panel applies Pipeline parallel depth, changing the cost profile to layer groups on devices and activation handoffs. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Very deep models that exceed one GPU
Throughput workloads with enough microbatches
Topologies with fast adjacent-stage links

Where it disappoints

Splitting stages by layer count without profiling
Expecting lower latency for a single request
Crossing slow inter-node links at every stage boundary
Ignoring the extra cost carried by the first and last stages

Production checklist

Profile per-layer time and activation bytes
Choose boundaries from measured cost
Use enough microbatches to amortize bubbles
Keep adjacent stages on suitable links
Measure prefill and decode separately

What to measure

Stage utilization and idle bubble time
Per-stage latency and memory
Activation transfer duration
End-to-end token pipeline latency
Throughput by microbatch count

From one GPU to a production service

A local pipeline can assume ranks start together. A production scheduler must create the entire gang, place adjacent stages on suitable links, load only each stage’s weights, and expose one logical endpoint. Partial readiness is not readiness.

Traffic classes affect fill. Online requests may not supply enough microbatches to keep many stages busy, while offline jobs can. A shared deployment can reserve a low-latency lane or use separate replicas rather than forcing one schedule to satisfy both.

Model changes can move the bottleneck. Adding MoE layers, a larger vocabulary head, or a vision encoder changes stage balance. Store profiling data with the partition plan and rebuild boundaries when architecture changes.

Design-review questions

Can the scheduler place all stages atomically?
Which stage limits steady throughput and peak memory?
How many microbatches are needed to reach useful utilization?
Does online latency justify a separate topology?
How is a failed stage drained and reconstructed?

How it connects to the rest of the series

Tensor parallelism splits within a stage and is often combined with pipeline parallelism. Sequence parallelism reduces activation replication. Expert parallelism adds a second routing topology inside MoE stages.

From equation to implementation

For P pipeline stages and M microbatches, a simple forward-only schedule takes roughly M + P - 1 time slots, each one stage-time long, so the ideal bubble fraction is about (P - 1)/(M + P - 1). This is only a mental model: unequal stages and activation transfer widen the real bubble.
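The formula is easy to turn into a planning helper. This is a sketch of the idealized model only (balanced stages, no transfer cost); the function names are illustrative.

```python
def utilization(m: int, p: int) -> float:
    """Ideal utilization for m microbatches on p balanced stages."""
    return m / (m + p - 1)


def bubble_fraction(m: int, p: int) -> float:
    """Ideal share of stage-time slots lost to fill and drain."""
    return (p - 1) / (m + p - 1)


def min_microbatches(p: int, target_utilization: float) -> int:
    """Smallest m whose ideal utilization reaches the target."""
    if not 0.0 < target_utilization < 1.0:
        raise ValueError("target_utilization must be strictly between 0 and 1")
    m = 1
    while utilization(m, p) < target_utilization:
        m += 1
    return m


for m in (1, 4, 8, 32):
    print(m, round(utilization(m, 4), 3), round(bubble_fraction(m, 4), 3))
print(min_microbatches(4, 0.9))
```

Under this model, four stages need 27 microbatches to reach 90 percent ideal utilization, which is one reason online traffic often cannot fill a deep pipeline.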

Inference has two schedules. Prefill can pipeline prompt microbatches efficiently. Decode sends small activations through every stage once per token, making per-stage launch and transfer latency much more visible. Some engines batch many sequences to restore stage efficiency.
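A rough additive model shows why decode latency does not shrink with more stages. The sketch below assumes one sequence, stages that run strictly one after another for each token, and a fixed cost per activation hop and per stage launch; all names and numbers are hypothetical.

```python
def decode_token_latency_ms(stage_compute_ms, transfer_ms, launch_overhead_ms=0.0):
    """Latency of one decode step for one sequence across all stages.

    Stages execute serially for a single token, so compute, per-stage launch
    overhead, and the hops between adjacent stages all add up.
    """
    num_stages = len(stage_compute_ms)
    hops = num_stages - 1
    return sum(stage_compute_ms) + num_stages * launch_overhead_ms + hops * transfer_ms


# Hypothetical numbers: the same 12 ms of layer compute, on one device or on four stages.
single_device = decode_token_latency_ms([12.0], transfer_ms=0.5, launch_overhead_ms=0.25)
four_stages = decode_token_latency_ms([3.0] * 4, transfer_ms=0.5, launch_overhead_ms=0.25)
print(single_device, four_stages)
```

In this model the pipelined configuration is slower per token than the single device, because the compute total is unchanged while hops and launches are added. Batching many sequences recovers throughput, not single-request latency.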

Implementation sketch

profile_each_layer()
partition_layers_to_balance_time_and_memory(P)
for microbatch in input_batch:
    stage0.enqueue(microbatch)
for clock in pipeline_schedule:
    for stage in stages:
        receive_activation_if_ready(stage)
        run_assigned_layers(stage)
        send_activation_to_next_stage(stage)
collect_logits_in_original_request_order()
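The sketch above can be made runnable as a single-process simulation. This is a toy under stated assumptions: stages are plain Python objects rather than devices, each stage handles one microbatch per clock tick, and layers are stand-in functions. The names `Stage` and `run_pipeline` are illustrative.

```python
from collections import deque


class Stage:
    def __init__(self, layers):
        self.layers = layers   # callables applied in order
        self.inbox = deque()   # (microbatch_id, activation) pairs waiting to run

    def run(self, activation):
        for layer in self.layers:
            activation = layer(activation)
        return activation


def run_pipeline(stages, microbatches):
    """Return (outputs in input order, timeline of which microbatch each stage ran)."""
    for mb_id, x in enumerate(microbatches):
        stages[0].inbox.append((mb_id, x))

    outputs = {}
    timeline = []
    while len(outputs) < len(microbatches):
        busy = [None] * len(stages)
        # Visit stages last-to-first so an activation sent during this tick is
        # picked up by the next stage on the following tick, not the same one.
        for s in reversed(range(len(stages))):
            if not stages[s].inbox:
                continue
            mb_id, x = stages[s].inbox.popleft()
            y = stages[s].run(x)
            busy[s] = mb_id
            if s + 1 < len(stages):
                stages[s + 1].inbox.append((mb_id, y))
            else:
                outputs[mb_id] = y
        timeline.append(busy)
    return [outputs[i] for i in range(len(microbatches))], timeline


def add_one(x):
    return x + 1  # stand-in for a real layer


stages = [Stage([add_one] * 12) for _ in range(4)]  # 48 layers in four stages
outputs, timeline = run_pipeline(stages, [0, 100, 200, 300, 400, 500, 600, 700])
print(outputs)
print(len(timeline), "ticks")
```

Each output is its input plus 48, confirming that every microbatch passed through all layers in order, and the run takes M + P - 1 = 11 ticks.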

Capacity planning

Each stage needs its weights, activation buffers for in-flight microbatches, KV state for its layers, and communication workspace. The first and last stages may also own embeddings or vocabulary projection, so equal layer counts rarely mean equal memory.
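A back-of-envelope estimate makes the imbalance concrete. The sketch below covers weights, KV cache, and boundary activation buffers, and leaves out communication workspace and allocator overhead. The KV term uses the usual two tensors (keys and values) per layer per cached token; every model dimension in the example is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass
class StagePlan:
    num_layers: int
    params_per_layer: int
    extra_params: int = 0  # e.g. embedding table or vocabulary projection


def stage_memory_bytes(stage, *, bytes_per_param, num_kv_heads, head_dim,
                       cached_tokens, kv_bytes_per_elem, hidden_size,
                       act_bytes_per_elem, tokens_per_microbatch,
                       inflight_microbatches):
    """Rough per-stage memory; excludes communication workspace."""
    weights = (stage.num_layers * stage.params_per_layer + stage.extra_params) * bytes_per_param
    # Keys and values for this stage's layers only, for every cached token.
    kv_cache = 2 * stage.num_layers * num_kv_heads * head_dim * cached_tokens * kv_bytes_per_elem
    # One boundary activation buffer per in-flight microbatch.
    activations = inflight_microbatches * tokens_per_microbatch * hidden_size * act_bytes_per_elem
    return {
        "weights": weights,
        "kv_cache": kv_cache,
        "activation_buffers": activations,
        "total": weights + kv_cache + activations,
    }


# Hypothetical shape shared by all stages.
shape = dict(bytes_per_param=2, num_kv_heads=8, head_dim=128, cached_tokens=65536,
             kv_bytes_per_elem=2, hidden_size=8192, act_bytes_per_elem=2,
             tokens_per_microbatch=2048, inflight_microbatches=4)
middle = stage_memory_bytes(StagePlan(12, 200_000_000), **shape)
last = stage_memory_bytes(StagePlan(12, 200_000_000, extra_params=500_000_000), **shape)
print(middle["total"] / 2**30, last["total"] / 2**30)
```

Both stages hold 12 layers, yet the last stage carries an extra gigabyte of weights in this example purely because it owns the vocabulary projection.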

Benchmarking without fooling yourself

Plot stage utilization on a shared timeline.
Sweep microbatch count until gains flatten or memory rises sharply.
Measure first-result latency separately from steady throughput.
Inject a deliberately slow stage to verify backpressure and queue bounds.

A production failure to design for

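One stage contains the final vocabulary projection and takes 35 percent longer than the others. Because upstream stages produce activations faster than it can consume them, upstream activation queues grow until memory pressure causes retries. Rebalance using measured stage time, and bound inter-stage buffers so that imbalance shows up as visible backpressure instead of silent memory growth.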
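This failure can be reproduced in a tick-based toy model of two adjacent stages, where the downstream stage takes 35 percent longer per microbatch. The simulation is a sketch under simple assumptions (fixed service times, one queue, no transfer cost), and the function name is illustrative.

```python
def simulate_two_stages(upstream_ticks, downstream_ticks, num_items, capacity=None):
    """Simulate a fast stage feeding a slow one through a queue.

    capacity=None means an unbounded queue; otherwise the upstream stage may
    only start new work while fewer than `capacity` activations are waiting.
    """
    if capacity is not None and capacity < 1:
        raise ValueError("capacity must be at least 1")
    queue = peak = started = consumed = 0
    up_left = down_left = 0  # ticks remaining on the item in progress
    ticks = stalled = 0
    while consumed < num_items:
        if up_left == 0 and started < num_items:
            if capacity is None or queue < capacity:
                up_left = upstream_ticks
                started += 1
            else:
                stalled += 1  # backpressure: upstream waits instead of buffering
        if down_left == 0 and queue > 0:
            queue -= 1
            down_left = downstream_ticks
        if up_left > 0:
            up_left -= 1
            if up_left == 0:
                queue += 1
                peak = max(peak, queue)
        if down_left > 0:
            down_left -= 1
            if down_left == 0:
                consumed += 1
        ticks += 1
    return {"ticks": ticks, "peak_queue": peak, "upstream_stalled_ticks": stalled}


unbounded = simulate_two_stages(100, 135, num_items=40)
bounded = simulate_two_stages(100, 135, num_items=40, capacity=2)
print(unbounded)
print(bounded)
```

In this model the bounded queue finishes in the same number of ticks, because the slow stage sets the pace either way. What changes is where the imbalance appears: as stalled upstream ticks that can be measured and alerted on, rather than as a queue that keeps growing.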

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Profile to Partition, then Fill and Operate. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Pipeline parallelism assigns consecutive layer ranges to stages and sends activations between them. Microbatches keep stages busy concurrently, but a request still traverses every stage. The slowest stage sets steady-state throughput, while fill and drain bubbles determine efficiency for short runs.
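Both effects fit in one formula for identical microbatches flowing through stages with fixed times: the first microbatch pays the sum of all stage times, and each later one adds the slowest stage time. The sketch below assumes zero transfer cost and enough buffering that only the slowest stage limits flow; the names and timings are hypothetical.

```python
def pipeline_time(stage_times, num_microbatches):
    """Total time for identical microbatches through a forward-only pipeline."""
    return sum(stage_times) + (num_microbatches - 1) * max(stage_times)


def steady_throughput(stage_times):
    """Microbatches completed per unit time once the pipeline is full."""
    return 1.0 / max(stage_times)


unbalanced = [10.0, 10.0, 10.0, 13.5]   # last stage 35 percent slower
balanced = [10.875] * 4                 # same total work, evenly spread

for m in (1, 8, 64):
    print(m, pipeline_time(unbalanced, m), pipeline_time(balanced, m))
print(steady_throughput(unbalanced), steady_throughput(balanced))
```

With one microbatch the two layouts take the same 43.5 time units, since fill dominates. The gap opens as the run gets longer and the slowest stage dictates the pace.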

Pipelining overlaps requests or microbatches across model depth.

Description: Follow the state from Stage 1 through Stage 2 and Stage 3 to Stage 4. Each box is an ownership or computation boundary. In particular, steady throughput is constrained by the slowest stage plus activation transfer. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Balance by measured stage time, not layer count. Attention, MoE blocks, vocabulary projection, and communication can differ sharply. Boundaries also change activation size, so an apparently balanced split may overload a link. Reprofile after quantization, kernel, sequence-length, or hardware changes.

The best microbatch count depends on latency SLO and stage balance.

Description: The bars compare Unbalanced stages with Balanced pipeline on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (layer groups fit across multiple device memories) remains larger than the risk (more microbatches improve fill but increase queueing and memory) under production traffic.

Inference scheduling differs from training. Autoregressive decode revisits the pipeline for every token and can interleave many sequences. Each stage needs bounded input buffers and cancellation propagation. A failed downstream stage must release upstream reservations to avoid deadlock.

Explicit handoff state prevents duplicate work and stranded buffers.

Description: State advances from Queued to Executing, Transferring, and finally Accepted. The labels below each state identify what becomes true at that boundary. The governing invariant is: Ownership transfers exactly once; cancellation follows the same stage sequence. Retries and cancellation must preserve the same transition rules.
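The handoff states can be enforced with a small state machine. This is a sketch: the four forward states come from the figure, while the `CANCELLED` state, the class names, and the `owner_stage` field are assumptions added here for illustration.

```python
from enum import Enum, auto


class HandoffState(Enum):
    QUEUED = auto()
    EXECUTING = auto()
    TRANSFERRING = auto()
    ACCEPTED = auto()
    CANCELLED = auto()  # assumed terminal state for cancelled work


_NEXT = {
    HandoffState.QUEUED: HandoffState.EXECUTING,
    HandoffState.EXECUTING: HandoffState.TRANSFERRING,
    HandoffState.TRANSFERRING: HandoffState.ACCEPTED,
}


class Handoff:
    """Tracks one microbatch as it moves from one stage to the next."""

    def __init__(self, microbatch_id, owner_stage):
        self.microbatch_id = microbatch_id
        self.owner_stage = owner_stage
        self.state = HandoffState.QUEUED

    def advance(self):
        if self.state not in _NEXT:
            raise RuntimeError(f"cannot advance from {self.state.name}")
        self.state = _NEXT[self.state]
        if self.state is HandoffState.ACCEPTED:
            # Ownership moves exactly once, at acceptance by the next stage.
            self.owner_stage += 1
        return self.state

    def cancel(self):
        """Cancel if still in flight; returns False once accepted or cancelled."""
        if self.state in (HandoffState.ACCEPTED, HandoffState.CANCELLED):
            return False
        self.state = HandoffState.CANCELLED
        return True


handoff = Handoff(microbatch_id=0, owner_stage=1)
while handoff.state is not HandoffState.ACCEPTED:
    print(handoff.advance().name)
print("owner stage:", handoff.owner_stage)
```

A retry that calls `advance()` on an already accepted handoff raises instead of transferring ownership a second time, which is the duplicate-work case the invariant is meant to rule out.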

A pipeline is stable only when every stage can shed pressure upstream.

Description: The four panels are independent review axes: Partition, Microbatch, Topology, and Scheduler. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Optimize the entire stage graph, not each GPU independently.

Backpressure must cross stage boundaries before memory is exhausted.

Description: This is a causal chain, not four unrelated symptoms. The first box, "One stage slows", triggers "Upstream blocks", which creates "Pipeline stalls". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Pipeline parallelism is a capacity and throughput tool. It shines when the assembly line stays full and every station takes roughly the same time.
