11/20 - Pipeline Parallelism: Turning Model Depth into an Assembly Line
Tensor parallelism divides every layer. Pipeline parallelism assigns whole groups of layers to different devices. The model becomes an assembly line: activations move forward through stages, and utilization depends on keeping every stage busy.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A factory with four stations can build several products at once, each at a different station. But the first product still has to pass through all four stations, and a slow station backs up the stations before it while the stations after it sit idle.
How to read this diagram: Start with Microbatch, where a microbatch enters stage 1. The middle step, Pipeline handoff, sends activations to the next stage. The last step, Final stage, shows the observable result: the last layers run. The arrows describe dependency order, not necessarily separate services.
What actually happens
The model’s ordered layers are partitioned into stages. The first stage receives the input, executes its layers, and sends activations to the second stage, and so on down the line. Each stage stores only its assigned weights, reducing per-device model memory.
Microbatches fill the pipeline so different requests or batch slices occupy different stages concurrently. Fill and drain periods create a pipeline bubble. More microbatches amortize that bubble but can increase queueing and memory.
In autoregressive inference, every generated token traverses all stages. Point-to-point transfers and stage imbalance repeat on the token path. Pipeline parallelism can make a model fit, but it does not automatically reduce single-request latency.
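To make the latency point concrete, here is a minimal sketch of one decode step through a pipeline. The stage times and transfer time are hypothetical numbers chosen for illustration, and the model assumes no overlap between compute and transfer.

```python
def token_latency_ms(stage_ms, transfer_ms):
    """Time for one token to traverse every stage in order.

    Assumes one point-to-point transfer between each pair of neighboring
    stages and no overlap of compute with communication.
    """
    return sum(stage_ms) + transfer_ms * (len(stage_ms) - 1)


# Hypothetical numbers: the same 20 ms of layer work, unsplit vs. split in four.
single_device_ms = token_latency_ms([20.0], transfer_ms=0.25)
pipelined_ms = token_latency_ms([5.0, 5.0, 5.0, 5.0], transfer_ms=0.25)

print(f"single device: {single_device_ms:.2f} ms per token")
print(f"four stages:   {pipelined_ms:.2f} ms per token")
```

Under these assumptions the pipelined token path is slightly longer than the single-device path, because the compute is unchanged and three handoffs are added.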
A worked example
A 48-layer model is split into four 12-layer stages. With one request, the stages mostly work sequentially. With eight microbatches, stage 1 can process microbatch 4 while stage 2 handles 3, stage 3 handles 2, and stage 4 handles 1. Throughput rises after the pipe fills.
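The occupancy pattern in this example can be reproduced with a few lines. This is a sketch of an idealized forward-only schedule in which every stage takes exactly one tick; the function name and the tick abstraction are illustrative, not taken from any engine.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return one row per tick; row[s] is the microbatch (1-indexed) on stage s+1,
    or None when that stage is idle (part of the fill or drain bubble)."""
    total_ticks = num_microbatches + num_stages - 1
    schedule = []
    for tick in range(total_ticks):
        row = []
        for stage in range(num_stages):
            mb = tick - stage  # 0-based microbatch index reaching this stage now
            row.append(mb + 1 if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule


schedule = forward_schedule(num_stages=4, num_microbatches=8)
for tick, row in enumerate(schedule, start=1):
    cells = ["--" if mb is None else f"m{mb}" for mb in row]
    print(f"tick {tick:2d}: " + "  ".join(cells))
```

The fourth row shows microbatches 4, 3, 2, and 1 on stages 1 through 4, which is the moment described above. The rows before it are the fill period and the last three rows are the drain.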
The performance model
For balanced stages in a simple forward pipeline, ideal utilization approaches M / (M + P - 1), where M is the number of microbatches and P is the number of stages. Real systems add transfer time, unequal layer cost, and decode synchronization.
How to read this diagram: The left panel asks how pipeline parallelism changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
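As a quick numerical check of that ratio, the sketch below evaluates it for a four-stage pipeline. It assumes perfectly balanced stages and free transfers, so the values are an upper bound, not a prediction.

```python
def ideal_utilization(num_microbatches, num_stages):
    """Fraction of stage time slots doing useful work in a forward-only pipeline
    with balanced stages: microbatches / (microbatches + stages - 1)."""
    return num_microbatches / (num_microbatches + num_stages - 1)


for m in (1, 4, 8, 32):
    print(f"stages=4, microbatches={m:2d}: {ideal_utilization(m, 4):.2f}")
```

With one microbatch a four-stage pipeline is busy a quarter of the time. The returns diminish as the microbatch count grows, which is why the count is worth sweeping instead of maximizing.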
Expert lens
Layer count is a poor balancing proxy when attention, MoE, embeddings, and final projection have different cost. Profile layer time and activation size, then place boundaries to balance compute and avoid expensive cross-node transfers.
How to read this diagram: The left panel is the baseline, Single-device depth, characterized by all layers on one GPU and no stage transfers. The right panel applies Pipeline parallel depth, changing the cost profile to layer groups on devices and activation handoffs. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.
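Placing boundaries from measured cost can be sketched as a contiguous partition problem: split the ordered layers into P groups so that the slowest group is as fast as possible. The dynamic program below is one simple way to do that. The per-layer costs are made-up values, and the sketch balances compute time only; it ignores activation size and link placement, which a real plan also has to weigh.

```python
def partition_layers(layer_costs, num_stages):
    """Split layers into contiguous stages minimizing the slowest stage's cost.
    Returns (start, end) index pairs, end exclusive."""
    n = len(layer_costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(layer_costs) stages")
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[k][i]: smallest bottleneck when the first i layers form k stages
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i] = bottleneck
                    cut[k][i] = j

    # Walk the recorded cut points backwards to recover the boundaries.
    bounds, end = [], n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        bounds.append((start, end))
        end = start
    return bounds[::-1]


def stage_costs(layer_costs, bounds):
    return [sum(layer_costs[a:b]) for a, b in bounds]


# Hypothetical profile: a cheap embedding, six equal blocks, a heavy output head.
costs = [1, 2, 2, 2, 2, 2, 2, 6]
by_count = [(0, 2), (2, 4), (4, 6), (6, 8)]
by_cost = partition_layers(costs, num_stages=4)
print("equal layer count:", stage_costs(costs, by_count))
print("balanced by cost: ", stage_costs(costs, by_cost))
```

In this toy profile the equal-count split leaves the last stage with a cost of 8, while the cost-aware split isolates the heavy head and brings the bottleneck down to 6.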
Where it wins
- Very deep models that exceed one GPU
- Throughput workloads with enough microbatches
- Topologies with fast adjacent-stage links
Where it disappoints
- Splitting stages by layer count without profiling
- Expecting lower latency for a single request
- Crossing slow nodes at every stage boundary
- Ignoring first- and last-stage imbalance
Production checklist
- Profile per-layer time and activation bytes
- Choose boundaries from measured cost
- Use enough microbatches to amortize bubbles
- Keep adjacent stages on suitable links
- Measure prefill and decode separately
What to measure
- Stage utilization and idle bubble time
- Per-stage latency and memory
- Activation transfer duration
- End-to-end token pipeline latency
- Throughput by microbatch count
From one GPU to a production service
A local pipeline can assume ranks start together. A production scheduler must create the entire gang, place adjacent stages on suitable links, load only each stage’s weights, and expose one logical endpoint. Partial readiness is not readiness.
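The rule that partial readiness is not readiness fits in a one-line check. The sketch below is illustrative only: the stage identifiers and the idea of a set of reported-ready stages are assumptions, not the API of any particular scheduler.

```python
def gang_ready(expected_stages, reported_ready):
    """True only when every expected stage has reported ready.

    A deployment with three of four stages up must not receive traffic,
    because no request can complete without the missing stage.
    """
    return set(expected_stages) <= set(reported_ready)


expected = range(4)                      # hypothetical stage ids 0..3
print(gang_ready(expected, {0, 1, 2}))   # one stage still loading weights
print(gang_ready(expected, {0, 1, 2, 3}))
```

The logical endpoint would be marked healthy only when this check passes, and marked unhealthy again as soon as any stage drops out.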
Traffic classes affect fill. Online requests may not supply enough microbatches to keep many stages busy, while offline jobs can. A shared deployment can reserve a low-latency lane or use separate replicas rather than forcing one schedule to satisfy both.
Model changes can move the bottleneck. Adding MoE layers, a larger vocabulary head, or a vision encoder changes stage balance. Store profiling data with the partition plan and rebuild boundaries when architecture changes.
Design-review questions
- Can the scheduler place all stages atomically?
- Which stage limits steady throughput and peak memory?
- How many microbatches are needed to reach useful utilization?
- Does online latency justify a separate topology?
- How is a failed stage drained and reconstructed?
How it connects to the rest of the series
Tensor parallelism splits within a stage and is often combined with pipeline parallelism. Sequence parallelism reduces activation replication. Expert parallelism adds a second routing topology inside MoE stages.
From equation to implementation
For P pipeline stages and M microbatches, a simple forward-only schedule needs roughly M + P - 1 time slots, so the ideal bubble fraction is about (P - 1)/(M + P - 1). This is only a mental model: unequal stages and activation transfer widen the real bubble.
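The same expression can be turned around to answer a design-review question: how many microbatches does it take to keep the ideal bubble under a target? The sketch below searches for the smallest M and uses exact fractions to avoid floating-point edge cases. The function names are illustrative, and the answer inherits every simplification of the mental model.

```python
from fractions import Fraction


def bubble_fraction(num_microbatches, num_stages):
    """Ideal bubble fraction (P - 1) / (M + P - 1) as an exact fraction."""
    return Fraction(num_stages - 1, num_microbatches + num_stages - 1)


def min_microbatches(num_stages, max_bubble):
    """Smallest M whose ideal bubble fraction does not exceed max_bubble."""
    max_bubble = Fraction(max_bubble)
    if max_bubble <= 0:
        raise ValueError("max_bubble must be positive")
    m = 1
    while bubble_fraction(m, num_stages) > max_bubble:
        m += 1
    return m


for stages in (2, 4, 8):
    needed = min_microbatches(stages, Fraction(1, 10))
    print(f"P={stages}: at least {needed} microbatches for a 10% ideal bubble")
```

The required count grows linearly with the number of stages, so a deeper pipeline needs proportionally more concurrent work to reach the same ideal efficiency.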
Inference has two schedules. Prefill can pipeline prompt microbatches efficiently. Decode sends small activations through every stage once per token, making per-stage launch and transfer latency much more visible. Some engines batch many sequences to restore stage efficiency.
Implementation sketch
profile_each_layer()
partition_layers_to_balance_time_and_memory(P)
for microbatch in input_batch:
    stage0.enqueue(microbatch)
for clock in pipeline_schedule:
    for stage in stages:
        receive_activation_if_ready(stage)
        run_assigned_layers(stage)
        send_activation_to_next_stage(stage)
collect_logits_in_original_request_order()
Capacity planning
Each stage needs its weights, activation buffers for in-flight microbatches, KV state for its layers, and communication workspace. The first and last stages may also own embeddings or vocabulary projection, so equal layer counts rarely mean equal memory.
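A per-stage budget can be written down as the sum of those four terms. The sketch below uses hypothetical sizes purely to show the bookkeeping; real numbers come from the model configuration and from profiling.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class StageMemory:
    weight_bytes: int
    activation_bytes_per_microbatch: int
    in_flight_microbatches: int
    kv_bytes: int
    comm_workspace_bytes: int

    def total_bytes(self):
        activations = self.activation_bytes_per_microbatch * self.in_flight_microbatches
        return self.weight_bytes + activations + self.kv_bytes + self.comm_workspace_bytes


# Hypothetical sizes: a middle stage, and a last stage that has the same layer
# count but also holds a 2 GiB vocabulary projection.
middle = StageMemory(20 * GIB, GIB // 4, 4, 8 * GIB, 1 * GIB)
last = StageMemory(22 * GIB, GIB // 4, 4, 8 * GIB, 1 * GIB)

for name, stage in (("middle", middle), ("last", last)):
    print(f"{name}: {stage.total_bytes() / GIB:.1f} GiB")
```

The stage with the largest total, not the average, decides whether the partition fits on the chosen devices.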
Benchmarking without fooling yourself
- Plot stage utilization on a shared timeline.
- Sweep microbatch count until gains flatten or memory rises sharply.
- Measure first-result latency separately from steady throughput.
- Inject a deliberately slow stage to verify backpressure and queue bounds.
A production failure to design for
One stage contains the final vocabulary projection and takes 35 percent longer than the others. Upstream activation queues grow until memory pressure causes retries. Rebalance using measured stage time and bound inter-stage buffers so imbalance fails visibly.
How to read this diagram: The operating cycle moves from Profile to Partition, then Fill and Operate. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
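The bounded inter-stage buffer from the failure scenario can be illustrated with a small tick-based simulation. It assumes an upstream stage that offers one activation every 20 ticks and a downstream stage that is 35 percent slower, taking one every 27 ticks. The tick model and function name are illustrative; a real engine would block or signal the scheduler instead of counting refusals.

```python
import queue


def run_ticks(buffer, ticks, produce_every, consume_every):
    """Drive a producer and a slower consumer against one inter-stage buffer.
    Returns how many offers the buffer refused because it was full."""
    refused = 0
    for tick in range(ticks):
        if tick % produce_every == 0:
            try:
                buffer.put_nowait(tick)
            except queue.Full:
                refused += 1  # visible backpressure instead of silent growth
        if tick % consume_every == 0 and not buffer.empty():
            buffer.get_nowait()
    return refused


unbounded = queue.Queue()          # maxsize=0 means no limit
bounded = queue.Queue(maxsize=4)   # hypothetical bound of four activations

refused_unbounded = run_ticks(unbounded, 2000, produce_every=20, consume_every=27)
refused_bounded = run_ticks(bounded, 2000, produce_every=20, consume_every=27)

print("unbounded: queued", unbounded.qsize(), "refused", refused_unbounded)
print("bounded:   queued", bounded.qsize(), "refused", refused_bounded)
```

The unbounded buffer never refuses and keeps growing, which is how the imbalance turns into memory pressure. The bounded buffer stays small and produces a refusal count that can be alerted on.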
Deeper engineering guide
Pipeline parallelism assigns consecutive layer ranges to stages and sends activations between them. Microbatches keep stages busy concurrently, but a request still traverses every stage. The slowest stage sets steady-state throughput, while fill and drain bubbles determine efficiency for short runs.
How to read this diagram: Follow the state from Stage 1 through Stage 2 and Stage 3 to Stage 4. Each box is an ownership or computation boundary. In particular, steady throughput is constrained by the slowest stage plus activation transfer. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.
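The throughput constraint named here can be sketched in a few lines. The model assumes the handoff is not overlapped with compute, so the slowest stage's time plus one transfer sets the interval between completed microbatches. The stage times are hypothetical.

```python
def steady_throughput_per_s(stage_ms, transfer_ms):
    """Microbatches completed per second once the pipeline is full.

    Assumes the interval between completions equals the slowest stage's
    compute time plus one activation transfer, with no overlap.
    """
    return 1000.0 / (max(stage_ms) + transfer_ms)


# Hypothetical stage times: one stage 35 percent slower vs. the same total
# work spread evenly over four stages.
unbalanced = steady_throughput_per_s([5.0, 5.0, 5.0, 6.75], transfer_ms=0.25)
balanced = steady_throughput_per_s([5.4375] * 4, transfer_ms=0.25)

print(f"unbalanced: {unbalanced:.1f} microbatches/s")
print(f"balanced:   {balanced:.1f} microbatches/s")
```

Both configurations do the same total work per microbatch, so the difference in throughput comes entirely from where the boundaries sit.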
Balance by measured stage time, not layer count. Attention, MoE blocks, vocabulary projection, and communication can differ sharply. Boundaries also change activation size, so an apparently balanced split may overload a link. Reprofile after quantization, kernel, sequence-length, or hardware changes.
How to read this diagram: The bars compare Unbalanced stages with Balanced pipeline on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (“Layer groups fit across multiple device memories.”) remains larger than the risk (“More microbatches improve fill but increase queueing and memory.”) under production traffic.
Inference scheduling differs from training. Autoregressive decode revisits the pipeline for every token and can interleave many sequences. Each stage needs bounded input buffers and cancellation propagation. A failed downstream stage must release upstream reservations to avoid deadlock.
How to read this diagram: State advances from Queued to Executing, Transferring, and finally Accepted. The labels below each state identify what becomes true at that boundary. The governing invariant is: Ownership transfers exactly once; cancellation follows the same stage sequence. Retries and cancellation must preserve the same transition rules.
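One way to enforce those transition rules is an explicit table of allowed moves. The sketch below uses the four states named above and adds a CANCELLED terminal state as an assumption, since the text says cancellation exists but does not name its state.

```python
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    EXECUTING = "executing"
    TRANSFERRING = "transferring"
    ACCEPTED = "accepted"
    CANCELLED = "cancelled"  # assumed terminal state, not named in the text


# Each state may only move forward by one step or be cancelled.
ALLOWED = {
    State.QUEUED: {State.EXECUTING, State.CANCELLED},
    State.EXECUTING: {State.TRANSFERRING, State.CANCELLED},
    State.TRANSFERRING: {State.ACCEPTED, State.CANCELLED},
    State.ACCEPTED: set(),
    State.CANCELLED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the transition would break the rules."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = State.QUEUED
for nxt in (State.EXECUTING, State.TRANSFERRING, State.ACCEPTED):
    state = advance(state, nxt)
print(state.name)
```

Because ACCEPTED has no outgoing transitions, a retry that tries to hand the same microbatch over twice raises an error instead of duplicating ownership.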
How to read this diagram: The four panels are independent review axes: Partition, Microbatch, Topology, and Scheduler. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Optimize the entire stage graph, not each GPU independently.
How to read this diagram: This is a causal chain, not four unrelated symptoms. “One stage slows” triggers “Upstream blocks”, which creates “Pipeline stalls”. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
The takeaway
Pipeline parallelism is a capacity and throughput tool. It shines when the assembly line stays full and every station takes roughly the same time.
