11/20 - Pipeline Parallelism: Turning Model Depth into an Assembly Line

Tensor parallelism divides every layer. Pipeline parallelism assigns whole groups of layers to different devices. The model becomes an assembly line: activations move forward through stages, and utilization depends on keeping every stage busy.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A factory with four stations can build several products at once, each at a different station. But the first product still has to pass through all four stations before it is finished, and one slow station starves the stations after it while work piles up in front of it.

[Figure: Pipeline parallelism request path. (1) Microbatch: enter stage 1, run early layers. (2) Pipeline handoff: send activations, balance stage work. (3) Final stage: run last layers, return logits. Flow: input, transform, outcome.]
Follow the state and work from left to right.

How to read this diagram: Start with Microbatch, where the input enters stage 1. The middle step, Pipeline handoff, sends activations to the next stage. The last step, Final stage, shows the observable result: the last layers run and logits are returned. The arrows describe dependency order, not necessarily separate services.

What actually happens

The model’s ordered layers are partitioned into stages. The first stage receives the input tokens, executes its layers, and sends the resulting activations to the next stage. Each stage stores only its assigned weights, reducing per-device model memory.

Microbatches fill the pipeline so different requests or batch slices occupy different stages concurrently. Fill and drain periods create a pipeline bubble. More microbatches amortize that bubble but can increase queueing and memory.

In autoregressive inference, every generated token traverses all stages, so point-to-point transfer latency and any stage imbalance are paid again on every token. Pipeline parallelism can make a model fit, but it does not automatically reduce single-request latency.

A worked example

A 48-layer model is split into four 12-layer stages. With one request, the stages mostly work sequentially. With eight microbatches, stage 1 can process microbatch 4 while stage 2 handles 3, stage 3 handles 2, and stage 4 handles 1. Throughput rises after the pipe fills.
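
The schedule in this example can be printed as a small grid. The following is a minimal sketch that assumes perfectly balanced stages and zero transfer time; the function name `forward_schedule` is illustrative and not taken from any library.

```python
from typing import List, Optional


def forward_schedule(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return grid[tick][stage]: the 1-based microbatch id, or None if the stage is idle."""
    ticks = num_microbatches + num_stages - 1
    grid = []
    for tick in range(ticks):
        row = []
        for stage in range(num_stages):
            mb = tick - stage  # 0-based microbatch that reaches this stage at this tick
            row.append(mb + 1 if 0 <= mb < num_microbatches else None)
        grid.append(row)
    return grid


# Four stages and eight microbatches, as in the worked example.
grid = forward_schedule(num_stages=4, num_microbatches=8)
for tick, row in enumerate(grid, start=1):
    cells = " ".join(f"{mb:>2}" if mb is not None else " ." for mb in row)
    print(f"tick {tick:>2}: {cells}")
```

Each printed row is one time step and each column is one stage. A dot marks an idle stage, so the dots at the top and bottom of the grid are the fill and drain bubbles, and the fourth row is the moment where stage 1 holds microbatch 4 while stage 4 holds microbatch 1.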

The performance model

For balanced stages, ideal utilization of a simple forward pipeline approaches M / (M + P - 1), where M is the number of microbatches and P is the number of stages. Real systems add transfer time, unequal layer cost, and decode synchronization.
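
To get a feel for the numbers, the ratio can be evaluated directly. This is only a sketch of the idealized formula; `ideal_utilization` is an illustrative name, and the result ignores transfer time and stage imbalance.

```python
def ideal_utilization(num_microbatches: int, num_stages: int) -> float:
    """Fraction of stage time spent on useful work in a balanced forward-only pipeline."""
    if num_microbatches < 1 or num_stages < 1:
        raise ValueError("need at least one microbatch and one stage")
    return num_microbatches / (num_microbatches + num_stages - 1)


# Sweep the microbatch count for a four-stage pipeline.
for m in (1, 4, 8, 32):
    print(f"microbatches={m:>2}, stages=4 -> utilization {ideal_utilization(m, 4):.2f}")
```

With four stages, a single microbatch keeps each stage busy a quarter of the time, while eight microbatches reach roughly 73 percent. The curve flattens as the microbatch count grows, which is why adding microbatches eventually buys little utilization while still costing memory.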

[Figure: Where pipeline parallelism changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; microbatches fill model stages. Decode: one new token per iteration; weight and KV bandwidth pressure; every token revisits all stages. Prove it with: stage time, bubbles, TTFT, TPOT. Deployment decision: partition by measured stage cost.]
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how pipeline parallelism changes prompt processing and TTFT; the right panel asks how it changes iterative generation and inter-token latency. The bottom row names the metrics that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Layer count is a poor balancing proxy when attention, MoE, embeddings, and final projection have different cost. Profile layer time and activation size, then place boundaries to balance compute and avoid expensive cross-node transfers.
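
One way to turn measured layer times into boundaries is a small dynamic program that minimizes the slowest stage. The sketch below is an assumption-laden illustration, not a production partitioner: it balances a single cost number per layer, ignores activation size and link placement, and uses hypothetical per-layer timings.

```python
from functools import lru_cache
from typing import List, Tuple


def balance_stages(layer_costs: List[float], num_stages: int) -> List[Tuple[int, int]]:
    """Split layers into contiguous stages so the most expensive stage is as cheap as possible.

    Returns half-open (start, end) layer index ranges, one per stage.
    """
    n = len(layer_costs)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and the number of layers")
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    @lru_cache(maxsize=None)
    def best(end: int, stages: int) -> Tuple[float, int]:
        """(bottleneck cost, start index of the last stage) for layers[:end] in `stages` stages."""
        if stages == 1:
            return prefix[end], 0
        result = (float("inf"), -1)
        for cut in range(stages - 1, end):
            bottleneck = max(best(cut, stages - 1)[0], prefix[end] - prefix[cut])
            if bottleneck < result[0]:
                result = (bottleneck, cut)
        return result

    # Walk backwards from the last stage to recover the chosen boundaries.
    bounds = []
    end = n
    for stages in range(num_stages, 0, -1):
        start = best(end, stages)[1]
        bounds.append((start, end))
        end = start
    return bounds[::-1]


# Hypothetical timings in ms: five uniform blocks plus a heavier final projection.
costs = [2.0, 2.0, 2.0, 2.0, 2.0, 6.0]
bounds = balance_stages(costs, num_stages=2)
stage_costs = [sum(costs[a:b]) for a, b in bounds]
print(bounds, stage_costs)
```

For these made-up timings, splitting by layer count (three layers per stage) would give stage costs of 6 ms and 10 ms, whereas the measured-cost split puts four layers in the first stage and two in the second, at 8 ms each.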

[Figure: Pipeline parallelism, the tradeoff. Baseline, single-device depth: all layers on one GPU; no stage transfers; no pipeline bubble; limited weight capacity. Optimized, pipeline-parallel depth: layer groups on devices; activation handoffs; needs microbatches to fill; stage balance is critical. Measure both sides under the same workload.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, Single-device depth, characterized by all layers on one GPU and no stage transfers. The right panel applies Pipeline parallel depth, changing the cost profile to layer groups on devices and activation handoffs. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Very deep models that exceed one GPU
  • Throughput workloads with enough microbatches
  • Topologies with fast adjacent-stage links

Where it disappoints

  • Splitting stages by layer count without profiling
  • Expecting lower latency for a single request
  • Crossing slow nodes at every stage boundary
  • Ignoring the first and last stage imbalance

Production checklist

  • Profile per-layer time and activation bytes
  • Choose boundaries from measured cost
  • Use enough microbatches to amortize bubbles
  • Keep adjacent stages on suitable links
  • Measure prefill and decode separately

What to measure

  • Stage utilization and idle bubble time
  • Per-stage latency and memory
  • Activation transfer duration
  • End-to-end token pipeline latency
  • Throughput by microbatch count

From one GPU to a production service

A local pipeline can assume ranks start together. A production scheduler must create the entire gang, place adjacent stages on suitable links, load only each stage’s weights, and expose one logical endpoint. Partial readiness is not readiness.

Traffic classes affect fill. Online requests may not supply enough microbatches to keep many stages busy, while offline jobs can. A shared deployment can reserve a low-latency lane or use separate replicas rather than forcing one schedule to satisfy both.

Model changes can move the bottleneck. Adding MoE layers, a larger vocabulary head, or a vision encoder changes stage balance. Store profiling data with the partition plan and rebuild boundaries when architecture changes.

Design-review questions

  • Can the scheduler place all stages atomically?
  • Which stage limits steady throughput and peak memory?
  • How many microbatches are needed to reach useful utilization?
  • Does online latency justify a separate topology?
  • How is a failed stage drained and reconstructed?

How it connects to the rest of the series

Tensor parallelism splits within a stage and is often combined with pipeline parallelism. Sequence parallelism reduces activation replication. Expert parallelism adds a second routing topology inside MoE stages.

From equation to implementation

For P pipeline stages and M microbatches, a simple forward-only schedule needs roughly M + P - 1 time slots, each as long as one stage step, so the ideal bubble fraction is about (P - 1)/(M + P - 1). This is only a mental model: unequal stages and activation transfer widen the real bubble.
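
The effect of unequal stages can be estimated with a slightly richer model. The sketch below assumes identical microbatches, unbounded buffers between stages, and zero transfer time, in which case the total run time is the sum of the stage times plus (M - 1) times the slowest stage. The stage times are hypothetical and the function names are illustrative.

```python
from typing import Sequence


def pipeline_makespan(stage_times: Sequence[float], num_microbatches: int) -> float:
    """Time to push all microbatches through a forward-only pipeline (idealized)."""
    return sum(stage_times) + (num_microbatches - 1) * max(stage_times)


def bubble_fraction(stage_times: Sequence[float], num_microbatches: int) -> float:
    """Share of total stage time that is idle over the whole run."""
    busy = num_microbatches * sum(stage_times)
    capacity = len(stage_times) * pipeline_makespan(stage_times, num_microbatches)
    return 1.0 - busy / capacity


balanced = [1.0, 1.0, 1.0, 1.0]
skewed = [1.0, 1.0, 1.0, 1.35]  # hypothetical: the last stage is 35 percent slower
for name, times in (("balanced", balanced), ("skewed", skewed)):
    print(f"{name}: bubble fraction {bubble_fraction(times, 8):.3f}")
```

With balanced stages the function reduces to (P - 1)/(M + P - 1). With one slower stage the bubble grows, because every other stage waits on the slow one for each microbatch and not just during fill and drain.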

Inference has two schedules. Prefill can pipeline prompt microbatches efficiently. Decode sends small activations through every stage once per token, making per-stage launch and transfer latency much more visible. Some engines batch many sequences to restore stage efficiency.
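
A back-of-the-envelope model makes the decode cost concrete. This sketch assumes a single sequence with no overlap between stages, so one token pays every stage's compute, every inter-stage transfer, and a fixed launch overhead per stage. All numbers are hypothetical placeholders, not measurements.

```python
from typing import Sequence


def decode_step_latency_ms(
    stage_compute_ms: Sequence[float],
    transfer_ms: float,
    launch_overhead_ms: float,
) -> float:
    """Latency of one decode step for one sequence crossing every stage in order."""
    num_stages = len(stage_compute_ms)
    return (
        sum(stage_compute_ms)
        + (num_stages - 1) * transfer_ms      # one handoff between each pair of stages
        + num_stages * launch_overhead_ms     # fixed cost paid at every stage
    )


# Hypothetical four-stage deployment.
latency = decode_step_latency_ms([2.0, 2.0, 2.0, 2.0], transfer_ms=0.1, launch_overhead_ms=0.05)
print(f"per-token latency: {latency:.2f} ms")
```

The transfer and launch terms scale with the number of stages and are paid on every token, which is why they matter more in decode than in prefill, where the same overhead is spread over many prompt tokens.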

Implementation sketch

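# Offline: measure each layer, then choose P stage boundaries.
profile_each_layer()
partition_layers_to_balance_time_and_memory(P)
# Online: feed microbatches into the first stage.
for microbatch in input_batch:
    stage0.enqueue(microbatch)
# On each clock tick, every stage that has input runs once.
for clock in pipeline_schedule:
    for stage in stages:
        receive_activation_if_ready(stage)
        run_assigned_layers(stage)
        # The result becomes visible to the next stage on the following tick.
        send_activation_to_next_stage(stage)
collect_logits_in_original_request_order()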
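
The sketch above can be made runnable as a toy, single-process simulation. In the version below, plain Python functions stand in for layer groups and in-memory queues stand in for device-to-device transfers; `run_pipeline` is an illustrative name, and none of this reflects a particular engine's API.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple


def run_pipeline(stage_fns: Sequence[Callable], microbatches: Sequence) -> Tuple[List, int]:
    """Clocked forward pipeline: each stage handles at most one microbatch per tick.

    Returns the outputs in original request order and the number of ticks used.
    """
    num_stages = len(stage_fns)
    inboxes = [deque() for _ in range(num_stages)]
    inboxes[0].extend(enumerate(microbatches))  # (request index, payload)
    outputs = {}
    ticks = 0
    while len(outputs) < len(microbatches):
        ticks += 1
        # Walk stages last to first so data sent this tick is consumed on the next tick.
        for stage in reversed(range(num_stages)):
            if not inboxes[stage]:
                continue  # idle slot: a pipeline bubble
            index, payload = inboxes[stage].popleft()
            result = stage_fns[stage](payload)
            if stage + 1 < num_stages:
                inboxes[stage + 1].append((index, result))
            else:
                outputs[index] = result
    return [outputs[i] for i in range(len(microbatches))], ticks


# Four toy "stages" and eight toy "microbatches".
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
results, ticks = run_pipeline(stages, list(range(8)))
print(results, ticks)
```

Iterating the stages in reverse within a tick is the design choice that matters here: it prevents a microbatch from crossing several stages in a single tick, so the simulated run takes M + P - 1 ticks like the schedule it models.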

Capacity planning

Each stage needs its weights, activation buffers for in-flight microbatches, KV state for its layers, and communication workspace. The first and last stages may also own embeddings or vocabulary projection, so equal layer counts rarely mean equal memory.
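
The terms in this paragraph add up per stage, which a small record type makes explicit. The field names and byte counts below are hypothetical and only illustrate the bookkeeping; real budgets would come from profiling.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
MIB = 1024 ** 2


@dataclass
class StageMemory:
    weight_bytes: int
    activation_bytes_per_microbatch: int
    in_flight_microbatches: int
    kv_bytes: int
    comm_workspace_bytes: int
    extra_bytes: int = 0  # embeddings or vocabulary projection, if this stage owns them

    def total_bytes(self) -> int:
        return (
            self.weight_bytes
            + self.activation_bytes_per_microbatch * self.in_flight_microbatches
            + self.kv_bytes
            + self.comm_workspace_bytes
            + self.extra_bytes
        )


# Hypothetical budgets: a middle stage and a last stage with the same layer count.
middle = StageMemory(10 * GIB, 256 * MIB, 8, 4 * GIB, 1 * GIB)
last = StageMemory(10 * GIB, 256 * MIB, 8, 4 * GIB, 1 * GIB, extra_bytes=2 * GIB)
for name, stage in (("middle", middle), ("last", last)):
    print(f"{name}: {stage.total_bytes() / GIB:.1f} GiB")
```

Even with identical layer counts, the stage that owns the vocabulary projection needs more memory in this example, and the activation term grows with every additional in-flight microbatch.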

Benchmarking without fooling yourself

  • Plot stage utilization on a shared timeline.
  • Sweep microbatch count until gains flatten or memory rises sharply.
  • Measure first-result latency separately from steady throughput.
  • Inject a deliberately slow stage to verify backpressure and queue bounds.

A production failure to design for

One stage contains the final vocabulary projection and takes 35 percent longer than the others. Upstream activation queues grow until memory pressure causes retries. Rebalance using measured stage time and bound inter-stage buffers so imbalance fails visibly.
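
The difference between an unbounded and a bounded inter-stage buffer shows up even in a very coarse model. The sketch below assumes a two-stage handoff on a tick clock, where the upstream stage produces one item per tick and the downstream stage drains one item every `consumer_period` ticks. It is a simplification chosen for illustration, not a model of the 35 percent case above.

```python
from typing import Optional, Tuple


def simulate_handoff(ticks: int, consumer_period: int, capacity: Optional[int] = None) -> Tuple[int, int]:
    """Fast upstream stage feeding a slower downstream stage through one buffer.

    Returns (peak queue depth, number of ticks the upstream stage was blocked).
    """
    depth = peak = blocked = 0
    for tick in range(1, ticks + 1):
        # Downstream drains one item every `consumer_period` ticks.
        if tick % consumer_period == 0 and depth > 0:
            depth -= 1
        # Upstream tries to hand off one item per tick.
        if capacity is None or depth < capacity:
            depth += 1
            peak = max(peak, depth)
        else:
            blocked += 1  # backpressure: upstream stalls instead of buffering more
    return peak, blocked


print("unbounded:", simulate_handoff(ticks=100, consumer_period=2))
print("bounded:  ", simulate_handoff(ticks=100, consumer_period=2, capacity=4))
```

In the unbounded run the queue keeps growing for as long as the imbalance lasts, which is the memory-pressure path. In the bounded run the queue stays at its cap and the imbalance appears as blocked upstream ticks, a signal that can be alerted on before memory runs out.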

[Figure: Operational loop. (1) Profile: layer time and memory, activation size. (2) Partition: balance stages, place on topology. (3) Fill: choose microbatches, bound buffers. (4) Operate: watch bubble and skew, rebalance on rollout. Cycle: measure, learn, repeat.]
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Profile to Partition, then Fill and Operate. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Pipeline parallelism assigns consecutive layer ranges to stages and sends activations between them. Microbatches keep stages busy concurrently, but a request still traverses every stage. The slowest stage sets steady-state throughput, while fill and drain bubbles determine efficiency for short runs.

[Figure: Microbatches through a model pipeline. Stage 1: embedding and early layers; microbatch A then B; send activations. Stage 2: middle layer group; receive A while B waits; balance compute. Stage 3: late layer group; overlap different batches; bound buffers. Stage 4: final norm and logits; return token scores; drain pipeline. Note: steady throughput is constrained by the slowest stage plus activation transfer.]
Pipelining overlaps requests or microbatches across model depth.

How to read this diagram: Follow the state from Stage 1 through Stage 2 and Stage 3 to Stage 4. Each box is an ownership or computation boundary. In particular, steady throughput is constrained by the slowest stage plus activation transfer. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Balance by measured stage time, not layer count. Attention, MoE blocks, vocabulary projection, and communication can differ sharply. Boundaries also change activation size, so an apparently balanced split may overload a link. Reprofile after quantization, kernel, sequence-length, or hardware changes.

[Figure: Pipeline efficiency depends on bubble fraction. Bars: unbalanced stages (idle bubbles) versus balanced pipeline (higher utilization). Latency tradeoff: more microbatches improve fill but increase queueing and memory. Capacity gain: layer groups fit across multiple device memories.]
The best microbatch count depends on latency SLO and stage balance.

How to read this diagram: The bars compare Unbalanced stages with Balanced pipeline on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain (layer groups fit across multiple device memories) remains larger than the risk (more microbatches improve fill but increase queueing and memory) under production traffic.

Inference scheduling differs from training. Autoregressive decode revisits the pipeline for every token and can interleave many sequences. Each stage needs bounded input buffers and cancellation propagation. A failed downstream stage must release upstream reservations to avoid deadlock.

[Figure: A microbatch inside the pipeline. States: Queued (stage buffer owns it), Executing (local layers run), Transferring (activation in flight), Accepted (next stage owns it). Invariant: ownership transfers exactly once; cancellation follows the same stage sequence.]
Explicit handoff state prevents duplicate work and stranded buffers.

How to read this diagram: State advances from Queued to Executing, Transferring, and finally Accepted. The labels below each state identify what becomes true at that boundary. The governing invariant is: Ownership transfers exactly once; cancellation follows the same stage sequence. Retries and cancellation must preserve the same transition rules.
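
The handoff states can be enforced with a small transition table. The sketch below is one possible encoding, not a description of any engine: the `CANCELLED` state and the `Microbatch` class are assumptions added for illustration, and ownership is modeled as a stage index that changes only on acceptance.

```python
from enum import Enum, auto


class HandoffState(Enum):
    QUEUED = auto()
    EXECUTING = auto()
    TRANSFERRING = auto()
    ACCEPTED = auto()
    CANCELLED = auto()  # assumption: cancellation is modeled as a terminal state


# Legal next states for each state; terminal states allow none.
_ALLOWED = {
    HandoffState.QUEUED: {HandoffState.EXECUTING, HandoffState.CANCELLED},
    HandoffState.EXECUTING: {HandoffState.TRANSFERRING, HandoffState.CANCELLED},
    HandoffState.TRANSFERRING: {HandoffState.ACCEPTED, HandoffState.CANCELLED},
    HandoffState.ACCEPTED: set(),
    HandoffState.CANCELLED: set(),
}


class Microbatch:
    def __init__(self, owner_stage: int) -> None:
        self.state = HandoffState.QUEUED
        self.owner_stage = owner_stage

    def advance(self, new_state: HandoffState) -> None:
        if new_state not in _ALLOWED[self.state]:
            raise RuntimeError(f"illegal transition {self.state.name} -> {new_state.name}")
        if new_state is HandoffState.ACCEPTED:
            self.owner_stage += 1  # ownership moves exactly once, on acceptance
        self.state = new_state


mb = Microbatch(owner_stage=0)
for step in (HandoffState.EXECUTING, HandoffState.TRANSFERRING, HandoffState.ACCEPTED):
    mb.advance(step)
print(mb.state.name, mb.owner_stage)
```

Because `ACCEPTED` has no outgoing transitions, a retried or duplicated accept raises instead of silently moving ownership a second time, which is the failure the invariant is meant to rule out.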

[Figure: Pipeline design has four coupled variables. Partition: measured layer time, activation boundaries. Microbatch: bubble reduction, memory and latency. Topology: adjacent-stage links, failure domains. Scheduler: prefill/decode ordering, backpressure propagation. Note: optimize the entire stage graph, not each GPU independently.]
A pipeline is stable only when every stage can shed pressure upstream.

How to read this diagram: The four panels are coupled review axes: Partition, Microbatch, Topology, and Scheduler. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Optimize the entire stage graph, not each GPU independently.

[Figure: Stage imbalance becomes a queue cascade. One stage slows: projection cost rises, input buffer fills. Upstream blocks: activations cannot move, GPU memory stays held. Pipeline stalls: bubbles propagate, requests time out. Control: bound every buffer, repartition and drain. Note: per-stage queue depth identifies imbalance earlier than end-to-end latency.]
Backpressure must cross stage boundaries before memory is exhausted.

How to read this diagram: This is a causal chain, not four unrelated symptoms. One slow stage causes upstream stages to block, which in turn stalls the pipeline. The Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Pipeline parallelism is a capacity and throughput tool. It shines when the assembly line stays full and every station takes roughly the same time.