12/20 - Sequence Parallelism: Divide the Tokens, Not the Meaning

Long sequences create large activations even when model weights fit comfortably. Sequence parallelism reduces that cost by splitting token positions across devices instead of replicating every activation on every rank. The phrase is overloaded, however, and confusing two different techniques leads to bad sizing decisions.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A long document can be divided among reviewers for local editing, but any question comparing distant paragraphs requires information exchange. Token-local operations are easy to split; attention is the conversation across partitions.

[Diagram: Mechanism flow. Sequence Parallelism: request path. Stage 01, Long token sequence: shard token positions; local token operations. Stage 02, Communication point: gather needed context; run attention safely. Stage 03, Sharded output: reduce or retain shards; continue next layer. Input → Transform → Outcome.]
Follow the state and work from left to right.

How to read this diagram: Start with Long token sequence, where token positions are sharded. The middle stage, Communication point, gathers the context that attention needs. The final stage, Sharded output, shows the observable result: shards are either reduced or retained. The arrows describe dependency order, not necessarily separate services.

What actually happens

In Megatron-style sequence parallelism paired with tensor parallelism, operations such as LayerNorm and Dropout are computed on sequence shards. Reduce-scatter and all-gather replace replicated activation storage around tensor-parallel regions.

Context parallelism goes further: it partitions the sequence for all modules, including attention. Queries may stay local, but attention needs keys and values from the full causal context, requiring ring exchange or all-gather-like communication.

For inference, prefill benefits most because many prompt tokens are processed together. Decode introduces one new position at a time, so communication patterns and KV ownership can dominate. State layout must remain compatible with the serving engine.

A worked example

A prompt has 32,768 tokens and activation width 8,192. Splitting the sequence four ways gives each rank 8,192 positions for token-local operations. Attention still needs access to the relevant K and V context, so ranks exchange blocks rather than pretending the sequence is independent.
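The arithmetic behind this example can be sketched in a few lines. The function name `activation_bytes_per_rank` is illustrative, and the 2-byte element size assumes bf16 or fp16 activations, which the example above does not specify.

```python
def activation_bytes_per_rank(seq_len: int, width: int, degree: int,
                              bytes_per_elem: int = 2) -> int:
    """Bytes for one [tokens, width] activation tensor held by one rank.

    bytes_per_elem=2 is an assumption (bf16/fp16 activations).
    """
    if seq_len % degree != 0:
        raise ValueError("seq_len must be divisible by the sharding degree")
    local_tokens = seq_len // degree
    return local_tokens * width * bytes_per_elem


full = activation_bytes_per_rank(32_768, 8_192, degree=1)
sharded = activation_bytes_per_rank(32_768, 8_192, degree=4)
print(full // 2**20, "MiB if replicated")      # 512 MiB if replicated
print(sharded // 2**20, "MiB per rank at 4x")  # 128 MiB per rank at 4x
```

This counts a single activation tensor; a real layer holds several such tensors, so the per-rank saving scales with however many of them live in sharded regions.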

The performance model

Activation memory falls roughly with the sequence-parallel degree for the sharded regions. Communication grows with boundaries and attention context exchange. The best degree depends on sequence length, hidden size, topology, and overlap.
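A rough way to see the two opposing terms is a back-of-envelope estimator. This is a sketch under stated assumptions, not a calibrated model: `estimate`, `Estimate`, and the `kv_width` of 1,024 are hypothetical, elements are assumed to be 2 bytes, and the ring is assumed to forward every K/V block without skipping causally invisible ones.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    activation_bytes: int  # one [local_tokens, hidden] tensor on one rank
    kv_bytes_sent: int     # K and V bytes one rank sends per layer in a ring


def estimate(seq_len: int, hidden: int, kv_width: int, degree: int,
             bytes_per_elem: int = 2) -> Estimate:
    local_tokens = seq_len // degree
    activation = local_tokens * hidden * bytes_per_elem
    # One K block plus one V block for the locally owned tokens.
    kv_block = 2 * local_tokens * kv_width * bytes_per_elem
    # In a simple ring, each rank forwards a block (degree - 1) times.
    return Estimate(activation, kv_block * (degree - 1))


for degree in (1, 2, 4, 8):
    e = estimate(32_768, 8_192, 1_024, degree)
    print(f"degree={degree}: "
          f"{e.activation_bytes // 2**20} MiB activations, "
          f"{e.kv_bytes_sent // 2**20} MiB KV sent per layer")
```

Activation bytes halve with each doubling of the degree, while bytes sent per rank approach the full K/V size. Whether that traffic is hidden behind compute depends on the fabric and on overlap, which this estimate deliberately ignores.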

[Diagram: Phase fit. Where sequence parallelism changes inference. Prefill: many prompt tokens in parallel; high arithmetic intensity; shards long prompt activations. Decode: one new token per iteration; weight and KV bandwidth pressure; often smaller benefit per new token. Prove it with: activation bytes and collectives. Deployment decision: enable above the memory crossover.]
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how sequence parallelism changes prompt processing and time to first token (TTFT); the right panel asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Be explicit in design documents: Megatron sequence parallelism and full context parallelism are not synonyms. The former shards selected activations around TP; the latter shards the complete sequence and changes attention communication.

[Diagram: Trade-off map. Sequence Parallelism: the tradeoff. Baseline, Replicated sequence: every rank holds token activations; simple local operations; high activation memory; no sequence collectives. Optimized, Sequence-sharded execution: ranks own token ranges; lower activation memory; context exchange for attention; topology-sensitive overlap. Measure both sides under the same workload.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, Replicated sequence, in which every rank holds all token activations and operations stay simple and local. The right panel applies Sequence-sharded execution, which changes the cost profile: ranks own token ranges and activation memory falls, at the price of context exchange for attention. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Long-context prefill
  • Tensor-parallel models with large activation memory
  • Fast interconnects that can overlap exchange

Where it disappoints

  • Calling context parallelism and sequence parallelism identical
  • Ignoring causal attention communication
  • Assuming decode benefits match prefill
  • Mismatching KV ownership after sharding

Production checklist

  • Name the exact sequence-sharding variant
  • Model activation memory by layer
  • Profile context exchange on target fabric
  • Validate causal masks across partitions
  • Test variable and packed sequence lengths

What to measure

  • Activation bytes per rank
  • All-gather and reduce-scatter time
  • Attention context-exchange bandwidth
  • Prefill latency by context length
  • Load imbalance from variable sequences

From one GPU to a production service

A proof of concept usually uses equal fixed-length sequences. A service sees padding, packed samples, multimodal spans, and cache hits. Partition metadata must carry true lengths and boundaries so distributed attention does not compute or communicate meaningless tokens.

Topology determines the maximum useful degree. Splitting sequence across nodes reduces activation memory but may exchange the full KV context over a slower fabric. Sometimes a lower context-parallel degree plus FlashAttention is faster.

Routing and admission should recognize long-context cost. A request that needs CP8 cannot be sent to a replica configured for CP2, and building every replica for the largest context wastes resources. Expose context tiers as explicit model routes.
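Explicit context tiers can be as simple as a lookup from prompt length to route. The sketch below is illustrative only: the tier limits and route names (`model-cp1`, `model-cp2`, `model-cp8`) are hypothetical, and real limits should come from measured memory and latency on the target hardware.

```python
from typing import Optional

# Hypothetical tiers: (max prompt tokens, route name), smallest first.
TIERS = [
    (32_768, "model-cp1"),
    (131_072, "model-cp2"),
    (524_288, "model-cp8"),
]


def pick_route(prompt_tokens: int) -> Optional[str]:
    """Return the smallest tier that fits, or None to reject at admission."""
    for max_tokens, route in TIERS:
        if prompt_tokens <= max_tokens:
            return route
    return None


print(pick_route(4_000))    # model-cp1
print(pick_route(200_000))  # model-cp8
print(pick_route(900_000))  # None
```

Choosing the smallest tier that fits keeps short requests off the expensive long-context replicas; returning `None` makes rejection an explicit admission decision rather than an out-of-memory failure later.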

Design-review questions

  • Is this Megatron SP or full context parallelism?
  • Which activations and KV blocks are sharded?
  • How are packed boundaries preserved across ranks?
  • What context lengths justify the communication?
  • Are separate long-context replicas more efficient?

How it connects to the rest of the series

Tensor parallelism creates the activation replication that Megatron sequence parallelism reduces. FlashAttention optimizes local attention IO, while context parallelism distributes long attention across devices.

From equation to implementation

Megatron sequence parallelism commonly replaces replicated activations around tensor-parallel regions with reduce-scatter in one direction and all-gather in the other. It does not necessarily shard the attention context itself. Full context parallelism partitions all sequence activations and must exchange K and V information for attention.
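The Megatron-style layout transitions can be checked in a single process. The following is a minimal sketch that uses plain Python lists in place of tensors and real collectives; the helper names (`all_gather_seq`, `reduce_scatter_seq`), the two-rank setup, and the small integer matrices are assumptions for illustration, and the nonlinearity between the two linear layers is omitted.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def all_gather_seq(shards):
    """Every rank receives the concatenation of all sequence shards."""
    full = [row for shard in shards for row in shard]
    return [full for _ in shards]


def reduce_scatter_seq(partials):
    """Sum partial outputs across ranks; rank r keeps only its token rows."""
    n = len(partials)
    rows = len(partials[0]) // n
    summed = [[sum(vals) for vals in zip(*rows_at)]
              for rows_at in zip(*partials)]
    return [summed[r * rows:(r + 1) * rows] for r in range(n)]


# 4 tokens, hidden width 2, feed-forward width 4, 2 tensor-parallel ranks.
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
A = [[1, 0, 2, 1], [0, 1, 1, 3]]      # first linear, split by columns
B = [[1, 2], [0, 1], [3, 0], [1, 1]]  # second linear, split by rows

x_shards = [X[:2], X[2:]]             # token-sharded input
a_shards = [[row[:2] for row in A], [row[2:] for row in A]]
b_shards = [B[:2], B[2:]]

gathered = all_gather_seq(x_shards)   # token-sharded -> full sequence
partials = [matmul(matmul(gathered[r], a_shards[r]), b_shards[r])
            for r in range(2)]        # each rank holds a partial sum
out_shards = reduce_scatter_seq(partials)  # partial sums -> token-sharded

reference = matmul(matmul(X, A), B)
assert out_shards[0] + out_shards[1] == reference
```

The all-gather and reduce-scatter pair moves the same total data as the all-reduce it replaces, but between the two collectives each rank stores only its own token rows outside the tensor-parallel region.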

For causal attention, rank r needs keys and values from its own and earlier token partitions. Ring implementations circulate KV blocks while each rank accumulates partial attention using numerically stable online softmax. Communication can overlap local attention computation.

Implementation sketch

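def forward(tokens, cp_rank):
    # Each rank keeps only its own contiguous slice of token positions.
    tokens_local = shard_sequence(tokens, cp_rank)
    for layer in transformer:
        local = layernorm(tokens_local)
        q_local, k_local, v_local = project(local)
        state = init_online_softmax()
        # K/V blocks circulate around the ring; blocks from later positions are skipped.
        for kv_block in ring_exchange(k_local, v_local):
            if block_is_causally_visible(kv_block):
                # The rank's own block still needs a per-token causal mask inside attend().
                state.update(attend(q_local, kv_block))
        # The residual path adds the attention output back onto the layer input.
        tokens_local = mlp_and_residual(tokens_local, state.output)
    return return_sharded_or_gather(tokens_local)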
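The attention part of the sketch can be made runnable in one process. The version below is a simulation, not a distributed implementation: ranks are loop iterations, the ring order `(rank - step) % degree` is one possible schedule, and the function names are illustrative. It covers a single head with no projections, and checks the ring result against a single-rank causal reference.

```python
import math
import random


def full_causal_attention(q, k, v):
    """Single-rank reference: token i attends to tokens 0..i."""
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(w[j] * v[j][c] for j in range(i + 1)) / total
                    for c in range(len(v[0]))])
    return out


def ring_causal_attention(q, k, v, degree):
    """Simulate `degree` ranks, each owning a contiguous block of tokens."""
    n = len(q) // degree
    dim = len(v[0])
    out = []
    for rank in range(degree):
        for q_local in range(n):
            qi = rank * n + q_local           # global query position
            m, denom, acc = -math.inf, 0.0, [0.0] * dim
            # Step s delivers the KV block owned by rank (rank - s) % degree.
            for step in range(degree):
                src = (rank - step) % degree
                if src > rank:
                    continue                  # block lies entirely in the future
                for kj in range(src * n, (src + 1) * n):
                    if kj > qi:
                        continue              # causal mask inside the own block
                    s = sum(a * b for a, b in zip(q[qi], k[kj]))
                    m_new = max(m, s)
                    scale = math.exp(m - m_new)  # rescale earlier partial sums
                    p = math.exp(s - m_new)
                    denom = denom * scale + p
                    acc = [a * scale + p * x for a, x in zip(acc, v[kj])]
                    m = m_new
            out.append([a / denom for a in acc])
    return out


rng = random.Random(0)


def rand(n, d):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]


q, k, v = rand(8, 4), rand(8, 4), rand(8, 4)
ref = full_causal_attention(q, k, v)
ring = ring_causal_attention(q, k, v, degree=4)
max_err = max(abs(a - b) for r1, r2 in zip(ref, ring) for a, b in zip(r1, r2))
assert max_err < 1e-9
```

The running maximum `m` and the `scale` factor are what let each rank fold in one K/V block at a time without ever holding the full score row, which is the property the ring schedule relies on.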

Capacity planning

Compute activation bytes per rank for every region, then add communication buffers, which may be doubled when ring transfers are double-buffered. Higher sharding degrees reduce local memory but increase the total KV exchanged. The break-even point shifts with the fabric (NVLink versus InfiniBand) and with sequence length.

Benchmarking without fooling yourself

  • Test Megatron SP and full CP as distinct configurations.
  • Sweep context length while holding total tokens constant.
  • Profile communication-compute overlap, not only collective totals.
  • Include packed variable-length sequences and causal masks.

A production failure to design for

A packed batch places document boundaries inside sequence shards, but the distributed attention mask loses one boundary offset. Tokens attend across documents only when a boundary crosses ranks. Build mask tests with tiny hand-checkable packed sequences.
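A hand-checkable mask test might look like the following. This is a sketch with illustrative names (`reference_mask`, `shard_mask`, `assemble`), and the `offset_bug` flag injects a deliberately simple bug, dropping the rank offset, as a stand-in for the boundary-offset failure described above rather than a reproduction of any specific incident.

```python
def reference_mask(doc_ids):
    """allowed[i][j] is True when query i may attend to key j."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]


def shard_mask(doc_ids, rank, degree, offset_bug=False):
    """Mask rows for the queries owned by `rank`, over all global keys."""
    n = len(doc_ids)
    local = n // degree
    start = rank * local
    rows = []
    for q_local in range(local):
        # Correct code uses the global query index; the bug drops the offset.
        qi = q_local if offset_bug else start + q_local
        rows.append([j <= qi and doc_ids[qi] == doc_ids[j] for j in range(n)])
    return rows


def assemble(doc_ids, degree, offset_bug=False):
    """Stack every rank's rows back into one global mask."""
    return [row for r in range(degree)
            for row in shard_mask(doc_ids, r, degree, offset_bug)]


# Three packed documents in 8 tokens; document 1 straddles the rank boundary.
doc_ids = [0, 0, 0, 1, 1, 2, 2, 2]
ref = reference_mask(doc_ids)

assert ref[4][3] is True    # same document, across the rank boundary
assert ref[5][4] is False   # different documents must not attend
assert assemble(doc_ids, degree=2) == ref
assert assemble(doc_ids, degree=2, offset_bug=True) != ref
```

Eight tokens are few enough to verify the reference mask by eye, and placing a document across the rank boundary exercises exactly the case that fixed-length, single-document tests never reach.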

[Diagram: Operating loop. 1 Define: SP versus full CP; ownership by rank. 2 Model: activation and KV bytes; communication path. 3 Verify: causal packed masks; numerical parity. 4 Tune: ring overlap and degree; context-length policy. Measure → Learn → Repeat.]
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Define to Model, then Verify and Tune. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first. Otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

Sequence parallelism shards token positions for operations whose state would otherwise be replicated across tensor-parallel ranks. Layer normalization, dropout, residual paths, and some attention variants can operate on local token slices, reducing activation memory. Collective transitions restore the layouts required by sharded linear layers.

[Diagram: Sequence-sharded activation flow. Partition tokens: split sequence positions; keep hidden width; assign local slice. Local ops: norm and residual; process local tokens; reduce memory. Transition: gather or reduce-scatter; change shard layout; synchronize ranks. Continue: run sharded linear; return token slices; repeat by block. Note: Layout transitions must match the tensor-parallel contract at every boundary.]
Sequence parallelism reduces replicated activation state while preserving model semantics.

How to read this diagram: Follow the state from Partition tokens through Local ops and Transition to Continue. Each box is an ownership or computation boundary. In particular, layout transitions must match the tensor-parallel contract at every boundary. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

The memory gain grows with sequence length, batch size, and hidden width. Communication can dominate for short sequences or poorly fused transitions. Context parallelism extends the idea to attention over very long sequences, where ranks must exchange K/V context or use ring-style algorithms.

[Diagram: Activation memory versus communication. Replicated sequence: high memory. Sequence sharded: lower memory. Transition cost: Gather and reduce-scatter can dominate small workloads. Context benefit: Long sequences fit with less activation replication.]
The crossover depends on sequence length, rank count, and fabric bandwidth.

How to read this diagram: The bars compare Replicated sequence with Sequence sharded on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “Long sequences fit with less activation replication.”, remains larger than the risk, “Gather and reduce-scatter can dominate small workloads.”, under production traffic.

Padding and variable lengths complicate partitioning. Equal token counts can represent unequal useful work, especially with packed sequences or block-sparse masks. Partition metadata must preserve position IDs, masks, and sample boundaries so collective reconstruction cannot mix independent requests.
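The gap between equal token counts and equal work is easy to quantify for a causal mask. The sketch below counts the query-key pairs each rank must score when tokens are split into contiguous equal shards; `causal_pairs_per_rank` is an illustrative name, and pair count is used as a simple proxy for attention work.

```python
def causal_pairs_per_rank(doc_ids, degree):
    """Count (query, key) pairs per rank under a packed causal mask."""
    n = len(doc_ids)
    local = n // degree
    work = []
    for rank in range(degree):
        pairs = 0
        for qi in range(rank * local, (rank + 1) * local):
            # Query qi scores earlier keys from the same document only.
            pairs += sum(1 for j in range(qi + 1) if doc_ids[j] == doc_ids[qi])
        work.append(pairs)
    return work


one_doc = [0] * 8                    # a single 8-token sequence
packed = [0, 0, 0, 0, 0, 0, 1, 1]    # a 6-token and a 2-token document

print(causal_pairs_per_rank(one_doc, 2))  # [10, 26]
print(causal_pairs_per_rank(packed, 2))   # [10, 14]
```

Both ranks own four tokens in each case, yet the second rank does 2.6 times the work for the single document and 1.4 times for the packed batch, so balance depends on where document boundaries fall and not only on shard size.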

[Diagram: Layout state across a transformer block. Token-sharded: local sequence slice. Transitioning: collective in flight. Feature-sharded: linear op layout. Restored: token ownership clear. Invariant: Every tensor carries an explicit layout descriptor, never an inferred convention.]
Most sequence-parallel bugs are mismatched layouts rather than wrong arithmetic.

How to read this diagram: State advances from Token-sharded to Transitioning, Feature-sharded, and finally Restored. The labels below each state identify what becomes true at that boundary. The governing invariant is: Every tensor carries an explicit layout descriptor, never an inferred convention. Retries and cancellation must preserve the same transition rules.
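One way to make the layout descriptor explicit is a small tag that travels with each tensor and is checked at every transition. The sketch below is an assumption about how such a contract could be encoded, not a description of any framework's API: `Shard`, `Layout`, `TRANSITIONS`, and `transition` are hypothetical names, and the table models a Megatron-style block with two sharded linear layers.

```python
from dataclasses import dataclass
from enum import Enum


class Shard(Enum):
    TOKEN = "token"              # sequence dimension split across ranks
    REPLICATED = "replicated"    # every rank holds the full sequence
    FEATURE = "feature"          # feature dimension split across ranks
    PARTIAL_SUM = "partial_sum"  # full shape, values still need a reduction


@dataclass(frozen=True)
class Layout:
    shard: Shard
    degree: int
    rank: int


# Hypothetical contract: which operation is legal on which layout.
TRANSITIONS = {
    ("all_gather", Shard.TOKEN): Shard.REPLICATED,
    ("column_linear", Shard.REPLICATED): Shard.FEATURE,
    ("row_linear", Shard.FEATURE): Shard.PARTIAL_SUM,
    ("reduce_scatter", Shard.PARTIAL_SUM): Shard.TOKEN,
}


def transition(op: str, layout: Layout) -> Layout:
    """Return the layout after `op`, or raise if the pairing is invalid."""
    target = TRANSITIONS.get((op, layout.shard))
    if target is None:
        raise ValueError(f"{op} is not valid on a {layout.shard.value} tensor")
    return Layout(target, layout.degree, layout.rank)


start = Layout(Shard.TOKEN, degree=4, rank=0)
layout = start
for op in ("all_gather", "column_linear", "row_linear", "reduce_scatter"):
    layout = transition(op, layout)
assert layout == start  # token ownership is restored at the end of the block
```

A mismatched collective, such as a reduce-scatter applied to an already token-sharded tensor, then fails loudly at the transition instead of producing output with plausible shapes and wrong token order.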

[Diagram: Four correctness boundaries. Positions: global token indices; rotary embedding offsets. Masks: causal and packed rules; no cross-sample attention. Layouts: token versus feature shard; collective pairing. Topology: rank group membership; uneven length policy. Note: Validate distributed output against a single-rank reference across ragged inputs.]
Sequence sharding must preserve both tensor values and token identity.

How to read this diagram: The four panels are independent review axes: Positions, Masks, Layouts, and Topology. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Validate distributed output against a single-rank reference across ragged inputs.

[Diagram: A layout mismatch can stay numerically plausible. Metadata drifts: ranks disagree on split; shapes still align. Collective runs: tokens are reordered; no runtime error. Output corrupts: attention uses wrong state; quality silently falls. Control: assert layout descriptors; golden distributed tests. Note: Shape compatibility is not proof of semantic layout compatibility.]
Distributed inference needs identity checks around every layout transition.

How to read this diagram: This is a causal chain, not four unrelated symptoms. "Metadata drifts" leads to "Collective runs", which leads to "Output corrupts". The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

The takeaway

Sequence parallelism is a memory win with a communication bill. Say exactly which sequence regions are sharded and who owns the attention context.