12/20 - Sequence Parallelism: Divide the Tokens, Not the Meaning
Long sequences create large activations even when model weights fit comfortably, and tensor parallelism alone leaves some of those activations replicated on every rank. Sequence parallelism reduces that replication by splitting token positions across devices. The phrase is overloaded, however, and confusing the two techniques it can refer to leads to bad sizing decisions.
This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.
Start with the intuition
A long document can be divided among reviewers for local editing, but any question comparing distant paragraphs requires information exchange. Token-local operations are easy to split; attention is the conversation across partitions.
How to read this diagram: Start with Long token sequence, where token positions are sharded. The middle stage, Communication point, gathers the needed context. The final stage, Sharded output, shows the observable result: the output is either reduced or kept as shards. The arrows describe dependency order, not necessarily separate services.
What actually happens
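In Megatron-style sequence parallelism paired with tensor parallelism, operations such as LayerNorm and Dropout are computed on sequence shards. The tensor-parallel all-reduce is split into a reduce-scatter on the way out of a tensor-parallel region and an all-gather on the way back in, so the activations between those regions are stored sharded along the sequence instead of replicated.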
Context parallelism goes further: it partitions the sequence for all modules, including attention. Queries may stay local, but attention needs keys and values from the full causal context, requiring ring exchange or all-gather-like communication.
For inference, prefill benefits most because many prompt tokens are processed together. Decode introduces one new position at a time, so communication patterns and KV ownership can dominate. State layout must remain compatible with the serving engine.
A worked example
A prompt has 32,768 tokens and activation width 8,192. Splitting the sequence four ways gives each rank 8,192 positions for token-local operations. Attention still needs access to the relevant K and V context, so ranks exchange blocks rather than pretending the sequence is independent.
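The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes a batch of one, a single [tokens, width] activation tensor, and 2 bytes per element (bf16 or fp16); the element size and the helper name `activation_bytes` are assumptions, not figures from the example.

```python
def activation_bytes(tokens: int, width: int, bytes_per_element: int = 2) -> int:
    """Bytes for one [tokens, width] activation tensor (assumed 2-byte elements)."""
    return tokens * width * bytes_per_element


SEQ_LEN = 32_768   # prompt tokens from the worked example
WIDTH = 8_192      # activation width from the worked example
SP_DEGREE = 4      # sequence split four ways

tokens_per_rank = SEQ_LEN // SP_DEGREE
full = activation_bytes(SEQ_LEN, WIDTH)
local = activation_bytes(tokens_per_rank, WIDTH)

print(tokens_per_rank)         # 8192
print(full // 2**20, "MiB")    # 512 MiB
print(local // 2**20, "MiB")   # 128 MiB
```

The saving applies only to regions that are actually sequence-sharded; any K and V blocks a rank receives for attention are extra memory on top of the local figure.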
The performance model
Activation memory falls roughly with the sequence-parallel degree for the sharded regions. Communication grows with boundaries and attention context exchange. The best degree depends on sequence length, hidden size, topology, and overlap.
How to read this diagram: The left panel asks how sequence parallelism changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.
Expert lens
Be explicit in design documents: Megatron sequence parallelism and full context parallelism are not synonyms. The former shards selected activations around TP; the latter shards the complete sequence and changes attention communication.
How to read this diagram: The left panel is the baseline, Replicated sequence, in which every rank holds all token activations and local operations stay simple. The right panel applies Sequence-sharded execution, which changes the cost profile: ranks own token ranges and activation memory is lower. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.
Where it wins
- Long-context prefill
- Tensor-parallel models with large activation memory
- Fast interconnects that can overlap exchange
Where it disappoints
- Calling context parallelism and sequence parallelism identical
- Ignoring causal attention communication
- Assuming decode benefits match prefill
- Mismatching KV ownership after sharding
Production checklist
- Name the exact sequence-sharding variant
- Model activation memory by layer
- Profile context exchange on target fabric
- Validate causal masks across partitions
- Test variable and packed sequence lengths
What to measure
- Activation bytes per rank
- All-gather and reduce-scatter time
- Attention context-exchange bandwidth
- Prefill latency by context length
- Load imbalance from variable sequences
From one GPU to a production service
A proof of concept usually uses equal fixed-length sequences. A service sees padding, packed samples, multimodal spans, and cache hits. Partition metadata must carry true lengths and boundaries so distributed attention does not compute or communicate meaningless tokens.
Topology determines the maximum useful degree. Splitting sequence across nodes reduces activation memory but may exchange the full KV context over a slower fabric. Sometimes a lower context-parallel degree plus FlashAttention is faster.
Routing and admission should recognize long-context cost. A request that needs CP8 cannot be sent to a replica configured for CP2, and building every replica for the largest context wastes resources. Expose context tiers as explicit model routes.
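Context tiers can be expressed as a small lookup at admission time. The sketch below is illustrative only: the tier limits, the route names such as `chat-cp8`, and the `pick_route` helper are all hypothetical, and a real router would also consider load and cache locality.

```python
from bisect import bisect_left

# Hypothetical tiers: (maximum total context in tokens, route name).
# Must be sorted by limit, smallest first.
TIERS = [
    (8_192, "chat-cp1"),
    (32_768, "chat-cp2"),
    (131_072, "chat-cp8"),
]


def pick_route(prompt_tokens: int, max_new_tokens: int) -> str:
    """Return the smallest tier whose limit covers prompt plus generation."""
    needed = prompt_tokens + max_new_tokens
    limits = [limit for limit, _ in TIERS]
    index = bisect_left(limits, needed)   # first tier with limit >= needed
    if index == len(TIERS):
        raise ValueError(
            f"request needs {needed} tokens; largest tier holds {limits[-1]}"
        )
    return TIERS[index][1]


print(pick_route(8_000, 192))      # chat-cp1
print(pick_route(8_000, 193))      # chat-cp2
print(pick_route(100_000, 1_000))  # chat-cp8
```

Rejecting oversized requests at admission is deliberate: it is cheaper than discovering on a replica that the context does not fit its parallel layout.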
Design-review questions
- Is this Megatron SP or full context parallelism?
- Which activations and KV blocks are sharded?
- How are packed boundaries preserved across ranks?
- What context lengths justify the communication?
- Are separate long-context replicas more efficient?
How it connects to the rest of the series
Tensor parallelism creates the activation replication that Megatron sequence parallelism reduces. FlashAttention optimizes local attention IO, while context parallelism distributes long attention across devices.
From equation to implementation
Megatron sequence parallelism commonly replaces replicated activations around tensor-parallel regions with reduce-scatter in one direction and all-gather in the other. It does not necessarily shard the attention context itself. Full context parallelism partitions all sequence activations and must exchange K and V information for attention.
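The reduce-scatter and all-gather pairing can be illustrated with a single-process simulation. This sketch uses plain Python lists in place of device tensors, treats each list index as a token position, and defines its own toy `reduce_scatter`, `all_gather`, and `all_reduce` helpers; it is not the API of any communication library.

```python
def reduce_scatter(per_rank):
    """per_rank[r] is rank r's full-length vector of partial sums.
    Returns shards[r]: fully summed values for rank r's token slice only."""
    world = len(per_rank)
    n = len(per_rank[0])
    assert n % world == 0, "sequence length must divide evenly across ranks"
    chunk = n // world
    return [
        [sum(vec[i] for vec in per_rank) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(world)
    ]


def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(per_rank):
    """Every rank ends up with the full elementwise sum."""
    total = [sum(column) for column in zip(*per_rank)]
    return [list(total) for _ in per_rank]


# Two tensor-parallel ranks, four token positions, partial sums per rank.
partials = [[1, 2, 3, 4], [10, 20, 30, 40]]
shards = reduce_scatter(partials)
print(shards)  # [[11, 22], [33, 44]]

# Reduce-scatter followed by all-gather reproduces the all-reduce result.
assert all_gather(shards) == all_reduce(partials)
```

Between the two collectives each rank holds only its own token slice, which is where token-local operations such as LayerNorm and Dropout can run on sharded activations.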
For causal attention, rank r needs keys and values from its own and earlier token partitions. Ring implementations circulate KV blocks while each rank accumulates partial attention using numerically stable online softmax. Communication can overlap local attention computation.
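The ring pattern and the online-softmax accumulation can be checked against a single-rank reference on a tiny input. The following is a sequential, single-process sketch: it assumes one attention head, equal contiguous shards, and scalar Python arithmetic, and it updates the softmax state one key at a time rather than one block at a time. All function names are illustrative.

```python
import math


def score(q_vec, k_vec):
    """Scaled dot-product score for one query/key pair."""
    return sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(len(q_vec))


def reference_causal_attention(q, k, v):
    """Single-rank causal attention with an ordinary softmax."""
    out = []
    for i, q_vec in enumerate(q):
        scores = [score(q_vec, k[j]) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / norm
                    for c in range(len(v[0]))])
    return out


def ring_causal_attention(q, k, v, world):
    """Simulate each rank consuming KV blocks in ring order."""
    block = len(q) // world
    dim = len(v[0])
    output = []
    for rank in range(world):
        rows = range(rank * block, (rank + 1) * block)
        running_max = {i: -math.inf for i in rows}
        denom = {i: 0.0 for i in rows}
        acc = {i: [0.0] * dim for i in rows}
        for step in range(world):
            src = (rank - step) % world      # owner of the block arriving at this step
            if src > rank:
                continue                     # block lies entirely in the future
            for i in rows:
                for j in range(src * block, (src + 1) * block):
                    if j > i:
                        continue             # causal mask inside the diagonal block
                    s = score(q[i], k[j])
                    new_max = max(running_max[i], s)
                    rescale = math.exp(running_max[i] - new_max)  # shrink old state
                    weight = math.exp(s - new_max)
                    denom[i] = denom[i] * rescale + weight
                    acc[i] = [a * rescale + weight * x for a, x in zip(acc[i], v[j])]
                    running_max[i] = new_max
        output.extend([a / denom[i] for a in acc[i]] for i in rows)
    return output


N, WORLD, DIM = 8, 4, 2
q = [[math.sin(i + 1), math.cos(2 * i)] for i in range(N)]
k = [[math.cos(i), math.sin(3 * i + 1)] for i in range(N)]
v = [[float(i), float(N - i)] for i in range(N)]

expected = reference_causal_attention(q, k, v)
actual = ring_causal_attention(q, k, v, WORLD)
max_err = max(abs(a - b) for ra, rb in zip(actual, expected) for a, b in zip(ra, rb))
assert max_err < 1e-9
```

In a real system the ranks run concurrently and the next block's transfer overlaps the current block's computation; the simulation only demonstrates that the accumulated result matches ordinary causal attention.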
Implementation sketch
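tokens_local = shard_sequence(tokens, cp_rank)    # this rank's token slice
for layer in transformer:
    local = layernorm(tokens_local)               # token-local, no communication
    q_local, k_local, v_local = project(local)
    state = init_online_softmax()
    # Circulate KV blocks around the ring; queries never leave the rank.
    for kv_block in ring_exchange(k_local, v_local):
        if block_is_causally_visible(kv_block):
            state.update(attend(q_local, kv_block))
    # The residual path needs the layer input as well as the attention output.
    tokens_local = mlp_and_residual(tokens_local, state.output)
return_sharded_or_gather(tokens_local)
Capacity planning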
Compute activation bytes per rank for every region, then add communication buffers, which may be double-buffered for ring transfers. A higher context-parallel degree reduces local memory but increases the total KV exchanged. The break-even point shifts with the interconnect (NVLink versus InfiniBand) and with sequence length.
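A rough per-layer KV estimate makes the trade-off concrete. This sketch assumes a simple ring in which every rank receives every other rank's KV block once with no causal skipping, two ring buffers per rank (one being consumed, one being received), and 2-byte elements. The values `kv_heads=8` and `head_dim=128` are hypothetical, as are the helper names.

```python
def kv_block_bytes(tokens_per_rank, kv_heads, head_dim, bytes_per_element=2):
    """Bytes for one rank's K and V block in a single layer."""
    return 2 * tokens_per_rank * kv_heads * head_dim * bytes_per_element


def ring_plan(seq_len, cp, kv_heads, head_dim):
    """Per-layer KV memory and traffic for context-parallel degree cp."""
    block = kv_block_bytes(seq_len // cp, kv_heads, head_dim)
    return {
        "local_kv_bytes": block,                       # shrinks as cp grows
        "ring_buffer_bytes": 2 * block,                # assumed double buffering
        "received_per_rank_bytes": (cp - 1) * block,   # blocks from all peers
        "total_exchanged_bytes": cp * (cp - 1) * block,  # grows as cp grows
    }


for cp in (2, 4, 8):
    plan = ring_plan(seq_len=32_768, cp=cp, kv_heads=8, head_dim=128)
    print(cp, {name: f"{value // 2**20} MiB" for name, value in plan.items()})
```

Under these assumptions the local block halves each time the degree doubles while the total traffic grows, which is why the estimate should be repeated per fabric rather than reused across clusters.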
Benchmarking without fooling yourself
- Test Megatron SP and full CP as distinct configurations.
- Sweep context length while holding total tokens constant.
- Profile communication-compute overlap, not only collective totals.
- Include packed variable-length sequences and causal masks.
A production failure to design for
A packed batch places document boundaries inside sequence shards, but the distributed attention mask loses one boundary offset. Tokens attend across documents only when a boundary crosses ranks, so the failure is intermittent and easy to miss in fixed-length tests. Build mask tests with tiny hand-checkable packed sequences.
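A hand-checkable mask test of this kind might look like the sketch below. It assumes each token carries a global document ID, that shards are equal and contiguous, and that each rank builds its mask one KV block at a time from explicit global offsets. The helper names are illustrative.

```python
def reference_mask(doc_ids):
    """Single-rank mask: causal and restricted to the same document."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]


def mask_tile(q_docs, kv_docs, q_offset, kv_offset):
    """Visibility of one KV block for one rank's queries.
    Offsets convert shard-local indices back to global positions."""
    return [
        [kv_offset + j <= q_offset + i and q_docs[i] == kv_docs[j]
         for j in range(len(kv_docs))]
        for i in range(len(q_docs))
    ]


def distributed_mask(doc_ids, world):
    """Assemble the full mask from per-rank, per-block tiles."""
    block = len(doc_ids) // world
    shards = [doc_ids[r * block:(r + 1) * block] for r in range(world)]
    rows = []
    for rank in range(world):
        tiles = [mask_tile(shards[rank], shards[src], rank * block, src * block)
                 for src in range(world)]
        for i in range(block):
            rows.append([cell for tile in tiles for cell in tile[i]])
    return rows


# Three packed documents of lengths 3, 2, and 3 over four ranks of two tokens.
# Every document spans two ranks, so every boundary case is exercised.
doc_ids = [0, 0, 0, 1, 1, 2, 2, 2]
assert distributed_mask(doc_ids, world=4) == reference_mask(doc_ids)
```

Dropping either offset in `mask_tile` makes the comparison fail on this input, which is the property a regression test for the boundary bug needs.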
How to read this diagram: The operating cycle moves from Define to Model, then Verify and Tune. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.
Deeper engineering guide
Sequence parallelism shards token positions for operations whose state would otherwise be replicated across tensor-parallel ranks. Layer normalization, dropout, residual paths, and some attention variants can operate on local token slices, reducing activation memory. Collective transitions restore the layouts required by sharded linear layers.
How to read this diagram: Follow the state from Partition tokens through Local ops and Transition to Continue. Each box is an ownership or computation boundary. In particular, layout transitions must match the tensor-parallel contract at every boundary. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.
The memory gain grows with sequence length, batch size, and hidden width. Communication can dominate for short sequences or poorly fused transitions. Context parallelism extends the idea to attention over very long sequences, where ranks must exchange K/V context or use ring-style algorithms.
How to read this diagram: The bars compare Replicated sequence with Sequence sharded on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “Long sequences fit with less activation replication”, remains larger than the risk, “Gather and reduce-scatter can dominate small workloads”, under production traffic.
Padding and variable lengths complicate partitioning. Equal token counts can represent unequal useful work, especially with packed sequences or block-sparse masks. Partition metadata must preserve position IDs, masks, and sample boundaries so collective reconstruction cannot mix independent requests.
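The gap between equal token counts and equal useful work can be measured by counting visible query-key pairs per rank. This sketch assumes contiguous equal shards, a causal mask restricted to the same document, and a hypothetical `PAD` marker for padding tokens; counting pairs is only a proxy for attention cost.

```python
PAD = -1  # hypothetical marker for padding positions


def useful_pairs_per_rank(doc_ids, world):
    """Count causal, same-document (query, key) pairs owned by each rank."""
    block = len(doc_ids) // world
    work = []
    for rank in range(world):
        pairs = 0
        for i in range(rank * block, (rank + 1) * block):
            if doc_ids[i] == PAD:
                continue                      # padding queries do no useful work
            pairs += sum(1 for j in range(i + 1) if doc_ids[j] == doc_ids[i])
        work.append(pairs)
    return work


# Twelve positions: a 6-token document, a 2-token document, 4 padding tokens.
doc_ids = [0] * 6 + [1] * 2 + [PAD] * 4
work = useful_pairs_per_rank(doc_ids, world=4)
imbalance = max(work) / (sum(work) / len(work))

print(work)       # [6, 15, 3, 0]
print(imbalance)  # 2.5
```

Every rank owns three positions here, yet one rank does 2.5 times the mean work and another does none, which is the kind of skew worth tracking under production traffic.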
How to read this diagram: State advances from Token-sharded to Transitioning, Feature-sharded, and finally Restored. The labels below each state identify what becomes true at that boundary. The governing invariant is: Every tensor carries an explicit layout descriptor, never an inferred convention. Retries and cancellation must preserve the same transition rules.
How to read this diagram: The four panels are independent review axes: Positions, Masks, Layouts, and Topology. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Validate distributed output against a single-rank reference across ragged inputs.
How to read this diagram: This is a causal chain, not four unrelated symptoms. The Metadata drifts stage triggers Collective runs, which produces Output corrupts. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.
The takeaway
Sequence parallelism is a memory win with a communication bill. Say exactly which sequence regions are sharded and who owns the attention context.
