Sequence Parallelism: Divide the Tokens, Not the Meaning

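Long sequences create large activations even when model weights fit comfortably. Sequence parallelism reduces activation memory per device by splitting token positions across devices instead of replicating every activation on every rank. The phrase is overloaded, however, and confusing the two techniques it commonly names (Megatron-style sequence parallelism and full context parallelism) leads to bad sizing decisions.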

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A long document can be divided among reviewers for local editing, but any question comparing distant paragraphs requires information exchange. Token-local operations are easy to split; attention is the conversation across partitions.

[Figure: Sequence Parallelism, request path. Three stages: long token sequence (shard token positions; local token operations); communication point (gather needed context; run attention safely); sharded output (reduce or retain shards; continue to next layer).]
Follow the state and work from left to right.

What actually happens

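In Megatron-style sequence parallelism paired with tensor parallelism, operations such as LayerNorm and Dropout are computed on sequence shards. At the boundaries of tensor-parallel regions, the usual all-reduce is split into a reduce-scatter and an all-gather, so the activations between those regions are stored sharded along the sequence rather than replicated on every rank.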
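
The equivalence behind that swap can be checked in a single process. The sketch below is a NumPy simulation, not a distributed implementation: the "ranks" are list entries, and the helper names `all_reduce`, `reduce_scatter`, and `all_gather` are illustrative stand-ins for the real collectives. It assumes the sequence length divides evenly by the tensor-parallel degree.

```python
import numpy as np

def all_reduce(partials):
    """Every rank ends with the full summed tensor (replicated storage)."""
    total = np.sum(partials, axis=0)
    return [total.copy() for _ in partials]

def reduce_scatter(partials, axis=0):
    """Sum across ranks, then leave rank r holding only the r-th sequence shard."""
    total = np.sum(partials, axis=0)
    return np.array_split(total, len(partials), axis=axis)

def all_gather(shards, axis=0):
    """Every rank reassembles the full tensor from all shards."""
    full = np.concatenate(shards, axis=axis)
    return [full.copy() for _ in shards]

rng = np.random.default_rng(0)
tp, seq_len, hidden = 4, 8, 6  # illustrative sizes only

# Each simulated tensor-parallel rank holds a partial sum of the same output.
partials = [rng.standard_normal((seq_len, hidden)) for _ in range(tp)]

replicated = all_reduce(partials)     # baseline: full activations on every rank
shards = reduce_scatter(partials)     # sequence-parallel: seq_len / tp rows per rank
# Token-local work such as LayerNorm or Dropout would run here on each shard.
regathered = all_gather(shards)       # restore full activations for the next TP region

same_result = all(np.allclose(a, b) for a, b in zip(replicated, regathered))
shard_rows = shards[0].shape[0]
print(same_result, shard_rows)
```

Between the two collectives each rank stores `seq_len / tp` rows instead of `seq_len`, which is where the activation saving comes from.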

Context parallelism goes further: it partitions the sequence for all modules, including attention. Queries may stay local, but attention needs keys and values from the full causal context, requiring ring exchange or all-gather-like communication.

For inference, prefill benefits most because many prompt tokens are processed together. Decode introduces one new position at a time, so communication patterns and KV ownership can dominate. State layout must remain compatible with the serving engine.

A worked example

A prompt has 32,768 tokens and activation width 8,192. Splitting the sequence four ways gives each rank 8,192 positions for token-local operations. Attention still needs access to the relevant K and V context, so ranks exchange blocks rather than pretending the sequence is independent.
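
The arithmetic can be spelled out in a few lines. This is a minimal sketch that assumes 2 bytes per element (for example bf16, which the article does not specify) and counts a single activation tensor of shape [tokens, width] for one sequence; `activation_bytes` is an illustrative helper.

```python
def activation_bytes(tokens, width, bytes_per_elem=2):
    """Bytes for one [tokens, width] activation tensor (2-byte elements assumed)."""
    return tokens * width * bytes_per_elem

seq_len, width, sp_degree = 32_768, 8_192, 4

tokens_per_rank = seq_len // sp_degree
full_mib = activation_bytes(seq_len, width) / 2**20
per_rank_mib = activation_bytes(tokens_per_rank, width) / 2**20

print(tokens_per_rank, full_mib, per_rank_mib)  # 8192 512.0 128.0
```

Under those assumptions, one such tensor shrinks from 512 MiB to 128 MiB per rank in the sharded regions; a real layer holds several of them.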

The performance model

Activation memory falls roughly in proportion to the sequence-parallel degree for the sharded regions. Communication grows with the number of shard boundaries and with the volume of K and V exchanged for attention. The best degree depends on sequence length, hidden size, topology, and how much of the communication can overlap with compute.

Expert lens

Be explicit in design documents: Megatron sequence parallelism and full context parallelism are not synonyms. The former shards selected activations around TP; the latter shards the complete sequence and changes attention communication.

[Figure: Sequence Parallelism, the tradeoff. Replicated sequence: every rank holds token activations; simple local operations; high activation memory; no sequence collectives. Sequence-sharded execution: ranks own token ranges; lower activation memory; context exchange for attention; topology-sensitive overlap.]
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Where it wins

  • Long-context prefill
  • Tensor-parallel models with large activation memory
  • Fast interconnects that can overlap exchange

Where it disappoints

  • Calling context parallelism and sequence parallelism identical
  • Ignoring causal attention communication
  • Assuming decode benefits match prefill
  • Mismatching KV ownership after sharding

Production checklist

  • Name the exact sequence-sharding variant
  • Model activation memory by layer
  • Profile context exchange on target fabric
  • Validate causal masks across partitions
  • Test variable and packed sequence lengths

What to measure

  • Activation bytes per rank
  • All-gather and reduce-scatter time
  • Attention context-exchange bandwidth
  • Prefill latency by context length
  • Load imbalance from variable sequences

From one GPU to a production service

A proof of concept usually uses equal fixed-length sequences. A service sees padding, packed samples, multimodal spans, and cache hits. Partition metadata must carry true lengths and boundaries so distributed attention does not compute or communicate meaningless tokens.
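
One way to carry that metadata is to derive a document ID and an in-document position for every token from the true lengths, then slice both arrays per rank. The sketch below assumes contiguous sharding and a packed length that is already a multiple of the context-parallel degree; `shard_metadata` is a hypothetical helper, not an API from any serving engine.

```python
import numpy as np

def shard_metadata(doc_lengths, cp_degree, rank):
    """Return (doc_id, position_in_doc) for the tokens owned by `rank`."""
    doc_lengths = np.asarray(doc_lengths)
    total = int(doc_lengths.sum())
    if total % cp_degree != 0:
        raise ValueError("pad the packed batch to a multiple of cp_degree")

    # Global per-token metadata derived from the true document lengths.
    doc_id = np.repeat(np.arange(len(doc_lengths)), doc_lengths)
    starts = np.cumsum(doc_lengths) - doc_lengths
    position = np.arange(total) - np.repeat(starts, doc_lengths)

    # Contiguous sharding: rank r owns tokens [r * per_rank, (r + 1) * per_rank).
    per_rank = total // cp_degree
    owned = slice(rank * per_rank, (rank + 1) * per_rank)
    return doc_id[owned], position[owned]

# Three packed documents of lengths 3, 5, and 4, split across two ranks.
ids, pos = shard_metadata([3, 5, 4], cp_degree=2, rank=1)
print(ids.tolist(), pos.tolist())  # [1, 1, 2, 2, 2, 2] [3, 4, 0, 1, 2, 3]
```

Rank 1 starts in the middle of document 1, so its first local token has in-document position 3, not 0. Any scheme that restarts numbering at the shard boundary would lose that information.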

Topology determines the maximum useful degree. Splitting the sequence across nodes reduces activation memory but may exchange the full KV context over a slower fabric. Sometimes a lower context-parallel degree plus FlashAttention is faster.

Routing and admission should recognize long-context cost. A request that needs CP8 cannot be sent to a replica configured for CP2, and building every replica for the largest context wastes resources. Expose context tiers as explicit model routes.
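
A tiered route table might look like the following sketch. Everything in it is hypothetical: the route names, the context-parallel degree per tier, and the token limits are placeholders chosen for illustration, and the admission rule (prompt plus requested output tokens must fit the tier) is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextTier:
    route: str        # hypothetical route name
    cp_degree: int    # context-parallel degree the replica is built with
    max_tokens: int   # largest context the tier admits (placeholder value)

TIERS = [
    ContextTier("model-short", cp_degree=1, max_tokens=16_384),
    ContextTier("model-long", cp_degree=2, max_tokens=65_536),
    ContextTier("model-xlong", cp_degree=8, max_tokens=262_144),
]

def pick_route(prompt_tokens, max_new_tokens, tiers=TIERS):
    """Send the request to the smallest tier whose context limit it fits."""
    needed = prompt_tokens + max_new_tokens
    for tier in sorted(tiers, key=lambda t: t.max_tokens):
        if needed <= tier.max_tokens:
            return tier
    raise ValueError(f"request needs {needed} tokens; no tier admits it")

print(pick_route(30_000, 1_000).route)  # model-long
```

Choosing the smallest admitting tier keeps short requests off the expensive high-degree replicas, and rejecting oversized requests at admission avoids discovering the mismatch on the replica.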

Design-review questions

  • Is this Megatron SP or full context parallelism?
  • Which activations and KV blocks are sharded?
  • How are packed boundaries preserved across ranks?
  • What context lengths justify the communication?
  • Are separate long-context replicas more efficient?

How it connects to the rest of the series

Tensor parallelism creates the activation replication that Megatron sequence parallelism reduces. FlashAttention optimizes local attention IO, while context parallelism distributes long attention across devices.

From equation to implementation

Megatron sequence parallelism commonly replaces replicated activations around tensor-parallel regions with reduce-scatter in one direction and all-gather in the other. It does not necessarily shard the attention context itself. Full context parallelism partitions all sequence activations and must exchange K and V information for attention.

For causal attention, rank r needs keys and values from its own and earlier token partitions. Ring implementations circulate KV blocks while each rank accumulates partial attention using numerically stable online softmax. Communication can overlap local attention computation.

Implementation sketch

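# Pseudocode: the helper names are placeholders, not a real library API.
def forward(tokens, cp_rank):
    # Each rank keeps only its own slice of token positions.
    tokens_local = shard_sequence(tokens, cp_rank)
    for layer in transformer:
        # Token-local work needs no communication.
        q_local, k_local, v_local = project(layernorm(tokens_local))
        state = init_online_softmax()
        # K/V blocks from every rank pass through; skip those in the causal future.
        for kv_block in ring_exchange(k_local, v_local):
            if block_is_causally_visible(kv_block):
                state.update(attend(q_local, kv_block))
        # Residual connection around attention, then the token-local MLP block.
        tokens_local = mlp_and_residual(tokens_local + state.output)
    return return_sharded_or_gather(tokens_local)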
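
The attention part of the sketch above can be made concrete as a single-process NumPy simulation. It is an illustration under stated assumptions, not a distributed implementation: one head, no batch dimension, contiguous equal-sized shards, and the "ring" is simulated by indexing into the full K and V arrays. The function names are illustrative.

```python
import numpy as np

def ring_causal_attention(q, k, v, cp_degree):
    """Simulate ring attention over contiguous sequence shards in one process."""
    seq_len, dim = q.shape
    assert seq_len % cp_degree == 0
    n = seq_len // cp_degree
    scale = 1.0 / np.sqrt(dim)
    out = np.empty_like(q)
    for rank in range(cp_degree):
        q_loc = q[rank * n:(rank + 1) * n]
        q_pos = np.arange(rank * n, (rank + 1) * n)[:, None]
        # Online-softmax state: running max, denominator, and weighted sum.
        m = np.full((n, 1), -np.inf)
        l = np.zeros((n, 1))
        acc = np.zeros((n, dim))
        # Step 0 is the rank's own block, so every row sees at least its
        # diagonal entry and the running max is finite from the first update.
        for step in range(cp_degree):
            src = (rank - step) % cp_degree  # owner of the block arriving now
            if src > rank:
                continue  # block lies entirely in the causal future: skip compute
            k_blk = k[src * n:(src + 1) * n]
            v_blk = v[src * n:(src + 1) * n]
            k_pos = np.arange(src * n, (src + 1) * n)[None, :]
            scores = (q_loc @ k_blk.T) * scale
            scores = np.where(k_pos <= q_pos, scores, -np.inf)
            m_new = np.maximum(m, scores.max(axis=1, keepdims=True))
            correction = np.exp(m - m_new)   # rescale earlier partial results
            p = np.exp(scores - m_new)
            l = l * correction + p.sum(axis=1, keepdims=True)
            acc = acc * correction + p @ v_blk
            m = m_new
        out[rank * n:(rank + 1) * n] = acc / l
    return out

def reference_causal_attention(q, k, v):
    """Plain single-device causal attention for comparison."""
    scores = (q @ k.T) / np.sqrt(q.shape[1])
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (weights / weights.sum(axis=1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
matches = np.allclose(ring_causal_attention(q, k, v, cp_degree=4),
                      reference_causal_attention(q, k, v))
print(matches)
```

With contiguous shards, rank 0 skips most blocks while the last rank processes all of them, so the simulation also makes the causal load imbalance visible.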

Capacity planning

Compute activation bytes per rank for every region, then add communication buffers, which double when ring transfers are double-buffered. Higher context-parallel degrees reduce local memory but increase the total KV exchanged. The break-even point shifts with the interconnect (NVLink versus InfiniBand) and with sequence length.
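
A back-of-the-envelope estimator makes the tradeoff visible. The sketch below is a rough per-layer model under several assumptions: 2-byte elements, one sequence, a K/V width equal to the hidden size (grouped-query attention would make it smaller), two ring buffers per rank, and a ring that circulates every block to every rank without causal skipping. `cp_capacity` is an illustrative helper.

```python
def cp_capacity(seq_len, hidden, kv_hidden, cp_degree,
                bytes_per_elem=2, ring_buffers=2):
    """Per-layer byte estimates for one sequence under context parallelism."""
    tokens_per_rank = seq_len // cp_degree
    activation = tokens_per_rank * hidden * bytes_per_elem
    # One ring block carries both K and V for a rank's token range.
    kv_block = 2 * tokens_per_rank * kv_hidden * bytes_per_elem
    return {
        "activation_per_rank": activation,
        "ring_buffers_per_rank": ring_buffers * kv_block,
        # In a full ring pass each rank receives every other rank's block once.
        "kv_received_per_rank": (cp_degree - 1) * kv_block,
        "kv_exchanged_total": cp_degree * (cp_degree - 1) * kv_block,
    }

MiB = 2**20
for degree in (2, 4, 8):
    est = cp_capacity(32_768, 8_192, 8_192, degree)
    print(degree, {name: value // MiB for name, value in est.items()})
```

For these inputs, going from degree 4 to degree 8 halves the activation tensor per rank (128 MiB to 64 MiB) while the total K/V moved per layer more than doubles (3,072 MiB to 7,168 MiB). Whether that trade pays off depends on the fabric and on how much of the transfer overlaps with compute.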

Benchmarking without fooling yourself

  • Test Megatron SP and full CP as distinct configurations.
  • Sweep context length while holding total tokens constant.
  • Profile communication-compute overlap, not only collective totals.
  • Include packed variable-length sequences and causal masks.

A production failure to design for

A packed batch places document boundaries inside sequence shards, but the distributed attention mask loses one boundary offset. Tokens attend across documents only when a boundary crosses ranks, so tests that never split a packed batch across ranks will not catch it. Build mask tests with tiny hand-checkable packed sequences.
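
A mask test of that kind can be very small. The sketch below compares every per-rank mask block against the matching slice of a global reference mask. It assumes contiguous shards of equal size. The deliberately broken variant, which renumbers documents from zero on each shard, is one hypothetical way a boundary offset could be lost; it is not taken from any real system, and all function names are illustrative.

```python
import numpy as np

def global_mask(doc_lengths):
    """Reference mask: causal, and only within the same packed document."""
    doc_id = np.repeat(np.arange(len(doc_lengths)), doc_lengths)
    pos = np.arange(doc_id.size)
    return (doc_id[:, None] == doc_id[None, :]) & (pos[None, :] <= pos[:, None])

def mask_block(doc_lengths, cp_degree, q_rank, kv_rank, renumber_per_shard=False):
    """Mask block for (query shard, key/value shard) built from shard metadata."""
    doc_id = np.repeat(np.arange(len(doc_lengths)), doc_lengths)
    n = doc_id.size // cp_degree
    q_pos = np.arange(q_rank * n, (q_rank + 1) * n)
    k_pos = np.arange(kv_rank * n, (kv_rank + 1) * n)
    q_ids, k_ids = doc_id[q_pos], doc_id[k_pos]
    if renumber_per_shard:
        # Deliberate bug: each shard restarts document numbering at zero.
        q_ids, k_ids = q_ids - q_ids[0], k_ids - k_ids[0]
    return (q_ids[:, None] == k_ids[None, :]) & (k_pos[None, :] <= q_pos[:, None])

def partitioned_mask_is_correct(doc_lengths, cp_degree, **kwargs):
    """Check every rank-pair block against the matching slice of the reference."""
    expected = global_mask(doc_lengths)
    n = expected.shape[0] // cp_degree
    for q_rank in range(cp_degree):
        for kv_rank in range(cp_degree):
            block = mask_block(doc_lengths, cp_degree, q_rank, kv_rank, **kwargs)
            want = expected[q_rank * n:(q_rank + 1) * n, kv_rank * n:(kv_rank + 1) * n]
            if not np.array_equal(block, want):
                return False
    return True

# Documents of lengths 3, 5, and 4: the middle one straddles the rank boundary.
lengths = [3, 5, 4]
good = partitioned_mask_is_correct(lengths, cp_degree=2)
bad = partitioned_mask_is_correct(lengths, cp_degree=2, renumber_per_shard=True)
print(good, bad)  # True False
```

In a real test suite, `mask_block` would be replaced by whatever the distributed attention path actually produces; the comparison against a global reference stays the same.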

[Figure: Operational loop. Define: SP versus full CP; ownership by rank. Model: activation and KV bytes; communication path. Verify: causal packed masks; numerical parity. Tune: ring overlap and degree; context-length policy.]
Treat optimization as a measured loop, not a one-time flag.

The takeaway

Sequence parallelism is a memory win with a communication bill. Say exactly which sequence regions are sharded and who owns the attention context.