Sequence Parallelism: Divide the Tokens, Not the Meaning

#sequence-parallelism #context-parallelism #long-context #multi-gpu #inference

Long sequences create large activations even when model weights fit comfortably, and under tensor parallelism some of those activations are replicated on every rank. Sequence parallelism reduces that footprint by splitting token positions across devices. The phrase is overloaded, however, and confusing two different techniques leads to bad sizing decisions.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A long document can be divided among reviewers for local editing, but any question comparing distant paragraphs requires information exchange. Token-local operations are easy to split; attention is the conversation across partitions.

To see where that conversation happens, follow one layer from input to output and track what each rank holds and what it must fetch from others.

What actually happens

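In Megatron-style sequence parallelism paired with tensor parallelism, token-local operations such as LayerNorm and Dropout are computed on sequence shards. The all-reduce at each tensor-parallel boundary is split into a reduce-scatter on the way out and an all-gather on the way in, so the activations between tensor-parallel regions are stored sharded instead of replicated.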
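The equivalence this relies on can be checked with a toy model. The sketch below simulates ranks as list entries in a single process; the function names mimic collective operations but are hypothetical stand-ins, not a real communication library. It shows that a reduce-scatter followed by an all-gather reproduces an all-reduce, while the intermediate result holds only a slice of the sequence per rank.

```python
# Toy single-process model: each "rank" is one list entry; no real communication.

def all_reduce(partials):
    """Every rank ends up with the full elementwise sum (replicated)."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

def reduce_scatter(partials):
    """Rank r ends up with only the r-th contiguous slice of the sum."""
    world = len(partials)
    total = [sum(vals) for vals in zip(*partials)]
    shard = len(total) // world
    return [total[r * shard:(r + 1) * shard] for r in range(world)]

def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# Two tensor-parallel ranks hold partial sums for 4 token positions.
partials = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]

replicated = all_reduce(partials)    # 4 positions stored on every rank
sharded = reduce_scatter(partials)   # 2 positions per rank: the sequence-sharded region
regathered = all_gather(sharded)     # restored before the next tensor-parallel region
```

Token-local operations would run on `sharded`, which is where the activation memory saving comes from.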

Context parallelism goes further: it partitions the sequence for all modules, including attention. Queries may stay local, but attention needs keys and values from the full causal context, requiring ring exchange or all-gather-like communication.

For inference, prefill benefits most because many prompt tokens are processed together. Decode introduces one new position at a time, so communication patterns and KV ownership can dominate latency. The sharded KV layout must remain compatible with what the serving engine expects.

A worked example

A prompt has 32,768 tokens and activation width 8,192. Splitting the sequence four ways gives each rank 8,192 positions for token-local operations. Attention still needs access to the relevant K and V context, so ranks exchange blocks rather than pretending the sequence is independent.
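The arithmetic can be made concrete under a few stated assumptions: 2 bytes per element (for example bf16), batch size 1, and a single activation tensor of shape sequence by width. The helper name is hypothetical and the sketch ignores attention internals and communication buffers.

```python
MIB = 1024 ** 2

def activation_bytes(seq_len, hidden, bytes_per_elem=2, sp_degree=1):
    """Bytes for one [seq_len / sp_degree, hidden] activation tensor on one rank."""
    if seq_len % sp_degree:
        raise ValueError("seq_len must divide evenly by the sequence-parallel degree")
    return (seq_len // sp_degree) * hidden * bytes_per_elem

full = activation_bytes(32_768, 8_192)                  # unsharded tensor
per_rank = activation_bytes(32_768, 8_192, sp_degree=4) # 8,192 positions per rank

print(full // MIB, per_rank // MIB)  # 512 128
```

A real layer holds several such tensors at once, so the per-layer saving is a multiple of this figure.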

The performance model

Activation memory falls roughly in proportion to the sequence-parallel degree for the sharded regions. Communication grows with the number of shard boundaries and with the K/V volume exchanged for attention. The best degree depends on sequence length, hidden size, topology, and how much of the exchange overlaps with compute.

Expert lens

Be explicit in design documents: Megatron sequence parallelism and full context parallelism are not synonyms. The former shards selected activations around TP; the latter shards the complete sequence and changes attention communication.

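Both variants trade activation memory for communication. They differ in where the communication lands: at tensor-parallel boundaries for Megatron sequence parallelism, inside attention for context parallelism.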

Where it wins

Long-context prefill
Tensor-parallel models with large activation memory
Fast interconnects that can overlap exchange

Where it disappoints

Calling context parallelism and sequence parallelism identical
Ignoring causal attention communication
Assuming decode benefits match prefill
Mismatching KV ownership after sharding

Production checklist

Name the exact sequence-sharding variant
Model activation memory by layer
Profile context exchange on target fabric
Validate causal masks across partitions
Test variable and packed sequence lengths

What to measure

Activation bytes per rank
All-gather and reduce-scatter time
Attention context-exchange bandwidth
Prefill latency by context length
Load imbalance from variable sequences

From one GPU to a production service

A proof of concept usually uses equal fixed-length sequences. A service sees padding, packed samples, multimodal spans, and cache hits. Partition metadata must carry true lengths and boundaries so distributed attention does not compute or communicate meaningless tokens.
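One way to carry that metadata is to derive, for each rank, the document pieces that fall inside its shard. The sketch below assumes contiguous equal-size shards and a cumulative boundary list; the name `cu_seqlens` follows a convention common in variable-length attention kernels, and `shard_segments` is a hypothetical helper.

```python
def shard_segments(cu_seqlens, world_size):
    """Return, per rank, a list of (doc_id, global_start, global_end) pieces.

    cu_seqlens holds cumulative document boundaries, e.g. [0, 5, 12, 16]
    for packed documents of length 5, 7, and 4.
    """
    total = cu_seqlens[-1]
    if total % world_size:
        raise ValueError("pad the packed batch so it divides evenly across ranks")
    shard = total // world_size
    per_rank = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        pieces = []
        for doc_id in range(len(cu_seqlens) - 1):
            # Intersect this rank's token range with the document's range.
            start = max(lo, cu_seqlens[doc_id])
            end = min(hi, cu_seqlens[doc_id + 1])
            if start < end:
                pieces.append((doc_id, start, end))
        per_rank.append(pieces)
    return per_rank

segments = shard_segments([0, 5, 12, 16], world_size=4)
for rank, pieces in enumerate(segments):
    print(rank, pieces)
```

Keeping global positions in the metadata, not rank-local ones, lets every rank rebuild the same mask without guessing offsets.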

Topology determines the maximum useful degree. Splitting sequence across nodes reduces activation memory but may exchange the full KV context over a slower fabric. Sometimes a lower context-parallel degree plus FlashAttention is faster.

Routing and admission should recognize long-context cost. A request that needs CP8 cannot be sent to a replica configured for CP2, and building every replica for the largest context wastes resources. Expose context tiers as explicit model routes.
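A minimal sketch of tiered routing follows. The tier limits and route names are placeholders invented for illustration, not values from any real deployment, and a production router would also weigh load and cache locality.

```python
from bisect import bisect_left

# Hypothetical tiers: (maximum context tokens, route name), sorted by limit.
TIERS = [
    (32_768, "chat-cp1"),
    (131_072, "chat-cp2"),
    (524_288, "chat-cp8"),
]

def route_for(prompt_tokens, max_new_tokens):
    """Pick the smallest tier whose context limit covers prompt plus generation."""
    needed = prompt_tokens + max_new_tokens
    limits = [limit for limit, _ in TIERS]
    idx = bisect_left(limits, needed)   # first tier with limit >= needed
    if idx == len(TIERS):
        raise ValueError(f"{needed} tokens exceeds the largest context tier")
    return TIERS[idx][1]

print(route_for(1_000, 500))      # chat-cp1
print(route_for(200_000, 1_000))  # chat-cp8
```

Rejecting oversized requests at admission is cheaper than discovering the mismatch on a replica that cannot hold the context.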

Design-review questions

Is this Megatron SP or full context parallelism?
Which activations and KV blocks are sharded?
How are packed boundaries preserved across ranks?
What context lengths justify the communication?
Are separate long-context replicas more efficient?

How it connects to the rest of the series

Tensor parallelism creates the activation replication that Megatron sequence parallelism reduces. FlashAttention optimizes local attention IO, while context parallelism distributes long attention across devices.

From equation to implementation

Megatron sequence parallelism commonly replaces replicated activations around tensor-parallel regions with reduce-scatter in one direction and all-gather in the other. It does not necessarily shard the attention context itself. Full context parallelism partitions all sequence activations and must exchange K and V information for attention.

For causal attention, rank r needs keys and values from its own and earlier token partitions. Ring implementations circulate KV blocks while each rank accumulates partial attention using numerically stable online softmax. Communication can overlap local attention computation.

Implementation sketch

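# Sketch only: every helper below is illustrative pseudocode, not a library API.
def context_parallel_forward(tokens, cp_rank):
    tokens_local = shard_sequence(tokens, cp_rank)
    for layer in transformer:
        local = layernorm(tokens_local)
        q_local, k_local, v_local = project(local)
        state = init_online_softmax()
        # Each ring step delivers the K/V block owned by one rank.
        for kv_block in ring_exchange(k_local, v_local):
            # Skip blocks that lie entirely in the future of the local queries.
            if block_is_causally_visible(kv_block):
                # attend() must still mask within the diagonal (own) block.
                state.update(attend(q_local, kv_block))
        # Pass the residual stream along with the attention output.
        tokens_local = mlp_and_residual(tokens_local, state.output)
    return return_sharded_or_gather(tokens_local)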
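To check the accumulation logic without any GPUs, the ring can be simulated in one process. The sketch below is a toy under stated assumptions: a single attention head, contiguous equal shards, plain Python lists, and ranks that read blocks from shared arrays where a real ring would receive them. All function names are hypothetical. It compares the ring result against ordinary causal attention.

```python
import math
import random

def score(q_vec, k_vec):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(len(q_vec))

def reference_causal_attention(q, k, v):
    out = []
    for i, q_vec in enumerate(q):
        scores = [score(q_vec, k[j]) for j in range(i + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / denom
                    for d in range(len(v[0]))])
    return out

def ring_causal_attention(q, k, v, world_size):
    seq_len, dim = len(q), len(v[0])
    shard = seq_len // world_size
    out = [None] * seq_len
    for rank in range(world_size):
        q_lo = rank * shard
        # Online-softmax state per local query: running max, denominator, output.
        run_max = [-math.inf] * shard
        run_sum = [0.0] * shard
        acc = [[0.0] * dim for _ in range(shard)]
        for step in range(world_size):
            src = (rank - step) % world_size   # owner of the block seen at this step
            if src > rank:
                continue                        # block is entirely in the future
            kv_lo = src * shard
            for i in range(shard):
                for j in range(kv_lo, kv_lo + shard):
                    if j > q_lo + i:
                        break                   # causal mask inside the diagonal block
                    s = score(q[q_lo + i], k[j])
                    new_max = max(run_max[i], s)
                    rescale = math.exp(run_max[i] - new_max)
                    p = math.exp(s - new_max)
                    run_sum[i] = run_sum[i] * rescale + p
                    acc[i] = [a * rescale + p * x for a, x in zip(acc[i], v[j])]
                    run_max[i] = new_max
        for i in range(shard):
            out[q_lo + i] = [a / run_sum[i] for a in acc[i]]
    return out

rng = random.Random(0)
def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

q, k, v = rand_matrix(8, 4), rand_matrix(8, 4), rand_matrix(8, 4)
ring_out = ring_causal_attention(q, k, v, world_size=4)
ref_out = reference_causal_attention(q, k, v)
max_err = max(abs(a - b) for r1, r2 in zip(ring_out, ref_out) for a, b in zip(r1, r2))
```

Blocks arrive out of positional order (the local block first, earlier blocks afterwards), and the rescaling step is what makes the final result independent of that order.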

Capacity planning

Compute activation bytes per rank for every region, then add communication buffers, which may double when ring transfers are double-buffered. A higher context-parallel degree reduces local memory but increases the total KV exchanged. The break-even point shifts with NVLink, InfiniBand, and sequence length.
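A back-of-the-envelope model makes the trade visible. The sketch assumes a full ring in which every K/V block visits every other rank once per layer, 2 bytes per element, and two block-sized communication buffers per rank for double buffering; `kv_width` stands for KV heads times head dimension. The function and its field names are hypothetical, and real implementations may skip or compress transfers.

```python
MIB = 1024 ** 2

def ring_kv_plan(seq_len, kv_width, cp_degree, bytes_per_elem=2):
    """Per-layer K/V block size, buffer size, and ring traffic in bytes."""
    # One block holds K and V for this rank's share of the sequence.
    block = 2 * (seq_len // cp_degree) * kv_width * bytes_per_elem
    sent_per_rank = (cp_degree - 1) * block   # each rank forwards a block per ring step
    return {
        "kv_block_bytes": block,
        "buffer_bytes": 2 * block if cp_degree > 1 else 0,  # assumed double buffering
        "sent_per_rank_bytes": sent_per_rank,
        "sent_total_bytes": cp_degree * sent_per_rank,
    }

for degree in (2, 4, 8):
    plan = ring_kv_plan(32_768, 8_192, degree)
    print(degree, plan["kv_block_bytes"] // MIB, plan["sent_total_bytes"] // MIB)
# 2 512 1024
# 4 256 3072
# 8 128 7168
```

The block each rank stores shrinks with the degree while the total bytes crossing the fabric grow, which is why topology sets the useful upper bound.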

Benchmarking without fooling yourself

Test Megatron SP and full CP as distinct configurations.
Sweep context length while holding total tokens constant.
Profile communication-compute overlap, not only collective totals.
Include packed variable-length sequences and causal masks.

A production failure to design for

A packed batch places document boundaries inside sequence shards, but the distributed attention mask loses one boundary offset. The resulting cross-document attention appears only when a document straddles a rank boundary, so tests that keep each document on one rank pass. Build mask tests with tiny hand-checkable packed sequences.
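A hand-checkable test of that kind might look like the sketch below. It assumes contiguous equal shards and builds the mask tile by tile from global positions, then compares it with a single-device reference. The function names are hypothetical; in a real test the tiled side would come from the distributed implementation under test.

```python
def doc_ids(cu_seqlens):
    """Map each packed token position to its document index."""
    ids = []
    for doc, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        ids.extend([doc] * (end - start))
    return ids

def reference_mask(cu_seqlens):
    """mask[i][j] is True when query i may attend key j (same doc, not future)."""
    ids = doc_ids(cu_seqlens)
    n = len(ids)
    return [[ids[i] == ids[j] and j <= i for j in range(n)] for i in range(n)]

def tiled_mask(cu_seqlens, world_size):
    """Assemble the mask from (query rank, key rank) tiles using global offsets."""
    ids = doc_ids(cu_seqlens)
    n = len(ids)
    shard = n // world_size
    full = [[False] * n for _ in range(n)]
    for q_rank in range(world_size):
        for k_rank in range(q_rank + 1):       # later key shards are never visible
            for qi in range(shard):
                for kj in range(shard):
                    gq = q_rank * shard + qi   # global query position
                    gk = k_rank * shard + kj   # global key position
                    full[gq][gk] = ids[gq] == ids[gk] and gk <= gq
    return full

# Documents of length 3 and 5 packed into 8 tokens; the second spans both ranks.
cu_seqlens = [0, 3, 8]
ref = reference_mask(cu_seqlens)
tiled = tiled_mask(cu_seqlens, world_size=2)
assert tiled == ref
assert not ref[4][2]   # rank 1 token must not see document 0 on rank 0
assert ref[4][3]       # same document across the rank boundary is visible
```

Eight tokens are few enough to verify every mask entry by hand, which is the point: the boundary cases are enumerated, not sampled.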

Treat optimization as a measured loop, not a one-time flag.

The takeaway

Sequence parallelism is a memory win with a communication bill. Say exactly which sequence regions are sharded and who owns the attention context.
