20/20 - Expert Parallelism: Routing Tokens Through a City of Specialists

#expert-parallelism #moe #mixture-of-experts #all-to-all #llm-inference

A mixture-of-experts model may contain hundreds of expert networks while activating only a few for each token. Expert parallelism distributes those specialists across GPUs. Compute stays sparse, but tokens now travel through a network whose balance can decide the entire serving result.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A city routes each case to two specialized clinics instead of sending every citizen through every hospital. Expertise is efficient, but popular clinics form queues and transportation can overwhelm treatment time.

Follow the state and work from left to right.

Description: Start with Token activations, where router selects top-k. The middle stage, All-to-all exchange, run local experts. The final stage, Return activations, shows the observable result: combine expert outputs. The arrows describe dependency order, not necessarily separate services.

What actually happens

The router scores experts for each token and selects top-k destinations. Tokens are packed by destination, exchanged across the expert-parallel group, processed by local expert MLPs, and sent back for weighted combination.

Expert parallelism reduces expert-weight memory per GPU and spreads sparse compute. Its defining collective is all-to-all, unlike the all-reduce patterns common in tensor parallelism. Traffic volume and imbalance depend on routing decisions.

Capacity limits protect memory and execution bounds. Overflowing tokens may be dropped, rerouted, or delayed depending on the model and engine. In inference, dropping changes outputs, so serving systems usually need enough capacity or deterministic fallback.

A worked example

A layer has 64 experts distributed across 8 GPUs, eight experts per GPU, with top-2 routing. A batch of 4,096 tokens creates 8,192 expert assignments. Uniform routing gives about 1,024 assignments per GPU; a hot expert can skew one rank far above that and stall the full collective.

The performance model

Layer time is routing and packing plus all-to-all plus the slowest rank’s grouped expert compute plus return exchange. Average load is insufficient because the most loaded rank gates completion. Larger token batches improve grouped GEMM efficiency but can increase queueing.

Prefill and decode run the same model but expose different bottlenecks and SLOs.

Description: The left panel asks how Expert parallelism changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Combining EP with TP changes process groups and communication order. Some stacks require sequence parallelism when TP and EP coexist. Expert placement can exploit routing affinity, but dynamic placement must preserve weights, cache state, and reproducibility.

The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

Description: The left panel is the baseline, Dense transformer MLP, characterized by every token uses all weights and predictable compute. The right panel applies Expert-parallel MoE, changing the cost profile to top-k experts per token and all-to-all token exchange. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

Large MoE models whose expert weights exceed one GPU
Traffic with reasonably balanced routing
Fast fabrics and efficient grouped GEMM

Where it disappoints

Monitoring average load but not hottest rank
Under-sizing expert capacity
Sending tiny expert batches to inefficient kernels
Combining TP and EP without correct process groups

Production checklist

Measure routing distribution by workload
Size capacity for percentile skew
Use grouped GEMM and efficient token packing
Map EP groups to high-bandwidth topology
Test hot-expert and adversarial prompts

What to measure

Tokens per expert and coefficient of variation
All-to-all bytes and duration
Maximum rank load versus mean
Dropped or rerouted token count
Expert GEMM occupancy

From one GPU to a production service

A single-node EP test uses a uniform fabric. A cluster deployment must place each expert group within a topology that supports its all-to-all pattern. Crossing oversubscribed links makes every MoE layer vulnerable to unrelated traffic.

Router statistics are model behavior and capacity data. Persist distributions by layer, tenant, and workload without logging sensitive tokens. Sudden skew may signal traffic change, attack, or model drift.

Serving multiple MoE checkpoints complicates expert placement. Replicating hot experts can reduce skew but changes memory and routing. Start with static placement, measure, then introduce replication only with a deterministic consistency plan.

Design-review questions

Which rank is hottest at each MoE layer?
Does EP traffic remain within the intended fabric?
How are overflow tokens handled without quality loss?
Can hot experts be replicated or re-placed safely?
What admission limit protects all-to-all buffers?

How it connects to the rest of the series

Tensor parallelism splits each expert; sequence parallelism can support TP plus EP; mixed precision and quantized kernels reduce expert weight traffic; continuous batching shapes the token population entering the router.

From equation to implementation

Let T tokens each choose k experts across E experts. The average assignments per expert are Tk/E, but latency follows the maximum, not the average. Router skew and finite batch size produce variance even with a balanced model. Capacity is commonly expressed as a multiple of average load.

Efficient execution sorts or groups token activations by local expert, runs grouped GEMMs for variable expert batch sizes, then unsorts returned outputs. Packing and permutation kernels can become visible when expert batches are small.

Implementation sketch

scores = router(hidden)
experts, weights = top_k(scores, k)
packed, metadata = group_tokens_by_destination(hidden, experts)
remote = all_to_all(packed, ep_group)
local_outputs = grouped_gemm(local_expert_weights, remote)
returned = all_to_all(local_outputs, ep_group)
output = weighted_combine(unpack(returned, metadata), weights)

Capacity planning

Allocate for percentile expert load plus communication buffers, not perfectly uniform routing. The largest expert shard, temporary packed activations, and two all-to-all buffers often determine peak memory.

Benchmarking without fooling yourself

Replay real prompts because routing is content-dependent.
Report maximum and percentile expert load per layer.
Profile permutation, all-to-all, and grouped GEMM separately.
Test small batches, hot experts, and adversarial routing.

A production failure to design for

A new domain routes heavily to three experts on one rank. Average GPU utilization looks moderate, but that rank gates every layer and all-to-all buffers overflow. Add per-layer hot-rank telemetry and consider expert placement or routing regularization.

Treat optimization as a measured loop, not a one-time flag.

Description: The operating cycle moves from Observe to Place, then Execute and Balance. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

In a mixture-of-experts layer, a router scores each token, selects top-k experts, dispatches token activations to the ranks hosting those experts, runs expert MLPs, and combines weighted outputs. Only selected experts compute each token, but communication and load imbalance become first-order costs.

Expert parallelism replaces dense compute with sparse routing and distributed communication.

Description: Follow the state from Score through Dispatch and Execute to Combine. Each box is an ownership or computation boundary. In particular, token identity and ordering must survive the all-to-all exchange. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Average load can look balanced while one expert forms a hot spot. Capacity factor reserves slots above the average, but excess capacity wastes memory and too little causes drops, reroutes, or queueing. Auxiliary training losses help, yet serving traffic can differ from training and needs live expert-level telemetry.

The realized gain depends on expert balance and communication topology.

Description: The bars compare Dense MLP with Top-k experts on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “More total parameters with bounded active compute.”, remains larger than the risk, “Hot experts create all-to-all and queue bottlenecks.”, under production traffic.

Placement should keep frequently co-selected experts near traffic and distribute hot experts across failure domains. Replication can relieve hot experts but complicates routing consistency and weight updates. The scheduler may combine tokens by expert to create efficient GEMMs while preserving each request’s deadline.

Routing metadata is correctness state, not scheduler bookkeeping.

Description: State advances from Routed to Dispatched, Executed, and finally Combined. The labels below each state identify what becomes true at that boundary. The governing invariant is: Every selected expert contribution is accounted for exactly once or explicitly dropped by policy. Retries and cancellation must preserve the same transition rules.

MoE performance is governed by the busiest expert and slowest exchange.

Description: The four panels are independent review axes: Balance, Topology, Batching, and Resilience. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Dashboard p99 expert load and communication, not only layer averages.

Sparse models require sparse observability at the same routing granularity.

Description: This is a causal chain, not four unrelated symptoms. Traffic shifts triggers Expert fills, which creates Layer stalls. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

Primary references

The takeaway

Expert parallelism makes sparse models deployable by turning model capacity into a routing problem. The fastest expert layer is the one whose network and hottest rank remain under control.

19/20 - Prefill-Decode Disaggregation: Two Worker Pools, One Token Stream 1/21 - Inside the GPU: From SMs to HBM Without the Hand-Waving