Skip to content
20/20 - Expert Parallelism: Routing Tokens Through a City of Specialists

20/20 - Expert Parallelism: Routing Tokens Through a City of Specialists

A mixture-of-experts model may contain hundreds of expert networks while activating only a few for each token. Expert parallelism distributes those specialists across GPUs. Compute stays sparse, but tokens now travel through a network whose balance can decide the entire serving result.

This article starts with intuition, then moves into the mechanisms and production details. You can stop after the worked example and retain the core idea, or continue into the performance model and operational edge cases.

Start with the intuition

A city routes each case to two specialized clinics instead of sending every citizen through every hospital. Expertise is efficient, but popular clinics form queues and transportation can overwhelm treatment time.

MECHANISM FLOWExpert Parallelism: request path01Token activationsRouter selects top-kGroup by destination02All-to-all exchangeRun local expertsGrouped GEMM03Return activationsCombine expert outputsContinue transformerINPUT → TRANSFORM → OUTCOME
Follow the state and work from left to right.

How to read this diagram: Start with Token activations, where router selects top-k. The middle stage, All-to-all exchange, run local experts. The final stage, Return activations, shows the observable result: combine expert outputs. The arrows describe dependency order, not necessarily separate services.

What actually happens

The router scores experts for each token and selects top-k destinations. Tokens are packed by destination, exchanged across the expert-parallel group, processed by local expert MLPs, and sent back for weighted combination.

Expert parallelism reduces expert-weight memory per GPU and spreads sparse compute. Its defining collective is all-to-all, unlike the all-reduce patterns common in tensor parallelism. Traffic volume and imbalance depend on routing decisions.

Capacity limits protect memory and execution bounds. Overflowing tokens may be dropped, rerouted, or delayed depending on the model and engine. In inference, dropping changes outputs, so serving systems usually need enough capacity or deterministic fallback.

A worked example

A layer has 64 experts distributed across 8 GPUs, eight experts per GPU, with top-2 routing. A batch of 4,096 tokens creates 8,192 expert assignments. Uniform routing gives about 1,024 assignments per GPU; a hot expert can skew one rank far above that and stall the full collective.

The performance model

Layer time is routing and packing plus all-to-all plus the slowest rank’s grouped expert compute plus return exchange. Average load is insufficient because the most loaded rank gates completion. Larger token batches improve grouped GEMM efficiency but can increase queueing.

PHASE FITWhere Expert parallelism changes inferencePREFILLMany prompt tokens in parallelHigh arithmetic intensityBuilds large expert-local token groupsDECODEOne new token per iterationWeight and KV bandwidth pressureSmall groups expose all-to-all costPROVE IT WITHExpert skew, TTFT, TPOT, overflowDEPLOYMENT DECISIONPlace and replicate by live skew
Prefill and decode run the same model but expose different bottlenecks and SLOs.

How to read this diagram: The left panel asks how Expert parallelism changes prompt processing and TTFT; the right asks how it changes iterative generation and inter-token latency. The bottom row names the metric that must improve and the deployment choice justified by that evidence. Optimizing the wrong phase can add complexity without changing the user-visible bottleneck.

Expert lens

Combining EP with TP changes process groups and communication order. Some stacks require sequence parallelism when TP and EP coexist. Expert placement can exploit routing affinity, but dynamic placement must preserve weights, cache state, and reproducibility.

TRADE-OFF MAPExpert Parallelism: the tradeoffBASELINEDense transformer MLPEvery token uses all weightsPredictable computeNo expert routing networkHigh active FLOPsVSOPTIMIZEDExpert-parallel MoETop-k experts per tokenAll-to-all token exchangeLoad can be skewedSparse active FLOPsMEASURE BOTH SIDES UNDER THE SAME WORKLOAD
The optimization changes where the system spends compute, memory, bandwidth, or waiting time.

How to read this diagram: The left panel is the baseline, Dense transformer MLP, characterized by every token uses all weights and predictable compute. The right panel applies Expert-parallel MoE, changing the cost profile to top-k experts per token and all-to-all token exchange. Compare both under the same request shape and load; the optimized side is not automatically better for every workload.

Where it wins

  • Large MoE models whose expert weights exceed one GPU
  • Traffic with reasonably balanced routing
  • Fast fabrics and efficient grouped GEMM

Where it disappoints

  • Monitoring average load but not hottest rank
  • Under-sizing expert capacity
  • Sending tiny expert batches to inefficient kernels
  • Combining TP and EP without correct process groups

Production checklist

  • Measure routing distribution by workload
  • Size capacity for percentile skew
  • Use grouped GEMM and efficient token packing
  • Map EP groups to high-bandwidth topology
  • Test hot-expert and adversarial prompts

What to measure

  • Tokens per expert and coefficient of variation
  • All-to-all bytes and duration
  • Maximum rank load versus mean
  • Dropped or rerouted token count
  • Expert GEMM occupancy

From one GPU to a production service

A single-node EP test uses a uniform fabric. A cluster deployment must place each expert group within a topology that supports its all-to-all pattern. Crossing oversubscribed links makes every MoE layer vulnerable to unrelated traffic.

Router statistics are model behavior and capacity data. Persist distributions by layer, tenant, and workload without logging sensitive tokens. Sudden skew may signal traffic change, attack, or model drift.

Serving multiple MoE checkpoints complicates expert placement. Replicating hot experts can reduce skew but changes memory and routing. Start with static placement, measure, then introduce replication only with a deterministic consistency plan.

Design-review questions

  • Which rank is hottest at each MoE layer?
  • Does EP traffic remain within the intended fabric?
  • How are overflow tokens handled without quality loss?
  • Can hot experts be replicated or re-placed safely?
  • What admission limit protects all-to-all buffers?

How it connects to the rest of the series

Tensor parallelism splits each expert; sequence parallelism can support TP plus EP; mixed precision and quantized kernels reduce expert weight traffic; continuous batching shapes the token population entering the router.

From equation to implementation

Let T tokens each choose k experts across E experts. The average assignments per expert are Tk/E, but latency follows the maximum, not the average. Router skew and finite batch size produce variance even with a balanced model. Capacity is commonly expressed as a multiple of average load.

Efficient execution sorts or groups token activations by local expert, runs grouped GEMMs for variable expert batch sizes, then unsorts returned outputs. Packing and permutation kernels can become visible when expert batches are small.

Implementation sketch

scores = router(hidden)
experts, weights = top_k(scores, k)
packed, metadata = group_tokens_by_destination(hidden, experts)
remote = all_to_all(packed, ep_group)
local_outputs = grouped_gemm(local_expert_weights, remote)
returned = all_to_all(local_outputs, ep_group)
output = weighted_combine(unpack(returned, metadata), weights)

Capacity planning

Allocate for percentile expert load plus communication buffers, not perfectly uniform routing. The largest expert shard, temporary packed activations, and two all-to-all buffers often determine peak memory.

Benchmarking without fooling yourself

  • Replay real prompts because routing is content-dependent.
  • Report maximum and percentile expert load per layer.
  • Profile permutation, all-to-all, and grouped GEMM separately.
  • Test small batches, hot experts, and adversarial routing.

A production failure to design for

A new domain routes heavily to three experts on one rank. Average GPU utilization looks moderate, but that rank gates every layer and all-to-all buffers overflow. Add per-layer hot-rank telemetry and consider expert placement or routing regularization.

OPERATING LOOPOperational loop1ObserveToken routes by layerHottest rank2PlaceExperts on fast fabricCapacity headroom3ExecutePack all-to-all GEMMReturn and combine4BalanceDetect hot expertsAdjust placement or policyMEASURE → LEARN → REPEAT
Treat optimization as a measured loop, not a one-time flag.

How to read this diagram: The operating cycle moves from Observe to Place, then Execute and Balance. The return arrow matters: production evidence from the fourth step must change the assumptions and limits in the first, otherwise the optimization gradually drifts away from the workload it serves.

Deeper engineering guide

In a mixture-of-experts layer, a router scores each token, selects top-k experts, dispatches token activations to the ranks hosting those experts, runs expert MLPs, and combines weighted outputs. Only selected experts compute each token, but communication and load imbalance become first-order costs.

One MoE routing roundScoreRouter reads tokenProduce expert logitsApply routing policyDispatchSelect top-k expertsAll-to-all activationsTrack token originExecuteExpert MLP batchesRespect capacityHandle overflowCombineReturn expert outputsWeight and restore orderContinue transformerToken identity and ordering must survive the all-to-all exchange.
Expert parallelism replaces dense compute with sparse routing and distributed communication.

How to read this diagram: Follow the state from Score through Dispatch and Execute to Combine. Each box is an ownership or computation boundary. In particular, token identity and ordering must survive the all-to-all exchange. A real implementation may fuse boxes, but it must preserve their ordering and correctness contract.

Average load can look balanced while one expert forms a hot spot. Capacity factor reserves slots above the average, but excess capacity wastes memory and too little causes drops, reroutes, or queueing. Auxiliary training losses help, yet serving traffic can differ from training and needs live expert-level telemetry.

Sparse compute is paid for with routing varianceDense MLPall weights activeTop-k expertssparse computeImbalance taxHot experts create all-to-all and queue bottlenecks.Model benefitMore total parameters with bounded active compute.
The realized gain depends on expert balance and communication topology.

How to read this diagram: The bars compare Dense MLP with Top-k experts on the article's dominant cost axis. Their lengths are explanatory, not universal benchmark values. The design is worthwhile only when the stated gain, “More total parameters with bounded active compute.”, remains larger than the risk, “Hot experts create all-to-all and queue bottlenecks.”, under production traffic.

Placement should keep frequently co-selected experts near traffic and distribute hot experts across failure domains. Replication can relieve hot experts but complicates routing consistency and weight updates. The scheduler may combine tokens by expert to create efficient GEMMs while preserving each request’s deadline.

Token state through an MoE layerRoutedtop-k chosenDispatchedorigin recordedExecutedexpert output readyCombinedtoken order restoredEvery selected expert contribution is accounted for exactly once or explicitly dropped by policy.
Routing metadata is correctness state, not scheduler bookkeeping.

How to read this diagram: State advances from Routed to Dispatched, Executed, and finally Combined. The labels below each state identify what becomes true at that boundary. The governing invariant is: Every selected expert contribution is accounted for exactly once or explicitly dropped by policy. Retries and cancellation must preserve the same transition rules.

Four MoE serving constraintsBalanceTokens per expertCapacity overflowTopologyAll-to-all localityExpert placementBatchingExpert-local token groupsDeadline preservationResilienceExpert replica healthReroute semanticsDashboard p99 expert load and communication, not only layer averages.
MoE performance is governed by the busiest expert and slowest exchange.

How to read this diagram: The four panels are independent review axes: Balance, Topology, Batching, and Resilience. A design is incomplete when one panel is optimized while another is left implicit. Use the bottom note as the cross-panel operating rule: Dashboard p99 expert load and communication, not only layer averages.

One hot expert throttles the whole layerTraffic shiftsRouter favors expert 7Average looks normalExpert fillsToken queue growsCapacity overflowsLayer stallsAll ranks await combineTail latency spikesControlReplicate or rerouteAlert per expertPer-expert histograms reveal skew hidden by model-level utilization.
Sparse models require sparse observability at the same routing granularity.

How to read this diagram: This is a causal chain, not four unrelated symptoms. Traffic shifts triggers Expert fills, which creates Layer stalls. The green Control box is the intervention that should break the chain before users observe the final failure. The control must be tested under the initiating condition.

Primary references

The takeaway

Expert parallelism makes sparse models deployable by turning model capacity into a routing problem. The fastest expert layer is the one whose network and hottest rank remain under control.