Expert Parallelism Token Routing and Communication Cost

MoE, Compression & Scaling DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Expert Parallelism Token Routing and Communication Cost, Top-k Routing, All-to-All Communication, Expert Capacity Factor, Softmax Gating, Load Balancing Loss, Distributed Systems, Deep Learning Architecture, Linear Algebra, Optimization Theory, High-Performance Computing, Mixture of Experts (MoE), Parallel Computing, Collective Communication, Tensor Parallelism, Sparse Activation.

Implement a simplified expert router for a Mixture of Experts (MoE) model. Given a batch of token embeddings and a set of expert weights, compute each token's top-k routing indices and the simulated communication cost. The cost is defined as the number of tokens that must be sent across devices, assuming each expert resides on a distinct node.
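
One possible reading of the task can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference solution. The problem statement does not say where tokens start out or how ties are broken, so the sketch assumes the following: the "expert weights" are a gating matrix of shape (d_model, num_experts); expert e lives on device e; each token has a home device supplied through a hypothetical `token_device` argument; every token-to-remote-expert dispatch counts as one transfer; and ties go to the lower expert index. The names `route_tokens`, `gate_weights`, and `token_device` are illustrative only.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(
    tokens: List[List[float]],
    gate_weights: List[List[float]],
    k: int,
    token_device: List[int],
) -> Tuple[List[List[int]], int]:
    """Return per-token top-k expert indices and the simulated transfer count.

    Assumptions (not given in the problem statement):
      * gate_weights has shape (d_model, num_experts)
      * expert e is hosted on device e
      * token_device[i] is the device that currently holds token i
      * each dispatch to an expert on another device costs 1
    """
    num_experts = len(gate_weights[0])
    routes: List[List[int]] = []
    cost = 0
    for x, home in zip(tokens, token_device):
        # Gate logits: dot product of the token with each expert's column.
        logits = [
            sum(x[d] * gate_weights[d][e] for d in range(len(x)))
            for e in range(num_experts)
        ]
        probs = softmax(logits)
        # Top-k by gate probability; ties go to the lower expert index.
        top = sorted(range(num_experts), key=lambda e: (-probs[e], e))[:k]
        routes.append(top)
        # Count dispatches that leave the token's home device.
        cost += sum(1 for e in top if e != home)
    return routes, cost


# Toy example: 3 tokens, d_model = 2, 3 experts, k = 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_weights = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
routes, cost = route_tokens(tokens, gate_weights, k=2, token_device=[0, 1, 2])
print(routes)  # [[0, 2], [1, 2], [0, 1]]
print(cost)    # 4
```

In the toy example, tokens 0 and 1 each keep one expert local and send one copy to a remote expert, while token 2 sits on device 2 but is routed to experts 0 and 1, so both of its dispatches are remote. A different placement assumption (for example, all tokens starting on a single device) would change the count but not the routing.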