Sparse MoE Top-K Routing

MoE, Compression & Scaling DS practice problem on Onlearn.

Difficulty: medium.

Topics: Sparse MoE Top-K Routing, Top-K Gating Mechanism, Expert Capacity Constraint, Router Z-Loss Regularization, All-to-All Communication, Token-to-Expert Assignment, Deep Learning Architectures, Distributed Systems, Numerical Optimization, Information Theory, Computational Complexity, Mixture of Experts, Load Balancing Algorithms, Sparse Matrix Operations, Parallel Computing Paradigms, Stochastic Gradient Estimation.

Implement the top-K routing mechanism used in Sparse Mixture of Experts (MoE) models. In MoE architectures such as Kimi K2 (384 experts, 8 active) and DeepSeek V3, each token is routed to only the top K experts by router score, which allows very large total model capacity at a fixed per-token compute cost. Your task is to implement the routing logic that selects the top K experts for each token, computes routing weights via a softmax over the selected experts' scores, and combines the expert outputs using those weights.
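
As a starting point, the three steps in the task (select, weight, combine) can be sketched in NumPy. This is a minimal illustration, not a reference solution. The function name top_k_route, its argument names, and the array shapes are assumptions made for this example. It also assumes every expert's output is already available for every token as a dense array, which a real sparse implementation would avoid by running only the selected experts. Expert capacity limits, auxiliary losses, and cross-device communication are left out.

```python
import numpy as np


def top_k_route(router_logits, expert_outputs, k):
    """Route each token to its top-k experts and combine their outputs.

    Hypothetical signature for illustration only.
      router_logits:  (num_tokens, num_experts) raw router scores
      expert_outputs: (num_tokens, num_experts, d_model) dense outputs
                      (assumed precomputed to keep the sketch short)
    Returns (top_idx, weights, combined).
    """
    # 1. Select: indices of the k highest-scoring experts per token,
    #    ordered from highest to lowest score.
    top_idx = np.argsort(-router_logits, axis=-1, kind="stable")[:, :k]
    top_logits = np.take_along_axis(router_logits, top_idx, axis=-1)

    # 2. Weight: softmax over the selected scores only. Subtracting the
    #    row maximum first keeps the exponentials numerically stable.
    shifted = top_logits - top_logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    weights = exp / exp.sum(axis=-1, keepdims=True)

    # 3. Combine: gather the selected experts' outputs, shape
    #    (num_tokens, k, d_model), and take the weighted sum over k.
    selected = np.take_along_axis(expert_outputs, top_idx[:, :, None], axis=1)
    combined = (weights[:, :, None] * selected).sum(axis=1)

    return top_idx, weights, combined
```

For a single token with router scores [1, 3, 2, 0] and k = 2, the sketch selects experts 1 and 2 and assigns them weights of about 0.731 and 0.269, which sum to 1 because the softmax is taken over the selected scores rather than over all experts.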