Mixture of Experts Load Balancing Loss

MoE, Compression & Scaling DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding and Implementing MoE Auxiliary Load Balancing Loss, Expert Capacity, Softmax Routing, Load Balancing Penalty, Auxiliary Loss, Expert Routing Distribution, Deep Learning, Optimization, Linear Algebra, Distributed Computing, Neural Network Architecture, Sparse Activation, Routing Mechanisms, Gradient Descent, Loss Functions, Tensor Operations.

Implement a function load_balancing_loss(router_probs, expert_indices, num_experts, loss_coef) that calculates the auxiliary load balancing loss for a Mixture of Experts layer. The loss is the dot product of two per-expert vectors, the mean routing probability given to each expert and the fraction of tokens assigned to each expert, multiplied by the number of experts and scaled by loss_coef.
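
The definition above can be sketched in NumPy as follows. This is a minimal sketch, not a reference solution, and it rests on assumptions the problem statement does not spell out: router_probs holds softmax outputs whose last axis has length num_experts, expert_indices holds integer expert ids (one per token for top-1 routing, or several per token for top-k), and the "fraction of tokens" is computed over all assignments in expert_indices.

```python
import numpy as np


def load_balancing_loss(router_probs, expert_indices, num_experts, loss_coef):
    """Auxiliary load balancing loss: loss_coef * num_experts * sum_i f_i * P_i."""
    # Assumption: the last axis of router_probs has length num_experts;
    # any leading batch/sequence axes are flattened into one token axis.
    probs = np.asarray(router_probs, dtype=np.float64).reshape(-1, num_experts)

    # Assumption: expert_indices contains integer ids in [0, num_experts).
    indices = np.asarray(expert_indices, dtype=np.int64).reshape(-1)

    # P_i: mean routing probability given to expert i across all tokens.
    mean_prob = probs.mean(axis=0)

    # f_i: fraction of assignments that were routed to expert i.
    counts = np.bincount(indices, minlength=num_experts)
    frac_tokens = counts / max(indices.size, 1)

    # Dot product of the two per-expert vectors, scaled as described above.
    return float(loss_coef * num_experts * np.dot(frac_tokens, mean_prob))


# Illustrative inputs (made up for this example): 4 tokens, 2 experts.
skewed_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
skewed_loss = load_balancing_loss(skewed_probs, [0, 0, 1, 0], 2, 0.01)

# Perfectly balanced routing: uniform probabilities, equal assignment counts.
uniform_probs = [[0.5, 0.5]] * 4
uniform_loss = load_balancing_loss(uniform_probs, [0, 1, 0, 1], 2, 0.01)
```

With perfectly balanced routing, both vectors are uniform at 1 / num_experts, so the dot product is 1 / num_experts and the loss reduces to loss_coef. The skewed example, where expert 0 receives both more probability mass and more tokens, yields a larger value.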