A Sparse Mixture of Experts Layer

MoE, Compression & Scaling DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding and Implementing a Sparse Mixture of Experts Layer, Softmax Gating, Top-k Selection, Expert Parallelism, Load Balancing Loss, Sparse Activation, Deep Learning Architectures, Distributed Computing, Numerical Linear Algebra, Probabilistic Modeling, Computational Complexity, Transformer Mechanisms, Sparse Model Scaling, Tensor Operations, Gating and Routing Algorithms, Parallel Processing Paradigms.

Implement a Mixture of Experts (MoE) layer using softmax gating and top-k routing. Given an input tensor, a set of expert weight matrices, a gating weight matrix, the number of experts, and the value of k, compute the final MoE output: select the top k experts for each token, apply their transformations to that token, and sum the results weighted by the normalized gating probabilities.
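One possible reading of the task can be sketched in NumPy as follows. This is a minimal illustration, not a reference solution. The function name `moe_layer`, the argument names, and the tensor shapes are assumptions, since the statement does not fix them: the input is taken to be `(..., d_in)`, the expert weights `(n_experts, d_in, d_out)`, and the gating weights `(d_in, n_experts)`. It also assumes each expert is a plain linear map without bias, and that "normalized" means the top-k softmax probabilities are rescaled to sum to 1 for each token.

```python
import numpy as np

def moe_layer(x, We, Wg, n_experts, top_k):
    """Sparse MoE forward pass (hypothetical signature and shapes).

    x:  (..., d_in) input tokens
    We: (n_experts, d_in, d_out) expert weight matrices (assumed linear, no bias)
    Wg: (d_in, n_experts) gating weight matrix
    """
    x = np.asarray(x, dtype=float)
    We = np.asarray(We, dtype=float)
    Wg = np.asarray(Wg, dtype=float)

    # Flatten all leading dimensions so every row is one token.
    tokens = x.reshape(-1, x.shape[-1])

    # Softmax gating, with the row maximum subtracted for numerical stability.
    logits = tokens @ Wg
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    # Indices and probabilities of the top_k experts for each token.
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]
    top_p = np.take_along_axis(probs, top_idx, axis=-1)

    # Assumption: renormalize the selected probabilities to sum to 1 per token.
    top_p = top_p / top_p.sum(axis=-1, keepdims=True)

    out = np.zeros((tokens.shape[0], We.shape[-1]))
    for e in range(n_experts):
        # Rows routed to expert e, and the top-k slot in which e appears.
        tok, slot = np.nonzero(top_idx == e)
        if tok.size == 0:
            continue  # sparse activation: unused experts do no work
        out[tok] += top_p[tok, slot][:, None] * (tokens[tok] @ We[e])

    # Restore the leading dimensions of the input.
    return out.reshape(*x.shape[:-1], We.shape[-1])
```

Looping over experts rather than tokens lets each expert process all of its assigned tokens in one matrix product, which is the grouping that expert parallelism relies on. A quick sanity check for this sketch: when every expert shares the same weight matrix, the output reduces to `x @ W`, because the renormalized gate weights sum to 1.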