Sigmoid MoE Router with Bias Correction
MoE, Compression & Scaling DS practice problem on Onlearn.
Difficulty: hard.
Topics: Understanding Sparse Mixture of Experts Routing Dynamics, Sigmoid Gating, Top-K Selection, Learnable Bias Parameters, Expert Routing Logic, Weight Initialization Strategies, Deep Learning, Linear Algebra, Optimization Theory, Neural Network Architectures, Probability Distributions, Sparse Mixture of Experts, Gating Mechanisms, Tensor Operations, Gradient-based Learning, Load Balancing.
Implement a sigmoid-based MoE router class. The router takes an input tensor of shape (batch_size, seq_len, hidden_dim) and projects it with a weight matrix of shape (hidden_dim, num_experts), producing one logit per expert for every token. Add a learnable bias vector of shape (num_experts,) to the projection, then apply the sigmoid function elementwise to the result. Finally, for each token, return the indices of the top-k experts and their corresponding gating scores.
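One possible reading of the task, sketched in NumPy rather than a deep learning framework. This is a minimal illustration and not a reference solution. The class name `SigmoidRouter`, the `top_k` and `seed` arguments, and the small Gaussian weight initialization are assumptions not given in the problem statement. The plain arrays stand in for learnable parameters, and the bias is assumed to be added to the logits before the sigmoid.

```python
import numpy as np


class SigmoidRouter:
    """Hypothetical sigmoid-gated top-k router (forward pass only)."""

    def __init__(self, hidden_dim, num_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: small Gaussian init for the projection, zeros for the bias.
        self.weight = rng.normal(0.0, 0.02, size=(hidden_dim, num_experts))
        self.bias = np.zeros(num_experts)
        self.top_k = top_k

    def __call__(self, x):
        # x: (batch_size, seq_len, hidden_dim)
        # logits: (batch_size, seq_len, num_experts)
        logits = x @ self.weight + self.bias
        # Elementwise sigmoid: each expert is scored independently in (0, 1).
        scores = 1.0 / (1.0 + np.exp(-logits))
        # Sort experts by descending score and keep the first top_k per token.
        order = np.argsort(-scores, axis=-1)
        top_idx = order[..., : self.top_k]
        top_scores = np.take_along_axis(scores, top_idx, axis=-1)
        return top_idx, top_scores


# Usage: 2 sequences of 3 tokens, hidden size 8, 4 experts, 2 experts per token.
router = SigmoidRouter(hidden_dim=8, num_experts=4, top_k=2)
x = np.random.default_rng(1).normal(size=(2, 3, 8))
indices, gates = router(x)
print(indices.shape, gates.shape)
```

Unlike softmax gating, the sigmoid scores of the selected experts do not sum to 1 for a token. Whether to renormalize them after selection is a design choice the problem statement leaves open, so the sketch returns them unnormalized.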