Computing Optimal Model Size with Scaling Laws

MoE, Compression & Scaling DS practice problem on Onlearn.

Difficulty: medium.

Topics: Understanding Neural Scaling Laws and Compute-Optimal Model Sizing, Chinchilla Scaling Laws, Floating Point Operations (FLOPs) Estimation, Optimal Parameter Allocation, Training Data Constraints, Model Capacity Scaling, Computational Complexity, Deep Learning Theory, Statistical Modeling, Optimization Theory, Information Theory, Power Law Scaling, Compute-Optimal Training, Parameter-to-Token Ratios, Loss Minimization, FLOPs Budgeting.

Implement a function 'get_optimal_params' that calculates the compute-optimal number of parameters for a Large Language Model, given a total compute budget (in FLOPs) and an efficiency constant. Under the Chinchilla scaling laws, training compute is approximated as C ≈ 6 N D, where N is the number of parameters and D is the number of training tokens. Assuming the compute-optimal ratio D = 20 N, solve for the N that exactly uses up the total compute budget C.
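One possible approach, as a minimal sketch: substituting D = 20 N into C ≈ 6 N D gives C ≈ 120 N^2, so N = sqrt(C / 120). The problem statement does not say how the efficiency constant enters the formula; the sketch below assumes it is the constant k in C ≈ k N D (defaulting to 6). The parameter names 'k' and 'tokens_per_param' are illustrative and not part of the original problem.

```python
import math


def get_optimal_params(compute_budget: float,
                       k: float = 6.0,
                       tokens_per_param: float = 20.0) -> float:
    """Return the compute-optimal parameter count N for a FLOPs budget.

    Assumes C = k * N * D with D = tokens_per_param * N, which gives
    C = k * tokens_per_param * N**2 and N = sqrt(C / (k * tokens_per_param)).
    Treating k as the "efficiency constant" is an assumption of this sketch.
    """
    if compute_budget <= 0 or k <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_budget, k and tokens_per_param must be positive")
    return math.sqrt(compute_budget / (k * tokens_per_param))


if __name__ == "__main__":
    budget = 1.2e20  # total training compute in FLOPs
    n_opt = get_optimal_params(budget)
    d_opt = 20.0 * n_opt  # matching token count under D = 20 N
    print(f"N = {n_opt:.3e} parameters, D = {d_opt:.3e} tokens")
```

With the defaults, a budget of 1.2e20 FLOPs corresponds to about 1e9 parameters and 2e10 tokens, since 6 * 1e9 * 2e10 = 1.2e20. The function returns a float; rounding to an integer parameter count is left to the caller.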