MMLU Log-Probability Scoring

Text Generation & NLP Evaluation DS practice problem on Onlearn.

Difficulty: medium.

Topics: MMLU Log-Probability Scoring, Log-Probability Normalization, Kullback-Leibler Divergence, Beam Search Constraints, Posterior Predictive Distribution, Big-O Notation, Probabilistic Modeling, Information Theory, Natural Language Processing, Statistical Inference, Computational Complexity, Maximum Likelihood Estimation, Entropy and Divergence, Language Model Decoding, Bayesian Parameter Estimation, Asymptotic Analysis.

Implement a function that evaluates multiple-choice question answering using log-probability scoring, as commonly used in MMLU (Massive Multitask Language Understanding) benchmark evaluation.

Given a set of multiple-choice questions, where a language model has assigned a log probability to each answer choice of each question, your task is to:

1. Determine the predicted answer for each question (the choice with the highest log probability).
2. Calculate the overall accuracy (the proportion of questions where the prediction matches the correct answer).
3. Compute the average probability assigned to the correct answer across all questions (convert log probabilities to probabilities first).

The function should take:

log_probs: A list of lists, where each inner list contains the log probabilities of the answer choices for one question.
correct_answers: A list of integers giving the correct answer index (0-indexed) for each question.

Return a dictionary with keys 'accuracy', 'predictions', and 'avg_correct_prob'.

Note: When converting log probabilities to probabilities, ensure numerical stability in your implementation.
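One possible solution is sketched below; it is not a reference implementation. Several things in it are assumptions rather than part of the problem statement: the function name mmlu_log_prob_scoring, the underscore spelling of the parameter names and of the 'avg_correct_prob' key, the choice to break ties in favor of the lowest index, and the choice to return 0.0 for empty input. The sketch also assumes that "numerical stability" means renormalizing each question's scores with the log-sum-exp trick (subtract the maximum before exponentiating). If the inputs are already normalized log probabilities, this renormalization leaves them unchanged up to floating-point error; if they are raw scores that do not sum to 1 across choices, it changes the reported probabilities, so check which convention the grader expects.

```python
import math
from typing import Dict, List, Union


def mmlu_log_prob_scoring(
    log_probs: List[List[float]],
    correct_answers: List[int],
) -> Dict[str, Union[float, List[int]]]:
    """Score multiple-choice questions from per-choice log probabilities."""
    if len(log_probs) != len(correct_answers):
        raise ValueError("log_probs and correct_answers must have the same length")

    predictions: List[int] = []
    correct_probs: List[float] = []

    for choices, answer in zip(log_probs, correct_answers):
        # Predicted answer: index of the highest log probability.
        # max() returns the first maximal index, so ties go to the lowest index.
        best = max(range(len(choices)), key=choices.__getitem__)
        predictions.append(best)

        # Log-sum-exp with the maximum subtracted, so every exponent is <= 0
        # and exp() cannot overflow (assumed meaning of "numerical stability").
        max_lp = choices[best]
        log_norm = max_lp + math.log(sum(math.exp(lp - max_lp) for lp in choices))

        # Probability of the correct choice after renormalizing over choices.
        correct_probs.append(math.exp(choices[answer] - log_norm))

    n = len(log_probs)
    if n == 0:
        # Assumption: empty input yields zeros instead of raising.
        return {"accuracy": 0.0, "predictions": [], "avg_correct_prob": 0.0}

    n_correct = sum(p == a for p, a in zip(predictions, correct_answers))
    return {
        "accuracy": n_correct / n,
        "predictions": predictions,
        "avg_correct_prob": sum(correct_probs) / n,
    }


# Usage: two questions with four choices each (illustrative values).
example = mmlu_log_prob_scoring(
    [
        [math.log(0.7), math.log(0.1), math.log(0.1), math.log(0.1)],
        [math.log(0.2), math.log(0.5), math.log(0.2), math.log(0.1)],
    ],
    [0, 2],
)
# example["predictions"] is [0, 1] and example["accuracy"] is 0.5;
# example["avg_correct_prob"] is approximately 0.45, i.e. (0.7 + 0.2) / 2.
```

Subtracting the per-question maximum matters for very negative inputs: with log probabilities such as -1000 and -1001, a direct exp() underflows to 0.0 for both choices, while the shifted form still recovers their relative probabilities.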