MMLU Letter-Matching Evaluation

Text Generation & NLP Evaluation DS practice problem on Onlearn.

Difficulty: medium.

Topics: MMLU Letter-Matching Evaluation, Multi-Head Self-Attention, Dilated Spatial Pooling, Proximal Policy Optimization, Conjugate Prior Distribution, Nesterov Momentum, Natural Language Processing, Computer Vision, Reinforcement Learning, Statistical Learning Theory, Optimization Algorithms, Transformer Architectures, Convolutional Neural Networks, Policy Gradient Methods, Bayesian Inference, Stochastic Gradient Descent.

Implement a function to evaluate language model responses on MMLU (Massive Multitask Language Understanding) benchmark questions using letter matching. In MMLU evaluation, models are given multiple choice questions with options A, B, C, or D. The model generates a text response, and we need to extract the predicted answer letter and compare it against the ground truth.

Given:
- model outputs: A list of strings containing the model's generated responses
- ground truth: A list of correct answer letters (A, B, C, or D)
- subjects: A list of subject/category names for each question

Your function should:
1. Extract the predicted letter from each model output (handle various formats like 'A', 'a', 'A.', '(A)', 'A)', or phrases containing the answer)
2. Compare predictions against ground truth
3. Track valid vs invalid responses (outputs where no letter can be extracted)
4. Calculate accuracy metrics overall and per subject

Return a dictionary containing:
- 'overall accuracy': Proportion of correct predictions out of all questions
- 'subject accuracy': Dictionary mapping each subject to its accuracy
- 'valid response rate': Proportion of responses where a valid letter was extracted
- 'total correct': Number of correct predictions
- 'total questions': Total number of questions

Note: Invalid responses (where no letter can be extracted) count as incorrect for accuracy calculation.
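One possible approach to the requirements above is sketched below in Python. It is not a reference solution. The names evaluate_mmlu and extract_letter, the snake_case argument names and dictionary keys (for example 'overall_accuracy' for the 'overall accuracy' entry), the regular expressions and their priority order, and the choice to return 0.0 rates for empty input are all assumptions made for illustration; the problem statement does not fix any of them.

```python
import re
from collections import defaultdict

# Heuristic patterns, tried in order from most to least explicit.
# The patterns and their ordering are assumptions, not part of the problem.
_PATTERNS = [
    # Whole response is just a letter: "A", "a", "A.", "(A)", "A)"
    re.compile(r"^\s*\(?([A-D])\)?[.:]?\s*$", re.IGNORECASE),
    # Phrases such as "The answer is B", "Answer: (d)", "answer is option C"
    re.compile(
        r"answer\s*(?:is|:)?\s*(?:option\s*)?\(?([A-D])\)?(?![A-Za-z])",
        re.IGNORECASE,
    ),
    # Response starts with a delimited letter: "B) Paris", "(C) ...", "a. ..."
    re.compile(r"^\s*\(?([A-D])[.):]", re.IGNORECASE),
    # Last resort: first standalone uppercase letter, e.g. "I think C is right"
    re.compile(r"(?<![A-Za-z])([A-D])(?![A-Za-z])"),
]


def extract_letter(output):
    """Return the predicted letter in upper case, or None if none is found."""
    for pattern in _PATTERNS:
        match = pattern.search(output)
        if match:
            return match.group(1).upper()
    return None


def evaluate_mmlu(model_outputs, ground_truth, subjects):
    """Score responses by letter matching; invalid responses count as wrong."""
    if not (len(model_outputs) == len(ground_truth) == len(subjects)):
        raise ValueError("all three input lists must have the same length")

    total = len(model_outputs)
    correct = 0
    valid = 0
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, count]

    for output, truth, subject in zip(model_outputs, ground_truth, subjects):
        predicted = extract_letter(output)
        if predicted is not None:
            valid += 1
        is_correct = predicted is not None and predicted == truth.strip().upper()
        correct += int(is_correct)
        per_subject[subject][0] += int(is_correct)
        per_subject[subject][1] += 1

    return {
        # Assumption: rates are reported as 0.0 when there are no questions.
        "overall_accuracy": correct / total if total else 0.0,
        "subject_accuracy": {s: c / n for s, (c, n) in per_subject.items()},
        "valid_response_rate": valid / total if total else 0.0,
        "total_correct": correct,
        "total_questions": total,
    }


if __name__ == "__main__":
    print(evaluate_mmlu(
        ["A", "(b)", "The answer is C", "I don't know", "D) Paris"],
        ["A", "B", "D", "A", "D"],
        ["math", "math", "history", "history", "history"],
    ))
```

Letter extraction of this kind is inherently heuristic. With these patterns, a response such as "A good question" would be read as the answer A by the last-resort pattern, so stricter or looser rules may be appropriate depending on how the model is prompted.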