Reward Model Validation Accuracy

RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.

Difficulty: easy.

Topics: Reward Model Validation Accuracy, Bradley-Terry Model, Cross-Entropy Loss, Validation Split, Expected Calibration Error, Log-Likelihood Ratio, Reinforcement Learning, Statistical Learning Theory, Natural Language Processing, Optimization Theory, Evaluation Metrics, Preference Modeling, Supervised Fine-Tuning, Generalization Analysis, Stochastic Gradient Descent, Calibration Techniques.

In Reinforcement Learning from Human Feedback (RLHF), a reward model is trained on pairwise human preference data to assign scalar scores to model responses. Before using the reward model in the RL loop, it is critical to evaluate how well it ranks responses on held-out preference pairs.

Implement a function reward_model_validation that takes arrays of reward scores assigned to chosen (preferred) and rejected responses, along with a list of margin thresholds, and returns a dictionary of validation metrics.

The function should compute the following metrics:

    accuracy: the fraction of preference pairs where the chosen response received a strictly higher reward score than the rejected response.
    mean_margin: the average difference between chosen and rejected scores (chosen minus rejected) across all pairs.
    concordance: similar to accuracy, but ties (where chosen and rejected scores are equal) count as half correct rather than incorrect.
    margin_accuracy: a dictionary mapping each given threshold to the fraction of pairs where the reward margin (chosen minus rejected) is greater than or equal to that threshold.

All returned float values should be rounded to 4 decimal places.

Args:
    chosen_scores: 1D numpy array of shape (n,) with reward scores for chosen responses
    rejected_scores: 1D numpy array of shape (n,) with reward scores for rejected responses
    margin_thresholds: list of float thresholds for margin-based accuracy

Returns:
    A dictionary with keys 'accuracy', 'mean_margin', 'concordance', and 'margin_accuracy'.
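The specification above can be sketched as follows. This is one possible solution under stated assumptions, not a reference answer: it assumes the identifiers are written with underscores (reward_model_validation, chosen_scores, rejected_scores, margin_thresholds, and the keys 'mean_margin' and 'margin_accuracy'), uses Python's built-in round for the 4-decimal rounding, and raises ValueError on empty or mismatched inputs, a case the problem statement does not cover.

```python
import numpy as np


def reward_model_validation(chosen_scores, rejected_scores, margin_thresholds):
    """Compute validation metrics for a reward model on held-out preference pairs."""
    chosen = np.asarray(chosen_scores, dtype=float)
    rejected = np.asarray(rejected_scores, dtype=float)

    # Assumption: the problem does not say what to do with malformed input,
    # so this sketch rejects it explicitly.
    if chosen.ndim != 1 or chosen.shape != rejected.shape:
        raise ValueError("expected two 1D arrays of equal length")
    if chosen.size == 0:
        raise ValueError("need at least one preference pair")

    # Reward margin per pair: chosen minus rejected.
    margins = chosen - rejected
    wins = margins > 0    # chosen scored strictly higher
    ties = margins == 0   # chosen and rejected scored equally

    # Accuracy counts ties as incorrect; concordance counts them as half correct.
    accuracy = float(np.mean(wins))
    concordance = float(np.mean(wins + 0.5 * ties))
    mean_margin = float(np.mean(margins))

    # Fraction of pairs whose margin is at least each threshold.
    margin_accuracy = {
        t: round(float(np.mean(margins >= t)), 4) for t in margin_thresholds
    }

    return {
        "accuracy": round(accuracy, 4),
        "mean_margin": round(mean_margin, 4),
        "concordance": round(concordance, 4),
        "margin_accuracy": margin_accuracy,
    }


# Small worked example: margins are [1.0, 0.0, -1.0, 2.5].
result = reward_model_validation(
    np.array([2.0, 1.0, 0.5, 3.0]),
    np.array([1.0, 1.0, 1.5, 0.5]),
    [0.0, 1.0, 2.0],
)
print(result)
# {'accuracy': 0.5, 'mean_margin': 0.625, 'concordance': 0.625,
#  'margin_accuracy': {0.0: 0.75, 1.0: 0.5, 2.0: 0.25}}
```

In the example, the single tied pair separates the two ranking metrics: accuracy treats it as a miss (2 of 4 correct), while concordance gives it half credit (2.5 of 4). The threshold 0.0 also admits the tie, because margin accuracy uses a greater-than-or-equal comparison.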