Reward Model Loss from Pairwise Human Preferences
RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.
Difficulty: medium.
Topics: Reward Model Loss from Pairwise Human Preferences, Bradley-Terry Model, Log-Sigmoid Activation, Cross-Entropy Loss, Monte Carlo Sampling, KL-Divergence Regularization, Reinforcement Learning, Supervised Learning, Optimization Theory, Probability and Statistics, Natural Language Processing, Preference Modeling, Stochastic Gradient Descent, Bayesian Inference, Sequence Modeling, Loss Function Design.
In Reinforcement Learning from Human Feedback (RLHF), a reward model is trained to predict human preferences between pairs of responses. Given a dataset of preference pairs in which humans have indicated which response they prefer (chosen) over another (rejected), the reward model learns to assign higher scalar scores to the preferred responses.

Implement a function that computes the reward model training loss and accuracy given arrays of reward scores for chosen and rejected responses.

The function should:
- Accept two lists of equal length: reward scores for the chosen (preferred) responses and reward scores for the rejected (non-preferred) responses.
- Accept an optional margin parameter (default 0.0) that encourages a minimum gap between chosen and rejected scores.
- Return a dictionary with two keys:
  - 'loss': the average of -log(sigmoid(chosen - rejected - margin)) over all pairs, rounded to 4 decimal places.
  - 'accuracy': the fraction of pairs in which the chosen reward strictly exceeds the rejected reward, rounded to 4 decimal places.

Your implementation should be numerically stable for both very large and very small score differences.
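
One possible way to meet these requirements is sketched below in plain Python. It is an illustration rather than a reference solution. The names reward_model_loss and _neg_log_sigmoid are hypothetical, and raising ValueError on empty or mismatched inputs is an assumption, since the problem statement does not say how such inputs should be handled. The sketch relies on the identity -log(sigmoid(x)) = max(-x, 0) + log(1 + exp(-|x|)), which never exponentiates a positive number and so avoids overflow for large |x|.

```python
import math


def _neg_log_sigmoid(x):
    """Return -log(sigmoid(x)) without overflowing for large |x|.

    Uses -log(sigmoid(x)) = max(-x, 0) + log1p(exp(-|x|)); the argument
    passed to exp is always <= 0, so exp cannot overflow.
    """
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def reward_model_loss(chosen_rewards, rejected_rewards, margin=0.0):
    """Pairwise (Bradley-Terry style) reward model loss and accuracy.

    Assumption: empty or length-mismatched inputs raise ValueError,
    because the problem statement leaves this case unspecified.
    """
    if len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("chosen and rejected must have the same length")
    n = len(chosen_rewards)
    if n == 0:
        raise ValueError("at least one preference pair is required")

    total_loss = 0.0
    correct = 0
    for chosen, rejected in zip(chosen_rewards, rejected_rewards):
        # The margin shifts the difference used by the loss only.
        total_loss += _neg_log_sigmoid(chosen - rejected - margin)
        # Accuracy compares the raw scores with a strict inequality.
        if chosen > rejected:
            correct += 1

    return {
        "loss": round(total_loss / n, 4),
        "accuracy": round(correct / n, 4),
    }


print(reward_model_loss([2.0, 1.0], [1.0, 2.0]))
# {'loss': 0.8133, 'accuracy': 0.5}
```

In this sketch the margin affects only the loss, while accuracy is computed from the raw scores, which matches the wording of the two return values above. A tied pair (chosen equal to rejected) counts as incorrect because the comparison is strict.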