Fine-Tune Model Weights with RLHF Policy Gradient

RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.

Difficulty: medium.

Topics: Fine-Tune Model Weights with RLHF Policy Gradient, Proximal Policy Optimization, Kullback-Leibler Divergence, Bradley-Terry Model, Advantage Estimation, Entropy Regularization, Reinforcement Learning, Natural Language Processing, Optimization Theory, Probabilistic Graphical Models, Statistical Learning Theory, Policy Gradient Methods, Reward Modeling, Transformer Architectures, Stochastic Approximation, Preference Ranking.

Implement a function that performs a single RLHF (Reinforcement Learning from Human Feedback) weight update using the policy gradient method.

Given:

weights: a 1D numpy array of current model weights
rewards: a 1D numpy array of rewards from a reward model, one per sample
policy_log_probs: a 1D numpy array of log probabilities from the current policy
ref_log_probs: a 1D numpy array of log probabilities from the reference model
log_prob_grads: a 2D numpy array of shape (batch_size, num_weights), where each row is the gradient of that sample's log probability with respect to the weights
beta: the KL penalty coefficient
lr: the learning rate

In RLHF, the goal is to update the model weights so that the policy maximizes reward from human feedback while staying close to a reference model. The KL divergence between the policy and the reference can be estimated per sample as the difference between the policy log probability and the reference log probability. This KL estimate, scaled by beta, is subtracted from the reward to produce an adjusted reward signal. The policy gradient is then computed by weighting each sample's gradient by its adjusted reward, and the weights are updated via gradient ascent.

Return the updated weights as a 1D numpy array.
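The update described above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference solution: the function name rlhf_policy_gradient_update and the underscored parameter names are assumptions, and the statement does not say whether the per-sample gradients are summed or averaged, so the sketch assumes a mean over the batch.

```python
import numpy as np


def rlhf_policy_gradient_update(weights, rewards, policy_log_probs,
                                ref_log_probs, log_prob_grads, beta, lr):
    """One gradient-ascent step on KL-penalized rewards (illustrative sketch)."""
    weights = np.asarray(weights, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    policy_log_probs = np.asarray(policy_log_probs, dtype=float)
    ref_log_probs = np.asarray(ref_log_probs, dtype=float)
    log_prob_grads = np.asarray(log_prob_grads, dtype=float)

    # Per-sample KL estimate: log pi(y|x) - log pi_ref(y|x).
    kl_estimate = policy_log_probs - ref_log_probs

    # Penalize the reward for drifting away from the reference model.
    adjusted_rewards = rewards - beta * kl_estimate

    # Weight each sample's log-prob gradient by its adjusted reward.
    # Assumption: average over the batch (the statement does not specify).
    policy_grad = (adjusted_rewards[:, None] * log_prob_grads).mean(axis=0)

    # Gradient ascent: move the weights in the direction that raises reward.
    return weights + lr * policy_grad


# Small usage example with made-up numbers.
updated = rlhf_policy_gradient_update(
    weights=np.array([0.5, -0.5]),
    rewards=np.array([1.0, 2.0]),
    policy_log_probs=np.array([-1.0, -2.0]),
    ref_log_probs=np.array([-1.5, -1.0]),
    log_prob_grads=np.array([[1.0, 0.0], [0.0, 1.0]]),
    beta=0.1,
    lr=0.1,
)
print(updated)  # approximately [0.5475, -0.395]
```

In the example, the KL estimates are [0.5, -1.0], so the adjusted rewards are [0.95, 2.1]. The second sample is less likely under the policy than under the reference, so its penalty is negative and its adjusted reward ends up above the raw reward. If the intended solution sums rather than averages over the batch, replacing .mean(axis=0) with .sum(axis=0) is the only change needed.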