Adaptive KL-Penalized Reward Shaping for RLHF

RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.

Difficulty: medium.

Topics: Adaptive KL-Penalized Reward Shaping for RLHF, Kullback-Leibler Penalty, Proximal Policy Optimization, Bradley-Terry Model, Lagrangian Multipliers, Reward Shaping, Reinforcement Learning, Information Theory, Optimization Theory, Natural Language Processing, Statistical Learning Theory, Policy Gradient Methods, Divergence Measures, Constrained Optimization, Language Model Alignment, Preference Modeling.

In Reinforcement Learning from Human Feedback (RLHF), a key challenge is preventing the policy model from diverging too far from the reference model while maximizing reward. KL-penalized reward shaping modifies the raw reward signal by subtracting a penalty proportional to the KL divergence between the policy and reference distributions. In practice, systems often use an adaptive KL coefficient (beta) that automatically adjusts based on whether the observed KL divergence is within a desired target range.

Implement a function adaptive_kl_penalized_reward that processes multiple responses generated by a policy model. For each response, you are given:

- A scalar reward from a reward model
- Per-token log probabilities under the policy model
- Per-token log probabilities under the reference model

Your function should:

1. Compute the per-response KL divergence between policy and reference (the sum over tokens of the policy log probability minus the reference log probability).
2. Compute the shaped reward for each response by subtracting the KL term, scaled by beta, from the original reward.
3. Compute the mean KL across all responses.
4. Adaptively update beta:
   - If the mean KL exceeds 1.5 times the KL target, multiply beta by the beta update factor.
   - If the mean KL falls below the KL target divided by 1.5, divide beta by the beta update factor.
   - Otherwise, keep beta unchanged.

Return a dictionary containing the per-response KL values, the mean KL, the shaped rewards, and the updated beta. All values should be rounded to 4 decimal places.
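The steps described above can be sketched in plain Python as follows. This is one possible reading of the problem, not a reference solution. The parameter names and their order, the dictionary keys (kl_values, mean_kl, shaped_rewards, updated_beta), and the choice to raise an error on empty input are all assumptions, since the statement does not fix them. The sketch also assumes that the shaped rewards use the current beta, and that beta is updated only afterwards.

```python
from typing import Dict, List, Union


def adaptive_kl_penalized_reward(
    rewards: List[float],
    policy_logprobs: List[List[float]],
    ref_logprobs: List[List[float]],
    beta: float,
    kl_target: float,
    beta_update_factor: float,
) -> Dict[str, Union[float, List[float]]]:
    """Sketch of adaptive KL-penalized reward shaping (names are assumptions)."""
    if not rewards:
        # Assumption: the statement does not say what to do with no responses.
        raise ValueError("at least one response is required")

    # Step 1: per-response KL estimate = sum over tokens of
    # (policy log-prob - reference log-prob).
    kl_values = [
        sum(p - r for p, r in zip(pol, ref))
        for pol, ref in zip(policy_logprobs, ref_logprobs)
    ]

    # Step 2: shaped reward = reward - beta * KL, using the current beta.
    shaped = [rew - beta * kl for rew, kl in zip(rewards, kl_values)]

    # Step 3: mean KL across responses.
    mean_kl = sum(kl_values) / len(kl_values)

    # Step 4: adapt beta when mean KL leaves the band
    # [kl_target / 1.5, 1.5 * kl_target].
    if mean_kl > 1.5 * kl_target:
        new_beta = beta * beta_update_factor
    elif mean_kl < kl_target / 1.5:
        new_beta = beta / beta_update_factor
    else:
        new_beta = beta

    return {
        "kl_values": [round(k, 4) for k in kl_values],
        "mean_kl": round(mean_kl, 4),
        "shaped_rewards": [round(s, 4) for s in shaped],
        "updated_beta": round(new_beta, 4),
    }


if __name__ == "__main__":
    result = adaptive_kl_penalized_reward(
        rewards=[1.0, 2.0],
        policy_logprobs=[[-1.0, -2.0], [-0.5, -0.5]],
        ref_logprobs=[[-1.5, -2.5], [-1.0, -1.0]],
        beta=0.1,
        kl_target=1.0,
        beta_update_factor=2.0,
    )
    print(result)
```

In the usage example, both responses have a KL estimate of 1.0, so the mean sits inside the target band and beta stays at 0.1. Note that the summed log-probability difference is a single-sample estimate of the KL divergence, so it can be negative for an individual response.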
