KL Divergence Estimator for GRPO
RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.
Difficulty: easy.
Topics: KL Divergence Estimator for GRPO, Kullback-Leibler Divergence, Group Relative Policy Optimization, Log-Probability Ratios, Monte Carlo Sampling, Reference Model Calibration, Information Theory, Reinforcement Learning, Probability Theory, Numerical Optimization, Deep Learning Architectures, Relative Entropy Measures, Policy Gradient Methods, Distributional Modeling, Stochastic Approximation, Transformer Decoder Blocks.
Implement the unbiased KL divergence estimator used in GRPO (Group Relative Policy Optimization). For each sample, this estimator produces an estimate of the KL divergence between the current policy and a reference policy. The estimate is then used as a regularization term that keeps the policy from drifting too far from the reference.
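One possible starting point is sketched below. It assumes the estimator meant here is the form commonly written for GRPO as r - log(r) - 1, where r = pi_ref(x) / pi_theta(x) is evaluated on samples x drawn from the current policy. It also assumes the per-sample log-probabilities under both policies have already been computed. The function name `grpo_kl_estimate` and its argument names are hypothetical, chosen only for illustration.

```python
import math


def grpo_kl_estimate(policy_logprobs, ref_logprobs):
    """Per-sample KL estimate of the form r - log(r) - 1.

    Assumption: r = pi_ref(x) / pi_theta(x), with x sampled from the
    current policy pi_theta. Inputs are sequences of log-probabilities
    (hypothetical argument names, not taken from the problem statement).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both inputs must have the same length")

    estimates = []
    for logp_policy, logp_ref in zip(policy_logprobs, ref_logprobs):
        # log(r) = log pi_ref(x) - log pi_theta(x)
        log_ratio = logp_ref - logp_policy
        # exp(log_ratio) - log_ratio - 1, written with expm1 so that
        # small log-ratios do not lose precision to cancellation.
        estimates.append(math.expm1(log_ratio) - log_ratio)
    return estimates


if __name__ == "__main__":
    # Two samples: the policies agree on the first and differ on the second.
    policy = [math.log(0.5), math.log(0.5)]
    reference = [math.log(0.5), math.log(0.25)]
    print(grpo_kl_estimate(policy, reference))
```

Every per-sample value is non-negative because exp(t) - t - 1 >= 0 for all real t, and it equals zero exactly when the two log-probabilities match. Averaging the values over samples drawn from the current policy gives the estimate of KL(pi_theta || pi_ref).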