Dr. GRPO: Complete Objective Function

RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.

Difficulty: medium.

Topics: Dr. GRPO: Complete Objective Function, Group Relative Policy Optimization, Kullback-Leibler Penalty, Advantage Normalization, Monte Carlo Sampling, Reference Model Constraint, Reinforcement Learning, Optimization Theory, Probability and Statistics, Natural Language Processing, Information Theory, Policy Gradient Methods, Reward Modeling, Stochastic Approximation, Sequence Modeling, Divergence Measures.

Implement the complete Dr. GRPO (GRPO Done Right) objective function for reinforcement learning with large language models. Dr. GRPO fixes two biases in GRPO: (1) a response-level length bias caused by the 1/|o_i| normalization, and (2) a question-level difficulty bias caused by dividing advantages by the reward standard deviation. The objective uses unbiased advantages (each reward minus the group mean reward) and token-level clipped importance ratios summed over all tokens of each response. Given token log probabilities under the new and old policies, per-response rewards, and the clipping parameter epsilon, compute the Dr. GRPO objective value.
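The description above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference solution: the function name `dr_grpo_objective`, the nested-list input layout (one list of token log probabilities per response), the default `epsilon=0.2`, and the final averaging over the group size G are all assumptions not fixed by the problem statement. No KL term is included, since the stated inputs contain no reference-model log probabilities.

```python
import math
from typing import Sequence


def dr_grpo_objective(
    new_logps: Sequence[Sequence[float]],
    old_logps: Sequence[Sequence[float]],
    rewards: Sequence[float],
    epsilon: float = 0.2,
) -> float:
    """Sketch of the Dr. GRPO objective for one group of G responses.

    new_logps[i][t] / old_logps[i][t]: log probability of token t of
    response i under the new / old policy (hypothetical input layout).
    """
    G = len(rewards)
    if G == 0 or not (len(new_logps) == len(old_logps) == G):
        raise ValueError("need one log-prob sequence per reward")

    # Unbiased advantage: reward minus group mean, with NO std division.
    mean_reward = sum(rewards) / G

    total = 0.0
    for lp_new, lp_old, reward in zip(new_logps, old_logps, rewards):
        if len(lp_new) != len(lp_old):
            raise ValueError("new/old log-prob lengths differ")
        advantage = reward - mean_reward
        for n, o in zip(lp_new, lp_old):
            # Token-level importance ratio pi_new / pi_old.
            ratio = math.exp(n - o)
            clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
            # PPO-style pessimistic (min) surrogate, summed over tokens
            # with NO 1/|o_i| length normalization.
            total += min(ratio * advantage, clipped * advantage)

    # Assumption: average over the group size G only.
    return total / G
```

When the new and old policies are identical, every ratio equals 1, so the value reduces to (1/G) times the sum over responses of |o_i| times the advantage of response i. That makes a convenient sanity check for any implementation.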