Group Relative Advantage for GRPO

RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.

Difficulty: easy.

Topics: Group Relative Advantage for GRPO, Group Relative Policy Optimization, Advantage Normalization, KL-Divergence Penalty, Monte Carlo Sampling, Reward Model Calibration, Reinforcement Learning, Optimization Theory, Probability and Statistics, Natural Language Processing, Distributed Systems, Policy Gradient Methods, Preference Modeling, Variance Reduction Techniques, Stochastic Approximation, Multi-Agent Coordination.

Implement the group-relative advantage calculation used in GRPO (Group Relative Policy Optimization), as described in the DeepSeek-R1 paper. In GRPO, the model generates a group of G outputs for each prompt, and each output receives a scalar reward. The advantage of each output is computed by normalizing its reward within the group: subtract the group's mean reward, then divide by the group's standard deviation. This normalization makes the policy update depend on how an output compares with the other outputs for the same prompt rather than on its absolute reward, which is key to GRPO's effectiveness.
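One possible way to approach the calculation is sketched below in plain Python. It is a minimal sketch under stated assumptions, not a reference solution: the function name `group_relative_advantages`, the `eps` term (added to the denominator so that a group with identical rewards does not divide by zero), and the use of the population standard deviation (dividing by G rather than G - 1) are choices made here for illustration and are not specified by the problem statement.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one prompt's group of G outputs.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumptions (not from the problem statement): population std
    (divide by G) and a small eps to guard against a zero std.
    """
    if not rewards:
        return []

    g = len(rewards)
    mean = sum(rewards) / g

    # Population variance: average squared deviation from the group mean.
    variance = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(variance)

    # Outputs above the group mean get positive advantages, below get negative.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for a group of G = 4 outputs to the same prompt.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

With this sketch, a group whose rewards are all equal yields advantages of zero for every output, so such a group would contribute no learning signal. If the expected answers use the sample standard deviation instead, the divisor G in the variance would become G - 1.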