GSPO: Group Sequence Policy Optimization

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: hard.

Topics: GSPO: Group Sequence Policy Optimization, Kullback-Leibler Divergence, Proximal Policy Optimization, Group Equivariance, Trajectory Sampling, Advantage Estimation, Reinforcement Learning, Stochastic Optimization, Game Theory, Information Theory, Statistical Learning Theory, Policy Gradient Methods, Multi-Agent Coordination, Sequence Modeling, Trust Region Methods, Variational Inference.

Implement Group Sequence Policy Optimization (GSPO), a reinforcement learning algorithm for training large language models. Unlike token-level methods such as GRPO, which can suffer from training instability, GSPO uses sequence-level importance ratios with length normalization. Given log probabilities from the new and old policies, along with rewards for a group of sequences, compute the clipped objective. The algorithm is designed to prevent catastrophic model collapse by applying clipping at the sequence level rather than at the token level.
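
One way to read the task is sketched below; it is a minimal illustration under stated assumptions, not a reference solution. It assumes the log probabilities arrive as one list of per-token values per sequence, that the length-normalized ratio is s_i = exp(mean_t(new_logp - old_logp)), that advantages are rewards standardized within the group using the population standard deviation, and that the objective is the group mean of min(s_i * A_i, clip(s_i, 1 - epsilon, 1 + epsilon) * A_i). The function name `gspo_objective`, its parameters, and the default `epsilon=0.2` are hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def gspo_objective(
    new_logps: Sequence[Sequence[float]],
    old_logps: Sequence[Sequence[float]],
    rewards: Sequence[float],
    epsilon: float = 0.2,   # assumed clipping range, not given in the problem
    adv_eps: float = 1e-8,  # assumed guard against a zero reward spread
) -> float:
    """Clipped sequence-level objective for one group of sampled sequences."""
    group_size = len(rewards)
    if group_size == 0 or not (len(new_logps) == len(old_logps) == group_size):
        raise ValueError("new_logps, old_logps and rewards must share a non-zero length")

    # Group-relative advantages: standardize rewards within the group
    # (population standard deviation is an assumption of this sketch).
    mean_r = sum(rewards) / group_size
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / group_size)
    advantages = [(r - mean_r) / (std_r + adv_eps) for r in rewards]

    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        if len(new_lp) != len(old_lp) or len(new_lp) == 0:
            raise ValueError("each sequence needs matching, non-empty token log probs")

        # Length-normalized sequence-level importance ratio:
        # exp of the mean per-token log-ratio.
        mean_log_ratio = sum(n - o for n, o in zip(new_lp, old_lp)) / len(new_lp)
        ratio = math.exp(mean_log_ratio)

        # Clip once per sequence, then take the pessimistic (minimum) term.
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        total += min(ratio * adv, clipped * adv)

    return total / group_size
```

For example, calling `gspo_objective([[-1.0, -1.0], [-2.0, -2.0]], [[-1.5, -1.5], [-2.0, -2.0]], [1.0, 0.0])` clips the first sequence's ratio (about 1.65) to 1.2 because its advantage is positive, while the second sequence keeps a ratio of 1. Because clipping is applied once per sequence, every token in a sequence shares the same ratio, which is the contrast with token-level clipping drawn in the problem statement.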