PTX Loss for Catastrophic Forgetting Prevention (RLHF)

RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.

Difficulty: medium.

Topics: PTX Loss for Catastrophic Forgetting Prevention (RLHF), Kullback-Leibler Divergence, Pre-training Objective, Catastrophic Forgetting, Proximal Policy Optimization, Reward Model Calibration, Reinforcement Learning, Natural Language Processing, Optimization Theory, Information Theory, Statistical Learning, Preference Alignment, Continual Learning, Language Modeling, Gradient-based Regularization, Policy Optimization.

Implement the PTX (pre-training) loss used to prevent catastrophic forgetting during RLHF. The PTX loss combines the reinforcement learning objective with a cross-entropy loss on pre-training data, so the model keeps its general language capabilities while it is optimized for reward. Given the RL loss, the model's logits on a pre-training batch, the true labels, and a coefficient beta, compute the total loss. This technique is used in InstructGPT, ChatGPT, Kimi K2, and other RLHF systems.
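
A minimal sketch of one way to compute this, not a reference solution. The problem statement does not spell out the formula, so the sketch assumes the usual additive form, total = rl_loss + beta * cross_entropy, with the cross-entropy averaged over all tokens in the pre-training batch. The function name ptx_total_loss, the NumPy implementation, the accepted shapes, and the absence of a padding mask are all assumptions made for illustration.

```python
import numpy as np


def ptx_total_loss(rl_loss, logits, labels, beta):
    """Return rl_loss + beta * mean cross-entropy on the pre-training batch.

    Assumed shapes (not given in the problem statement):
      logits: (..., vocab_size) unnormalized scores
      labels: (...) integer class ids, matching the leading dims of logits
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)

    # Flatten batch and sequence dimensions into one token axis.
    vocab_size = logits.shape[-1]
    flat_logits = logits.reshape(-1, vocab_size)
    flat_labels = labels.reshape(-1)

    # Numerically stable log-softmax: subtract the row max before exponentiating.
    shifted = flat_logits - flat_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of the true label for each token.
    token_nll = -log_probs[np.arange(flat_labels.shape[0]), flat_labels]

    # Assumption: mean reduction over tokens, no padding mask.
    ptx_loss = token_nll.mean()

    return float(rl_loss + beta * ptx_loss)


if __name__ == "__main__":
    # Uniform logits over 4 classes give a cross-entropy of ln(4).
    logits = np.zeros((2, 3, 4))
    labels = np.array([[0, 1, 2], [3, 0, 1]])
    print(ptx_total_loss(rl_loss=1.0, logits=logits, labels=labels, beta=0.5))
```

With uniform logits over 4 classes the cross-entropy is ln(4), so the example evaluates to 1.0 + 0.5 * ln(4), roughly 1.693. Setting beta to 0 recovers the plain RL loss, and a larger beta weights retention of pre-training behavior more heavily against reward optimization.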