Relativistic Critic Rewards for Adversarial Reasoning
RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.
Difficulty: medium.
Topics: Relativistic Critic Rewards for Adversarial Reasoning, Bradley-Terry Model, Minimax Regret, Kullback-Leibler Divergence, Generalized Advantage Estimation, Nash Equilibrium Convergence, Reinforcement Learning, Game Theory, Optimization Theory, Probabilistic Graphical Models, Information Theory, Preference Modeling, Adversarial Training, Policy Gradient Methods, Value Function Approximation, Multi-Agent Coordination.
Implement the reward computation for the Relativistic Adversarial Reasoning Optimization (RARO) algorithm. RARO trains LLMs to reason using only expert demonstrations, without task-specific verifiers. It sets up an adversarial game between a policy, which generates answers, and a relativistic critic, which compares a policy answer against an expert answer. Given the critic's prediction of which answer is better ('expert', 'policy', or 'tie') and the ground-truth labels, compute both the critic reward and the policy reward. The critic is rewarded for correctly identifying the expert answer, while the policy is rewarded when the critic mistakes its answer for the expert's. When the critic predicts 'tie', both the critic and the policy can receive a partial reward, set by configurable tie-reward parameters.
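The reward rule described above can be sketched as follows. This is a minimal illustration, not a reference solution: the function name `compute_raro_rewards`, the parameter names `critic_tie_reward` and `policy_tie_reward`, their default values of 0.5, and the full/zero reward values of 1.0 and 0.0 are all assumptions made for the example. It also assumes the critic's prediction has already been resolved against the ground-truth labels into one of the three strings 'expert', 'policy', or 'tie'.

```python
from typing import Tuple

# Labels the critic may output, as listed in the problem statement.
VALID_PREDICTIONS = ("expert", "policy", "tie")


def compute_raro_rewards(
    prediction: str,
    critic_tie_reward: float = 0.5,  # hypothetical default, not from the source
    policy_tie_reward: float = 0.5,  # hypothetical default, not from the source
) -> Tuple[float, float]:
    """Return (critic_reward, policy_reward) for a single comparison.

    Assumes a full reward of 1.0 and a zero reward of 0.0; the problem
    statement does not fix these magnitudes.
    """
    if prediction not in VALID_PREDICTIONS:
        raise ValueError(f"unknown prediction: {prediction!r}")

    if prediction == "tie":
        # Both sides receive their configured partial reward.
        return critic_tie_reward, policy_tie_reward

    if prediction == "expert":
        # The critic correctly identified the expert answer.
        return 1.0, 0.0

    # prediction == "policy": the critic mistook the policy answer
    # for the expert's, so the policy is rewarded instead.
    return 0.0, 1.0
```

For example, `compute_raro_rewards("tie", critic_tie_reward=0.6, policy_tie_reward=0.3)` returns `(0.6, 0.3)` under these assumptions. Keeping the two tie parameters separate lets the critic and the policy be given different incentives on ties.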