RAFT: Iterative Reward-Ranked Fine-Tuning Loop

RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.

Difficulty: medium.

Topics: RAFT: Iterative Reward-Ranked Fine-Tuning Loop, Kullback-Leibler Divergence, Bradley-Terry Model, Rejection Sampling, Proximal Policy Optimization, Logit Bias Adjustment, Reinforcement Learning, Supervised Fine-Tuning, Preference Modeling, Optimization Theory, Natural Language Processing, Policy Gradient Methods, Reward Function Engineering, Iterative Data Augmentation, Contrastive Learning, Sequence Modeling Architectures.

Implement the RAFT (Reward rAnked FineTuning) alignment loop, which iterates over T stages to progressively improve a generative model. At each stage t, RAFT performs three steps:

Step 1: Data Collection. A batch of b prompts is sampled, and K candidate responses are generated per prompt from the current model. (These are provided as input.)

Step 2: Data Ranking. For each prompt x, use the reward model to score all K candidates, then select the response with the highest reward: y* = argmax_j r(x, y_j). Collect these best responses into a batch B of size b.

Step 3: Model Fine-Tuning. Compute the SFT (supervised fine-tuning) loss on batch B, defined as the mean negative log-likelihood over all tokens in the selected responses.

Given:

stages_log_probs: Token-level log probabilities across T stages. Shape: [T][num_prompts][K][num_tokens]. stages_log_probs[t][i][j][k] is the log probability of token k in response j for prompt i at stage t.

stages_rewards: Reward scores across T stages. Shape: [T][num_prompts][K]. stages_rewards[t][i][j] is the reward for response j of prompt i at stage t.

Return a list of T SFT losses (one per stage), each rounded to 4 decimal places. As the model improves across stages, the losses should generally decrease.
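The ranking and loss steps of the problem statement can be sketched in plain Python as follows. This is a minimal sketch, not a reference solution: the function name raft_sft_losses is hypothetical, ties in reward are assumed to go to the first candidate, and "mean over all tokens" is assumed to mean pooling every token of the selected responses in a stage before averaging (which matches a per-response average whenever all responses have the same length).

```python
from typing import List


def raft_sft_losses(
    stages_log_probs: List[List[List[List[float]]]],
    stages_rewards: List[List[List[float]]],
) -> List[float]:
    """Return one SFT loss per stage (hypothetical helper, see assumptions above)."""
    losses = []
    for log_probs, rewards in zip(stages_log_probs, stages_rewards):
        total_nll = 0.0   # summed negative log-likelihood of selected tokens
        total_tokens = 0  # number of tokens in the selected responses
        for prompt_log_probs, prompt_rewards in zip(log_probs, rewards):
            # Step 2: pick the candidate with the highest reward.
            # max() returns the first maximal index, so ties go to the lowest j.
            best_j = max(range(len(prompt_rewards)), key=prompt_rewards.__getitem__)
            selected = prompt_log_probs[best_j]
            # Step 3: accumulate the negative log-likelihood of its tokens.
            total_nll -= sum(selected)
            total_tokens += len(selected)
        # Assumption: an empty stage yields a loss of 0.0 rather than an error.
        loss = total_nll / total_tokens if total_tokens else 0.0
        losses.append(round(loss, 4))
    return losses


# Illustrative input: T = 2 stages, 2 prompts, K = 2 candidates, 2 tokens each.
example_log_probs = [
    [[[-1.0, -2.0], [-0.5, -0.5]], [[-0.2, -0.4], [-3.0, -1.0]]],
    [[[-0.1, -0.3], [-0.9, -0.7]], [[-0.2, -0.2], [-0.6, -0.4]]],
]
example_rewards = [
    [[0.1, 0.9], [0.8, 0.2]],
    [[0.7, 0.3], [0.5, 0.4]],
]
print(raft_sft_losses(example_log_probs, example_rewards))  # [0.4, 0.2]
```

In the illustrative input, stage 0 selects candidate 1 for the first prompt and candidate 0 for the second, giving a summed negative log-likelihood of 1.6 over 4 tokens. Stage 1 selects candidate 0 for both prompts, giving 0.8 over 4 tokens, so the loss decreases from 0.4 to 0.2.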
