Module 14
Alignment & Preference Learning
Submodule 01: RLHF, Reward Modeling & Human Feedback
- 1. The GRPO Objective Function (Hard)
- 2. Reward Model Loss from Pairwise Human Preferences (concept, Med.)
- 3. Reward Model Validation Accuracy (concept, Easy)
- 4. Direct Preference Optimization (DPO) Loss (concept, Med.)
- 5. Dr. GRPO: Complete Objective Function (concept, Med.)
- 6. Group Relative Advantage for GRPO (concept, Easy)
- 7. Fine-Tune Model Weights with RLHF Policy Gradient (concept, Med.)
- 8. Adaptive KL-Penalized Reward Shaping for RLHF (concept, Med.)
- 9. PTX Loss for Catastrophic Forgetting Prevention (RLHF) (concept, Med.)
- 10. RAFT: Iterative Reward-Ranked Fine-Tuning Loop (concept, Med.)
- 11. Relativistic Critic Rewards for Adversarial Reasoning (concept, Med.)
- 12. KL Divergence Estimator for GRPO (concept, Easy)
- 13. Log-Probability Ratio for Token Sequences (concept, Med.)