Log-Probability Ratio for Token Sequences
RLHF, Reward Modeling & Human Feedback DS practice problem on Onlearn.
Difficulty: medium.
Topics: Log-Probability Ratio for Token Sequences, Log-Probability Ratio, Kullback-Leibler Divergence, Bradley-Terry Model, Softmax Temperature Scaling, Token-level Logits, Reinforcement Learning, Probabilistic Graphical Models, Information Theory, Natural Language Processing, Statistical Inference, Policy Gradient Methods, Likelihood Estimation, Sequence Modeling, Preference Alignment, Entropy Regularization.
In reinforcement learning from human feedback (RLHF) and related alignment methods, a key quantity is the log-probability ratio between a policy model and a reference model, evaluated over a sequence of tokens. This ratio measures how far the policy has diverged from the reference on each generated token.

Given token-level log probabilities from a policy model and a reference model for a batch of sequences (which may have different true lengths and are padded to a common length T), implement a function that computes:

1. Per-token log-probability ratios: for each token position, the policy log prob minus the reference log prob. Padded positions (where mask is 0) should have a ratio of 0.
2. Sequence-level log-probability ratios: for each sequence in the batch, the sum of the per-token log ratios over all valid (non-padded) tokens.
3. Mean sequence log-probability ratio: the average of the sequence-level ratios across the batch.

Your function should accept:

- policy_log_probs: a NumPy array of shape (B, T) containing log probabilities under the policy model
- reference_log_probs: a NumPy array of shape (B, T) containing log probabilities under the reference model
- mask: a NumPy array of shape (B, T) with 1 for real tokens and 0 for padding

Return a dictionary with keys 'per_token_log_ratios' (shape (B, T)), 'sequence_log_ratios' (shape (B,)), and 'mean_sequence_log_ratio' (a float scalar).
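One possible way to approach the three quantities described above is sketched below. This is a minimal illustration, not a reference solution: the function name compute_log_prob_ratios is hypothetical, and the underscore spellings of the argument names and dictionary keys are assumed from the problem statement. It also assumes the batch is non-empty and that all three inputs share the shape (B, T).

```python
import numpy as np


def compute_log_prob_ratios(policy_log_probs, reference_log_probs, mask):
    """Sketch of masked log-probability ratios (hypothetical function name)."""
    policy = np.asarray(policy_log_probs, dtype=np.float64)
    reference = np.asarray(reference_log_probs, dtype=np.float64)
    valid = np.asarray(mask).astype(bool)

    # 1. Per-token ratio: log pi(token) - log pi_ref(token).
    #    np.where writes exactly 0.0 at padded positions, whatever they hold.
    per_token = np.where(valid, policy - reference, 0.0)

    # 2. Sequence-level ratio: sum over the time axis. Padded positions
    #    contribute 0, so only valid tokens affect the sum.
    sequence = per_token.sum(axis=1)

    # 3. Mean over the batch, converted to a plain Python float.
    mean = float(sequence.mean())

    return {
        "per_token_log_ratios": per_token,   # shape (B, T)
        "sequence_log_ratios": sequence,     # shape (B,)
        "mean_sequence_log_ratio": mean,     # scalar
    }


# Small made-up batch: B = 2, T = 3, second sequence has one padded position.
policy = np.array([[-1.0, -2.0, -3.0], [-0.5, -1.5, -9.0]])
reference = np.array([[-1.5, -2.5, -3.5], [-1.0, -1.0, -7.0]])
mask = np.array([[1, 1, 1], [1, 1, 0]])

result = compute_log_prob_ratios(policy, reference, mask)
print(result["sequence_log_ratios"])      # [1.5 0. ]
print(result["mean_sequence_log_ratio"])  # 0.75
```

In the example, the padded entry in the second sequence (-9.0 versus -7.0) is ignored, so that sequence sums to 0.5 + (-0.5) = 0. Note that the sequence-level value is a sum rather than a length-normalized mean, so longer sequences can accumulate larger ratios.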