Pairwise Preference Judge for LLM Comparison
Vision-Language & Cross-Modal Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Pairwise Preference Judge for LLM Comparison, Log-Likelihood Ratio, Visual Token Embedding, Elo Rating System, Softmax Temperature Scaling, Inter-Annotator Agreement, Reinforcement Learning from Human Feedback, Natural Language Processing, Computer Vision, Statistical Evaluation Metrics, Large Language Model Architecture, Bradley-Terry Modeling, Cross-Modal Alignment, Reward Model Training, Inference-Time Decoding Strategies, Dataset Curation and Annotation.
Implement a pairwise preference judge that compares responses from two LLMs across multiple evaluation criteria. In LLM evaluation, pairwise comparison is a common approach where human annotators or automated systems compare two model outputs and determine which is better. This function should analyze such comparisons.

Function Inputs:
  comparisons: A list of comparison dictionaries, each containing:
    'id': Unique identifier for the comparison
    'scores_a': Dictionary mapping criterion names to scores for response A
    'scores_b': Dictionary mapping criterion names to scores for response B
  criteria_weights: Dictionary mapping criterion names to their importance weights
  tie_threshold: Float threshold; if the absolute weighted score difference is less than or equal to this value, declare a tie

Function Output:
  A dictionary containing:
    'results': List of result dictionaries, each with:
      'id': The comparison identifier
      'winner': String 'A', 'B', or 'tie'
      'margin': The absolute weighted score difference (rounded to 4 decimal places)
    'win_rate_a': Proportion of comparisons won by A (rounded to 4 decimal places)
    'win_rate_b': Proportion of comparisons won by B (rounded to 4 decimal places)
    'tie_rate': Proportion of tied comparisons (rounded to 4 decimal places)
    'avg_margin': Average absolute margin across all comparisons (rounded to 4 decimal places)

Note: Weights should be normalized so they sum to 1 when computing weighted scores. For empty comparison lists, return all rates and average margin as 0.0.
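The specification above can be sketched in Python as follows. This is one possible reading, not a reference solution. The function name pairwise_preference_judge is hypothetical, and the sketch assumes the dictionary keys are spelled with underscores (scores_a, win_rate_a, and so on). It also makes three choices the statement leaves open: a criterion missing from a score dictionary counts as 0.0, criteria that have no weight are ignored, and avg_margin is averaged over unrounded margins before rounding.

```python
from typing import Any, Dict, List


def pairwise_preference_judge(
    comparisons: List[Dict[str, Any]],
    criteria_weights: Dict[str, float],
    tie_threshold: float,
) -> Dict[str, Any]:
    """Judge each A-vs-B comparison by weighted score and summarize the outcomes."""
    # Normalize weights so they sum to 1 (assumption: all zeros if the total is 0).
    total_weight = sum(criteria_weights.values())
    if total_weight:
        weights = {c: w / total_weight for c, w in criteria_weights.items()}
    else:
        weights = {c: 0.0 for c in criteria_weights}

    results = []
    counts = {"A": 0, "B": 0, "tie": 0}
    margin_sum = 0.0

    for comp in comparisons:
        # Assumption: a criterion missing from a score dict contributes 0.0.
        score_a = sum(w * comp["scores_a"].get(c, 0.0) for c, w in weights.items())
        score_b = sum(w * comp["scores_b"].get(c, 0.0) for c, w in weights.items())
        diff = score_a - score_b
        margin = abs(diff)

        if margin <= tie_threshold:
            winner = "tie"
        elif diff > 0:
            winner = "A"
        else:
            winner = "B"

        counts[winner] += 1
        margin_sum += margin  # accumulate the unrounded margin
        results.append(
            {"id": comp["id"], "winner": winner, "margin": round(margin, 4)}
        )

    n = len(comparisons)
    if n == 0:
        # Empty input: all rates and the average margin are 0.0.
        return {
            "results": [],
            "win_rate_a": 0.0,
            "win_rate_b": 0.0,
            "tie_rate": 0.0,
            "avg_margin": 0.0,
        }

    return {
        "results": results,
        "win_rate_a": round(counts["A"] / n, 4),
        "win_rate_b": round(counts["B"] / n, 4),
        "tie_rate": round(counts["tie"] / n, 4),
        "avg_margin": round(margin_sum / n, 4),
    }
```

Because the tie check uses "less than or equal to", a margin exactly equal to tie_threshold counts as a tie. Floating-point rounding can push a borderline margin to either side of the threshold, so inputs near the boundary may need a small tolerance.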