Rubric-Based LLM Judge Evaluation

Vision-Language & Cross-Modal Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Rubric-Based LLM Judge Evaluation, Chain-of-Thought Prompting, Cross-Attention Mechanisms, Cohen's Kappa Coefficient, Few-Shot In-Context Learning, Logit Bias Adjustment, Natural Language Processing, Computer Vision, Evaluation Metrics, Prompt Engineering, Statistical Inference, Large Language Model Architectures, Multimodal Alignment, Automated Benchmarking, Instruction Tuning, Probabilistic Calibration.

Implement a function that performs rubric-based evaluation of LLM outputs using multiple judges. This is a common technique in LLM-as-a-judge evaluation pipelines, where multiple language models score a response across different quality criteria.

Your function should take:

- judge_scores: A 2D list where judge_scores[i][j] is the score given by judge i for criterion j. Scores range from 0 to max_score.
- criteria_weights: A list of weights, one per criterion (weights sum to 1).
- passing_threshold: The minimum normalized score (0 to 1) required to pass (default: 0.6).
- max_score: The maximum possible score for any criterion (default: 5.0).

Your function should return a dictionary containing:

- weighted_score: The overall weighted score combining all criteria and judges.
- normalized_score: The weighted score normalized to a 0-1 scale.
- criterion_scores: A list of average scores per criterion across all judges.
- pass_status: Boolean indicating whether the response passes the threshold.
- judge_agreement: A metric from 0 to 1 indicating how much the judges agree (1 = perfect agreement).

For judge_agreement, use a standard-deviation-based approach in which agreement decreases as the average standard deviation across criteria increases relative to the maximum possible standard deviation.
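
One possible reading of this specification is sketched below in Python. It is not a reference solution. The function name evaluate_rubric and the underscore spellings of the parameters and dictionary keys are assumptions. The statement does not define the "maximum possible standard deviation", so the sketch assumes the population standard deviation and takes max_score / 2 as the upper bound (reached when half the judges give 0 and half give max_score). It also assumes the weighted score is the weighted sum of per-criterion averages, that normalization divides by max_score, and that no rounding is applied.

```python
from statistics import mean, pstdev


def evaluate_rubric(judge_scores, criteria_weights,
                    passing_threshold=0.6, max_score=5.0):
    """Aggregate rubric scores from several judges (illustrative sketch)."""
    n_criteria = len(criteria_weights)

    # Regroup scores by criterion: per_criterion[j] holds every judge's
    # score for criterion j.
    per_criterion = [[judge[j] for judge in judge_scores]
                     for j in range(n_criteria)]

    # Average score per criterion across all judges.
    criterion_scores = [mean(column) for column in per_criterion]

    # Weighted sum of the per-criterion averages (weights sum to 1).
    weighted_score = sum(w * s for w, s in zip(criteria_weights,
                                               criterion_scores))

    # Assumption: normalize by dividing by the maximum possible score.
    normalized_score = weighted_score / max_score

    # Assumption: population standard deviation per criterion, averaged,
    # compared against max_score / 2 as the largest possible spread.
    avg_std = mean(pstdev(column) for column in per_criterion)
    max_std = max_score / 2
    judge_agreement = max(0.0, 1.0 - avg_std / max_std) if max_std > 0 else 1.0

    return {
        "weighted_score": weighted_score,
        "normalized_score": normalized_score,
        "criterion_scores": criterion_scores,
        "pass_status": normalized_score >= passing_threshold,
        "judge_agreement": judge_agreement,
    }
```

For example, with two judges scoring two equally weighted criteria as [[4, 3], [4, 5]], both criterion averages are 4, the normalized score is 0.8, and the only disagreement (3 versus 5 on the second criterion) gives an average standard deviation of 0.5, so the agreement under these assumptions is 1 - 0.5 / 2.5 = 0.8.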