Knowledge Distillation Loss
Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.
Difficulty: medium.
Topics: Knowledge Distillation Loss, Kullback-Leibler Divergence, Temperature Scaling, Soft Targets, Logit Mimicry, Feature Map Distillation, Deep Learning Theory, Model Compression, Information Theory, Optimization Methods, Supervised Learning, Teacher-Student Architectures, Probability Distribution Matching, Regularization Techniques, Gradient-based Learning, Representation Alignment.
Implement the knowledge distillation loss used to transfer capabilities from a large teacher model to a smaller student model. Distillation trains the student to match the teacher's soft probability distribution instead of relying only on hard labels, which helps a smaller model approach the performance of its larger counterpart. Compute the KL divergence between the temperature-softened teacher and student distributions.
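One possible way to approach the task is sketched below with NumPy. It is not a reference solution, and several details are assumptions the problem statement does not fix: the function names `log_softmax` and `distillation_loss`, the default temperature of 2.0, the direction KL(teacher || student), averaging over the batch, and the optional multiplication by T^2 that is often used to keep gradient magnitudes comparable across temperatures. With p = softmax(teacher_logits / T) and q = softmax(student_logits / T), the quantity computed per example is sum_i p_i * (log p_i - log q_i).

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-softened softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    # Subtract the row maximum for numerical stability; the result is unchanged.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=2.0,
                      scale_by_t2=True):
    """Mean KL(teacher || student) between temperature-softened distributions.

    Assumptions (not stated in the problem): batch-mean reduction and an
    optional T^2 factor, controlled by the hypothetical `scale_by_t2` flag.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher soft targets
    log_q = log_softmax(student_logits, temperature)  # student predictions
    p = np.exp(log_p)
    # KL divergence per example, summed over classes.
    kl = np.sum(p * (log_p - log_q), axis=-1)
    loss = float(np.mean(kl))
    if scale_by_t2:
        loss *= temperature ** 2
    return loss


if __name__ == "__main__":
    teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 1.0]])
    student = np.array([[2.5, 1.5, 0.5], [0.5, 2.0, 1.5]])
    print(distillation_loss(student, teacher, temperature=2.0))
```

The loss is zero when the student logits equal the teacher logits and positive otherwise. Raising the temperature flattens both distributions, so the student receives more signal from the teacher's relative ranking of the non-target classes.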