Knowledge Distillation Loss
Sequence Models & Generative Models DS practice problem on Onlearn.
Difficulty: medium.
Topics: Knowledge Distillation Loss, Temperature Scaling, Logit Matching, Cross-Entropy Loss, Weight Pruning, Feature Map Alignment, Information Theory, Optimization Theory, Probabilistic Graphical Models, Neural Network Architectures, Supervised Learning, Kullback-Leibler Divergence, Stochastic Gradient Descent, Softmax Normalization, Teacher-Student Frameworks, Model Compression.
Implement the knowledge distillation loss used to transfer capabilities from a large teacher model to a smaller student model. Distillation trains the student to match the teacher's soft probability distribution rather than hard labels alone, which lets a smaller model reach performance closer to that of its larger counterpart. Compute the KL divergence between the temperature-softened teacher and student distributions, where each distribution is the softmax of the model's logits divided by a temperature T.
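One way to approach the task is sketched below in plain Python. This is a minimal sketch under stated assumptions, not a reference solution: the function names, the batch-mean reduction, the direction KL(teacher || student), and the optional T^2 scaling (a common convention for keeping gradient magnitudes comparable across temperatures) are all assumptions that the problem statement does not specify.

```python
import math
from typing import List, Sequence


def log_softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Return log(softmax(logits / temperature)) using the log-sum-exp trick."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def distillation_loss(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
    temperature: float = 2.0,
    scale_by_t_squared: bool = True,  # assumption: common convention, not from the problem text
) -> float:
    """Mean over the batch of KL(teacher || student) on temperature-softened distributions."""
    if len(student_logits) != len(teacher_logits) or len(student_logits) == 0:
        raise ValueError("inputs must be non-empty batches of equal size")

    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        if len(s_row) != len(t_row):
            raise ValueError("student and teacher rows must have the same number of classes")
        log_p_teacher = log_softmax_with_temperature(t_row, temperature)
        log_p_student = log_softmax_with_temperature(s_row, temperature)
        # KL(p || q) = sum_i p_i * (log p_i - log q_i)
        total += sum(
            math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_teacher, log_p_student)
        )

    loss = total / len(student_logits)
    if scale_by_t_squared:
        loss *= temperature ** 2
    return loss


if __name__ == "__main__":
    teacher = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
    student = [[1.5, 1.2, 0.3], [0.2, 2.0, 0.9]]
    print(distillation_loss(student, teacher, temperature=2.0))
```

When the student's logits equal the teacher's, the two softened distributions coincide and the loss is zero; any mismatch gives a strictly positive value. Raising the temperature flattens both distributions, so the loss puts more weight on the relative probabilities the teacher assigns to non-target classes.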