Inference Head Pruning for Transformers
Attention Mechanisms DS practice problem on Onlearn.
Difficulty: easy.
Topics: Inference Head Pruning for Transformers, Multi-Head Attention, Head Importance Scoring, Sparsity Regularization, Attention Rollout, Weight Magnitude Thresholding, Deep Learning Theory, Model Compression, Natural Language Processing, Computational Complexity, Information Theory, Attention Mechanisms, Network Pruning, Transformer Architectures, Gradient-based Attribution, Parameter Efficiency.
Implement attention head pruning for transformer models to reduce inference cost. Given the attention weights from all heads, a per-head importance score, and a pruning ratio, remove the least important heads while preserving as much model performance as possible. Head pruning exploits redundancy among attention heads: research reports that about 50% of BERT's heads can be removed with less than 1% accuracy loss. The function returns the pruned attention weights and the indices of the kept heads. Head pruning is one of several compression techniques used to cut cost in production serving, alongside the knowledge distillation behind compact models such as DistilBERT and TinyBERT.
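The task described above can be sketched in plain Python as follows. This is a minimal illustration, not a reference solution: the function name `prune_heads`, the argument names, and the input layout (one attention matrix per head, stored as nested lists) are assumptions, as are the choices to round the number of pruned heads down, to always keep at least one head, and to break ties in favor of the lower head index.

```python
import math


def prune_heads(attention_weights, importance_scores, pruning_ratio):
    """Drop the least important heads; return (pruned_weights, kept_indices).

    attention_weights: list with one attention matrix per head (assumed layout).
    importance_scores: list with one float per head; higher means more important.
    pruning_ratio: fraction of heads to remove, in [0, 1].
    """
    num_heads = len(attention_weights)
    if num_heads == 0:
        raise ValueError("at least one attention head is required")
    if len(importance_scores) != num_heads:
        raise ValueError("need exactly one importance score per head")
    if not 0.0 <= pruning_ratio <= 1.0:
        raise ValueError("pruning_ratio must lie in [0, 1]")

    # Assumption: round the pruned count down and never remove every head.
    num_pruned = min(math.floor(pruning_ratio * num_heads), num_heads - 1)
    num_kept = num_heads - num_pruned

    # Rank heads from most to least important; ties go to the lower index.
    ranked = sorted(range(num_heads), key=lambda h: (-importance_scores[h], h))

    # Report kept heads in their original order so the layer layout is stable.
    kept_indices = sorted(ranked[:num_kept])
    pruned_weights = [attention_weights[h] for h in kept_indices]
    return pruned_weights, kept_indices


# Toy example: 4 heads, each with a 2x2 attention matrix (made-up values).
weights = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.2, 0.8], [0.9, 0.1]],
    [[0.6, 0.4], [0.3, 0.7]],
]
scores = [0.9, 0.1, 0.5, 0.3]

pruned, kept = prune_heads(weights, scores, 0.5)
print(kept)         # [0, 2]
print(len(pruned))  # 2
```

With a pruning ratio of 0.5, the two lowest-scoring heads (indices 1 and 3) are dropped. Returning the kept indices in ascending order lets a caller slice the matching projection weights for the same heads, which is where the actual inference savings would come from.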