MuonClip (QK-Clip) for Stabilizing Attention
Backpropagation, Training & Optimization DS practice problem on Onlearn.
Difficulty: hard.
Topics: Understanding MuonClip (QK-Clip) for Attention Stability, L2 Vector Normalization, Logit Clipping, Scaled Dot-Product Attention, Softmax Temperature Scaling, Head-wise Normalization, Deep Learning Optimization, Attention Mechanisms, Numerical Stability, Gradient Dynamics, Matrix Operations, Self-Attention Architecture, Layer Normalization Variants, Weight Initialization, Logit Scaling, Transformer Training Stability.
Implement a 'QK Clip' (or QK Norm) mechanism within a standard scaled dot-product attention module. Your implementation should divide each Query (Q) and Key (K) vector by its L2 norm, computed along the head dimension, before computing the attention scores, and then clamp the resulting logits to a fixed range to ensure numerical stability.
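One possible reading of the task can be sketched in NumPy as follows. This is a minimal sketch, not a reference solution: the function name `qk_clip_attention`, the tensor layout `(batch, heads, seq_len, d_head)`, the choice of the last axis as the "head dimension", and the parameters `clip_value`, `scale`, and `eps` are all assumptions made for illustration and are not specified by the problem statement.

```python
import numpy as np

def qk_clip_attention(q, k, v, clip_value=10.0, scale=None, eps=1e-6):
    """Attention with L2-normalized Q/K and clamped logits (illustrative sketch).

    Assumed shapes: q, k, v are (batch, heads, seq_len, d_head).
    """
    d_head = q.shape[-1]
    if scale is None:
        # Assumption: keep the usual 1/sqrt(d_head) factor as the default temperature.
        scale = 1.0 / np.sqrt(d_head)

    # L2-normalize each query and key vector along the last (d_head) axis.
    # eps guards against division by zero for all-zero vectors.
    q_hat = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)

    # Scaled similarity logits: (batch, heads, seq_len_q, seq_len_k).
    logits = scale * (q_hat @ np.swapaxes(k_hat, -1, -2))

    # Clamp logits to [-clip_value, clip_value].
    logits = np.clip(logits, -clip_value, clip_value)

    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 5, 8))
k = rng.normal(size=(2, 4, 5, 8))
v = rng.normal(size=(2, 4, 5, 8))
out, weights = qk_clip_attention(q, k, v)
print(out.shape, weights.shape)
```

Because the normalized queries and keys have unit length, their dot products lie in [-1, 1], so with the default `scale` the clamp is rarely active. It matters once a larger `scale` (a softmax temperature) is passed in, which is why both are exposed as parameters in this sketch.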