QK-Norm (Query-Key Normalization)

Initialization, Normalization & Regularization DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding QK-Norm (Query-Key Normalization) in Transformer Architectures, QK-Norm, Dot-Product Attention, Mean-Variance Normalization, Affine-free LayerNorm, Softmax Temperature Scaling, Linear Algebra, Deep Learning, Optimization, Neural Network Architectures, Attention Mechanisms, Self-Attention, Layer Normalization, Logit Scaling, Numerical Stability, Transformer Blocks.

Implement the QK-Norm layer as described in 'Transformers without Tears'. Given Query and Key tensors, each of shape (batch, heads, seq_len, head_dim), normalize the two tensors independently using LayerNorm without learnable parameters (no scale or bias) along the last dimension (head_dim), then compute the attention scores from the normalized Query and Key.
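
A minimal NumPy sketch of the task follows. The problem statement does not fix several details, so these are assumptions: the function names layer_norm_no_affine and qk_norm_scores are hypothetical, eps = 1e-5 is used for numerical stability, the variance is the biased (population) variance as in standard LayerNorm, the logits are scaled by 1/sqrt(head_dim), and "attention scores" is read as the pre-softmax logits.

```python
import numpy as np


def layer_norm_no_affine(x, eps=1e-5):
    """Mean-variance normalization over the last axis, with no scale or bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # biased variance (assumption)
    return (x - mean) / np.sqrt(var + eps)


def qk_norm_scores(q, k, scale=None, eps=1e-5):
    """Normalize q and k independently, then return the scaled logits.

    q: (batch, heads, q_len, head_dim)
    k: (batch, heads, k_len, head_dim)
    returns: (batch, heads, q_len, k_len)
    """
    head_dim = q.shape[-1]
    if scale is None:
        scale = 1.0 / np.sqrt(head_dim)  # assumed default temperature
    q_hat = layer_norm_no_affine(q, eps)
    k_hat = layer_norm_no_affine(k, eps)
    # Batched matmul over the last two axes: q_hat @ k_hat^T
    return scale * (q_hat @ np.swapaxes(k_hat, -1, -2))


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 3, 4, 8))
k = rng.normal(size=(2, 3, 5, 8))
scores = qk_norm_scores(q, k)
print(scores.shape)
```

Because each normalized vector has squared norm at most head_dim, every logit in this sketch is bounded in magnitude by sqrt(head_dim) regardless of how large the raw Query and Key entries are. Applying a softmax over the last axis of the returned array would give attention weights, if that is what the grader expects.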