Gated Attention

Attention Mechanisms DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Gated Attention Mechanisms in Transformers, Scaled Dot-Product Attention, Learnable Gating Weights, Sigmoid Gating Function, Multi-Head Projection, Residual Connections, Element-wise Multiplication, Deep Learning, Natural Language Processing, Linear Algebra, Optimization Algorithms, Neural Network Architectures, Attention Mechanisms, Transformer Blocks, Activation Functions, Tensor Operations, Normalization Layers.

Implement a Gated Attention layer in PyTorch. The layer takes an input tensor 'x' of shape (batch, seq_len, dim), computes standard scaled dot-product attention with Q, K, and V projected from x, and then applies a gating mechanism in which the attention result is multiplied element-wise by a sigmoid gate: Output = Attention(Q, K, V) ⊙ sigmoid(Linear(x)). The implementation should handle multi-head attention internally.
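
A minimal sketch of one possible solution follows. It is not a reference answer, and it rests on several assumptions the problem statement leaves open: the class name GatedAttention, the num_heads argument and its default, a fused Q/K/V projection, an output projection applied before the gate, and the absence of masking, dropout, and a residual connection are all illustrative choices.

```python
import math

import torch
import torch.nn as nn


class GatedAttention(nn.Module):
    """Multi-head self-attention whose output is scaled by a sigmoid gate.

    Hypothetical sketch: names and layer layout are assumptions, not a
    prescribed solution.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        if dim % num_heads != 0:
            raise ValueError("dim must be divisible by num_heads")
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # fused Q, K, V projections
        self.out_proj = nn.Linear(dim, dim)  # mixes the concatenated heads
        self.gate = nn.Linear(dim, dim)      # the Linear(x) inside the sigmoid

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        # (b, n, 3*dim) -> (3, b, heads, n, head_dim), then split into Q, K, V
        qkv = self.qkv(x).view(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)

        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = scores.softmax(dim=-1)   # (b, heads, n, n)
        heads = weights @ v                # (b, heads, n, head_dim)

        # Concatenate heads back to (b, n, dim) and project
        attn = self.out_proj(heads.transpose(1, 2).reshape(b, n, d))

        # Gate values lie in (0, 1) and are applied element-wise
        return attn * torch.sigmoid(self.gate(x))
```

Under these assumptions, GatedAttention(dim=16, num_heads=4) applied to a tensor of shape (2, 5, 16) returns a tensor of the same shape. Computing the gate from x rather than from the attention output lets each position scale down its own attention features per channel; a residual connection, if wanted, could be added by the caller as x + layer(x).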