Build Scaled Dot-Product Attention

Attention Mechanisms DS practice problem on Onlearn.

Difficulty: medium.

Topics: Understanding Scaled Dot-Product Attention in Transformer Architectures, Query-Key-Value paradigm, Attention Scaling Factor, Numerical Stability in Softmax, Sequence Length Handling, Embedding Dimension Normalization, Linear Algebra, Deep Learning, Natural Language Processing, Optimization, Probability Theory, Matrix Transposition, Softmax Normalization, Vector Dot Products, Broadcasting Operations, Tensor Dimensionality.

Implement the Scaled Dot-Product Attention function. The function takes Queries (Q), Keys (K), and Values (V) as input tensors and returns the attention output. The formula is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where the softmax is applied to each row of the score matrix. Assume Q, K, and V are 2D tensors of shape (seq_len, d_k).
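
One possible solution is sketched below. The problem does not name a framework, so the use of NumPy, the function name scaled_dot_product_attention, and the float64 conversion are assumptions made for illustration. The row-wise maximum is subtracted before exponentiating, a standard trick that leaves the softmax unchanged while avoiding overflow for large scores.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2D inputs.

    Q, K, V: array-likes of shape (seq_len, d_k), as stated in the problem.
    The function name and the NumPy backend are illustrative assumptions.
    """
    Q = np.asarray(Q, dtype=np.float64)
    K = np.asarray(K, dtype=np.float64)
    V = np.asarray(V, dtype=np.float64)

    d_k = Q.shape[-1]

    # Raw similarity scores, scaled by sqrt(d_k): shape (seq_len, seq_len).
    scores = (Q @ K.T) / np.sqrt(d_k)

    # Numerically stable row-wise softmax: shifting each row by its maximum
    # does not change the result but keeps exp() from overflowing.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a weighted average of the rows of V.
    return weights @ V
```

A quick sanity check: if Q is all zeros, every score is zero, the attention weights are uniform, and each output row equals the mean of the rows of V.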