Efficient Sparse Window Attention
Attention Mechanisms DS practice problem on Onlearn.
Difficulty: medium.
Topics: Understanding Implement Efficient Sparse Window Attention, Dot-Product Attention, Softmax Normalization, Local Receptive Fields, Index Masking, Broadcasting Semantics, Deep Learning, Numerical Linear Algebra, Sequence Modeling, Computational Complexity, Parallel Computing, Attention Mechanisms, Sparse Matrix Operations, Sliding Window Algorithms, Transformer Architectures, Vectorized Array Manipulation.
Create a function named sparse_window_attention that computes sparse attention over long sequences by sliding a fixed-radius window across the sequence.
• The parameter window_size represents the radius w of the window. For a token at index i, attend only to tokens whose indices run from max(0, i - w) through min(seq_len - 1, i + w), inclusive. Tokens near the beginning or end of the sequence simply have smaller windows; no padding is added.
• Inputs:
  Q, K, V: NumPy arrays with shapes (seq_len, d_k) for Q and K, and (seq_len, d_v) for V.
  window_size: integer window radius.
  scale_factor (optional): value used to scale dot-product scores; if None, default to sqrt(d_k).
• Output: A NumPy array of shape (seq_len, d_v) containing the attention results.
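One possible reading of the specification above is sketched below. It is not a reference solution. It rests on two assumptions the problem statement does not spell out: that "scaling" means dividing the dot-product scores by scale_factor, and that the scores inside each window are normalized with a softmax (suggested by the topic list). The max-subtraction step is a standard numerical-stability trick added here, not something the statement requires.

```python
import numpy as np

def sparse_window_attention(Q, K, V, window_size, scale_factor=None):
    """Sketch: each token i attends to indices max(0, i-w) .. min(seq_len-1, i+w)."""
    Q = np.asarray(Q, dtype=float)
    K = np.asarray(K, dtype=float)
    V = np.asarray(V, dtype=float)
    seq_len, d_k = Q.shape

    # Assumption: scores are divided by the scale factor (sqrt(d_k) by default).
    if scale_factor is None:
        scale_factor = np.sqrt(d_k)

    out = np.zeros((seq_len, V.shape[1]))
    for i in range(seq_len):
        lo = max(0, i - window_size)
        hi = min(seq_len - 1, i + window_size)  # inclusive upper bound

        # Dot products between query i and the keys inside its window only.
        scores = K[lo:hi + 1] @ Q[i] / scale_factor

        # Softmax over the window; subtracting the max avoids overflow in exp.
        scores = scores - scores.max()
        weights = np.exp(scores)
        weights /= weights.sum()

        # Weighted sum of the value vectors inside the window.
        out[i] = weights @ V[lo:hi + 1]
    return out
```

Two edge cases make useful sanity checks for this sketch. With window_size = 0 each token attends only to itself, so the output equals V. With window_size >= seq_len - 1 every window covers the whole sequence, so the result matches ordinary dense scaled dot-product attention. The loop costs O(seq_len * w * (d_k + d_v)) rather than the O(seq_len^2) of the dense form; a fully vectorized variant would trade the Python loop for index masking at the price of more memory.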