Multi-Query Attention (MQA)
Attention Mechanisms DS practice problem on Onlearn.
Difficulty: hard.
Topics: Understanding and Implementing Multi-Query Attention (MQA) for Efficient Inference, KV-Cache Compression, Head Broadcasting, Scaled Dot-Product Attention, Projection Weight Sharing, Multi-Head vs Multi-Query Scaling, Linear Algebra, Deep Learning, Natural Language Processing, Optimization Theory, Computational Complexity, Attention Mechanisms, Tensor Reshaping, Memory Bandwidth Bottlenecks, Inference Acceleration, Transformer Architectures.
Implement a Multi-Query Attention (MQA) layer in PyTorch. Your implementation must accept a hidden dimension 'd_model' and the number of query heads 'num_heads', and must use a single key-value head shared by all query heads. The layer should project the input to Q, K, and V, reshape them for the attention computation, and return the output tensor. Make sure the single K and V head is broadcast across all Q heads during the scaled dot-product attention step.
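One possible shape for such a layer is sketched below. It is a minimal illustration rather than a reference solution, and it rests on several assumptions the problem statement does not fix: the layer does self-attention on an input of shape (batch, seq_len, d_model), the per-head width is d_model // num_heads, there is no attention mask or dropout, and a final output projection is applied. The names MultiQueryAttention, q_proj, k_proj, v_proj, and out_proj are hypothetical.

```python
import math

import torch
import torch.nn as nn


class MultiQueryAttention(nn.Module):
    """Sketch of MQA: num_heads query heads share one key/value head."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads  # assumed per-head width
        # Queries keep the full width: one head_dim slice per query head.
        self.q_proj = nn.Linear(d_model, d_model)
        # Keys and values are projected to a single head of width head_dim.
        self.k_proj = nn.Linear(d_model, self.head_dim)
        self.v_proj = nn.Linear(d_model, self.head_dim)
        # Output projection is an assumption; the task only asks for the output tensor.
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        # (batch, seq, d_model) -> (batch, num_heads, seq, head_dim)
        q = self.q_proj(x).view(batch, seq_len, self.num_heads, self.head_dim)
        q = q.transpose(1, 2)
        # (batch, seq, head_dim) -> (batch, 1, seq, head_dim).
        # The size-1 head axis is what broadcasts against the query heads.
        k = self.k_proj(x).unsqueeze(1)
        v = self.v_proj(x).unsqueeze(1)
        # Scaled dot-product scores: (batch, num_heads, seq, seq)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        weights = torch.softmax(scores, dim=-1)
        # Weighted sum of the shared values: (batch, num_heads, seq, head_dim)
        context = torch.matmul(weights, v)
        # Merge heads back: (batch, seq, d_model)
        context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(context)
```

In this sketch the broadcasting comes from torch.matmul itself: the head axis of K and V has size 1, so it is matched against all num_heads query heads without copying the tensors. Calling the layer on an input of shape (batch, seq_len, d_model) returns a tensor of the same shape, while the key and value projections each hold only head_dim output rows instead of d_model.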