Build a Transformer Encoder Layer

Attention Mechanisms DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding the architectural components of a Transformer Encoder Layer, Scaled Dot-Product Attention, Multi-Head Attention Projection, Layer Normalization, Residual Connections, Positional Encoding integration, Dropout regularization, Deep Learning, Sequence Modeling, Linear Algebra, Optimization, Neural Network Architectures, Attention Mechanisms, Encoder-Decoder Architectures, Normalization Techniques, Gradient Flow, Feed-Forward Networks.

Implement a Transformer Encoder Layer class in PyTorch. The layer must include: 1. Multi-Head Self-Attention, 2. a Position-wise Feed-Forward Network (two linear layers with a ReLU activation between them), 3. two Layer Normalization steps, and 4. residual connections around both the attention and feed-forward sub-layers. Assume the input dimension is 'd_model'.
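
One possible solution is sketched below; it is a minimal illustration, not a reference answer. It assumes batch-first input of shape (batch, seq_len, d_model) and a post-norm arrangement (LayerNorm applied after each residual addition). The names num_heads, d_ff, and dropout, along with their default values, are assumptions not given in the problem statement, and masking is left out for brevity.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention split across several heads."""

    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # A single linear layer produces queries, keys, and values together.
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        qkv = self.qkv_proj(x).view(batch, seq_len, 3, self.num_heads, self.d_head)
        # Rearrange to (3, batch, heads, seq_len, d_head) and unpack along dim 0.
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Scaled dot-product scores: (batch, heads, seq_len, seq_len).
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = self.dropout(torch.softmax(scores, dim=-1))
        # Merge heads back into (batch, seq_len, d_model).
        context = (weights @ v).transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(context)


class TransformerEncoderLayer(nn.Module):
    """Post-norm encoder layer: x -> LN(x + Attn(x)) -> LN(x + FFN(x))."""

    def __init__(self, d_model, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadSelfAttention(d_model, num_heads, dropout)
        # Position-wise feed-forward network: applied to each position independently.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: self-attention, residual connection, then LayerNorm.
        x = self.norm1(x + self.dropout(self.self_attn(x)))
        # Sub-layer 2: feed-forward, residual connection, then LayerNorm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x


if __name__ == "__main__":
    layer = TransformerEncoderLayer(d_model=64, num_heads=4, d_ff=128)
    x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
    print(layer(x).shape)
```

The output has the same shape as the input, so layers of this kind can be stacked. A pre-norm variant would instead apply LayerNorm to the input of each sub-layer before the residual addition; either arrangement satisfies the four requirements listed above.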