Pre-Norm vs Post-Norm Transformer Block
Initialization, Normalization & Regularization DS practice problem on Onlearn.
Difficulty: hard.
Topics: Understanding the architectural impact of Layer Normalization placement in Transformer blocks, Pre-norm path stability, Post-norm gradient flow, Identity mapping path, Normalization placement, Training convergence speed, Deep Learning, Natural Language Processing, Optimization Algorithms, Gradient Dynamics, Neural Network Architecture, Layer Normalization, Residual Connections, Vanishing/Exploding Gradients, Transformer Blocks, Attention Mechanisms.
Implement two PyTorch classes, PostNormBlock and PreNormBlock. Each block consists of a multi-head attention layer followed by a feed-forward network, with a residual connection around each sub-layer. In PostNormBlock, apply LayerNorm after the residual addition, i.e. x = LayerNorm(x + Sublayer(x)). In PreNormBlock, apply LayerNorm to the sub-layer input and leave the residual path untouched, i.e. x = x + Sublayer(LayerNorm(x)).
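One possible way to structure the two blocks is sketched below. It is a minimal illustration rather than a reference solution: the constructor arguments (d_model, n_heads, d_ff, dropout), the shared helper base class _BlockBase, the GELU activation, the batch-first input layout, and the absence of attention masks are all assumptions that the problem statement does not specify.

```python
import torch
import torch.nn as nn


class _BlockBase(nn.Module):
    """Hypothetical shared base: holds the sub-layers both variants use."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.0):
        super().__init__()
        # Self-attention over inputs shaped (batch, seq_len, d_model).
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        # Position-wise feed-forward network (GELU is an assumed choice).
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def _self_attention(self, x: torch.Tensor) -> torch.Tensor:
        # Query, key, and value are all x; attention weights are not needed.
        out, _ = self.attn(x, x, x, need_weights=False)
        return out


class PostNormBlock(_BlockBase):
    """LayerNorm applied after each residual addition."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.drop(self._self_attention(x)))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x


class PreNormBlock(_BlockBase):
    """LayerNorm applied to each sub-layer input; residual path is untouched."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.drop(self._self_attention(self.norm1(x)))
        x = x + self.drop(self.ffn(self.norm2(x)))
        return x
```

Both blocks map a tensor of shape (batch, seq_len, d_model) to a tensor of the same shape, for example PreNormBlock(64, 8, 256)(torch.randn(2, 10, 64)). In the pre-norm variant the input reaches the output through additions only, which is the identity mapping path listed in the topics. In the post-norm variant every residual sum passes through a LayerNorm, so the block output is always normalized. Pre-norm stacks are commonly followed by one final LayerNorm after the last block, which this sketch leaves out.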