Pre-Norm vs Post-Norm Transformer Block

Initialization, Normalization & Regularization DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding the architectural impact of Layer Normalization placement in Transformer blocks, Pre-norm path stability, Post-norm gradient flow, Identity mapping path, Normalization placement, Training convergence speed, Deep Learning, Natural Language Processing, Optimization Algorithms, Gradient Dynamics, Neural Network Architecture, Layer Normalization, Residual Connections, Vanishing/Exploding Gradients, Transformer Blocks, Attention Mechanisms.

Implement two classes, PostNormBlock and PreNormBlock, using PyTorch. Each block should consist of a multi-head attention layer and a feed-forward network, each wrapped in a residual connection. For Post-Norm, apply LayerNorm after the residual addition. For Pre-Norm, apply LayerNorm to the sub-layer input, before the sub-layer transformation, so that the residual (identity) path itself is never normalized.
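One possible shape for the two blocks is sketched below. It is not a reference solution. The constructor arguments (d_model, n_heads, d_ff, dropout), the GELU activation, the self-attention-only forward pass, and the batch-first input layout are all assumptions made for illustration, since the problem statement does not fix them.

```python
import torch
import torch.nn as nn


def _make_ffn(d_model: int, d_ff: int) -> nn.Sequential:
    # Two-layer position-wise feed-forward network (GELU is an assumed choice).
    return nn.Sequential(
        nn.Linear(d_model, d_ff),
        nn.GELU(),
        nn.Linear(d_ff, d_model),
    )


class PostNormBlock(nn.Module):
    """Post-Norm: x <- LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.0):
        super().__init__()
        # batch_first=True means inputs are (batch, seq_len, d_model).
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.ffn = _make_ffn(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention, residual addition, then normalization.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.drop(attn_out))
        # Feed-forward, residual addition, then normalization.
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x


class PreNormBlock(nn.Module):
    """Pre-Norm: x <- x + Sublayer(LayerNorm(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.ffn = _make_ffn(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize only the sub-layer input; the residual path stays untouched.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop(attn_out)
        # Same pattern for the feed-forward sub-layer.
        x = x + self.drop(self.ffn(self.norm2(x)))
        return x
```

Under these assumptions, both blocks map an input of shape (batch, seq_len, d_model) to an output of the same shape, for example PreNormBlock(16, 4, 32)(torch.randn(2, 5, 16)). The only structural difference is where LayerNorm sits: in the Post-Norm block every residual sum passes through a normalization, while in the Pre-Norm block the input x reaches the output through plain additions, which is the identity mapping path named in the topics.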
