Dynamic Tanh: Normalization-Free Transformer Activation

Neural Units & Activations DS practice problem on Onlearn.

Difficulty: easy.

Topics: Understanding Dynamic Tanh: Normalization-Free Transformer Activation, Non-linear Squashing Functions, Internal Covariate Shift, Vanishing Gradient Mitigation, Learnable Scaling Parameters, Normalization-Free Architectures, Deep Learning Foundations, Numerical Optimization, Transformer Architectures, Signal Processing, Computational Linear Algebra, Neural Units and Activations, Normalization Techniques, Gradient Flow Dynamics, Attention Mechanisms, Weight Initialization Strategies.

Implement the Dynamic Tanh (DyT) function, a normalization-free transformation inspired by the tanh function. DyT is designed to replace layer normalization in Transformer architectures while preserving tanh's squashing behavior and keeping training stable.
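
The statement above does not spell out the formula, so the following is only a minimal sketch. It assumes the commonly cited formulation DyT(x) = gamma * tanh(alpha * x) + beta, where alpha is a learnable scalar that controls how sharply inputs are squashed, and gamma and beta are a learnable scale and shift. The function name `dynamic_tanh`, the flat-list input, and the use of scalar gamma and beta (rather than per-channel vectors) are assumptions made for illustration, not part of the problem statement.

```python
import math
from typing import List, Sequence


def dynamic_tanh(x: Sequence[float], alpha: float, gamma: float, beta: float) -> List[float]:
    """Apply an assumed DyT form element-wise: gamma * tanh(alpha * x) + beta.

    alpha scales the input before squashing; gamma and beta rescale and
    shift the squashed value. Scalars are used here for simplicity.
    """
    return [gamma * math.tanh(alpha * v) + beta for v in x]


if __name__ == "__main__":
    # Large-magnitude inputs saturate at -gamma + beta and gamma + beta.
    print(dynamic_tanh([-1000.0, 0.0, 1000.0], alpha=0.5, gamma=1.0, beta=0.0))
    # → [-1.0, 0.0, 1.0]
```

Unlike layer normalization, this sketch computes no mean or variance over the input: each element is transformed independently, and the output stays bounded between beta - |gamma| and beta + |gamma|.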