Instability in the Deadly Triad
Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.
Difficulty: medium.
Topics: Instability in the Deadly Triad, Saddle Point Dynamics, Nash Equilibrium Convergence, Lyapunov Exponents, Rademacher Complexity, Spectral Radius, Optimization Theory, Game Theory, Dynamical Systems, Statistical Learning Theory, Numerical Linear Algebra, Non-convex Optimization, Multi-Agent Reinforcement Learning, Stability Analysis, Generalization Bounds, Eigenvalue Decomposition.
Implement a function that demonstrates the Deadly Triad in reinforcement learning: the combination of (1) function approximation, (2) bootstrapping, and (3) off-policy learning can cause value estimates to diverge.

You will simulate a 7-state MDP (states 0-6) with two actions:
  Action 0 ("solid"): transitions uniformly at random to one of states 0-5.
  Action 1 ("dashed"): always transitions to state 6.
All rewards are 0.

The behavior policy selects action 0 with probability 6/7 and action 1 with probability 1/7. The target policy always selects action 1.

You must run two experiments using semi-gradient TD(0) with importance sampling:
  1. Triad scenario: use linear function approximation with a specific 7x8 feature matrix in which state i (for i in 0-5) has features [0,...,2,...,0,1] (2 in position i, 1 in position 7), and state 6 has features [0,0,0,0,0,0,1,2]. Initial weight vector: [1,1,1,1,1,1,10,1].
  2. Stable scenario: use a tabular representation (7x7 identity feature matrix), which removes function approximation from the triad. Initial weight vector: [1,1,1,1,1,1,10].

Both experiments use the same behavior and target policies and the same random seed, so the comparison is fair. At each step, sample an action from the behavior policy, compute the importance sampling ratio, observe the transition, and apply the semi-gradient TD(0) update. Clip all weight values to [-1e10, 1e10] for numerical stability.

Define divergence as the final weight norm exceeding 2x the initial weight norm.

Return a dictionary with keys:
  'triad_diverged': bool
  'stable_diverged': bool
  'triad_norm_ratio': round(final norm / initial norm, 4)
  'stable_norm_ratio': round(final norm / initial norm, 4)

Use np.random.RandomState for reproducibility, with separate instances (same seed) for each scenario.
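The steps described above can be sketched as follows. This is a minimal illustration, not a reference solution: the statement does not give a step size, discount factor, step count, seed, or start state, so alpha=0.01, gamma=0.99, n_steps=1000, seed=42, and a uniformly random start state are all assumptions, as are the helper names build_features, run_td0, and deadly_triad and the order in which random numbers are drawn. The dictionary keys are written with underscores.

```python
import numpy as np

def build_features(tabular):
    """Return the 7x7 identity (tabular) or the 7x8 feature matrix."""
    if tabular:
        return np.eye(7)
    phi = np.zeros((7, 8))
    for i in range(6):
        phi[i, i] = 2.0   # 2 in position i
        phi[i, 7] = 1.0   # 1 in position 7
    phi[6, 6] = 1.0       # state 6: [0,0,0,0,0,0,1,2]
    phi[6, 7] = 2.0
    return phi

def run_td0(phi, w0, seed, n_steps, alpha, gamma):
    """Off-policy semi-gradient TD(0); returns final norm / initial norm."""
    rng = np.random.RandomState(seed)
    w = np.array(w0, dtype=float)
    initial_norm = np.linalg.norm(w)
    state = rng.randint(7)  # assumption: uniformly random start state
    for _ in range(n_steps):
        # Behavior policy: action 1 with probability 1/7, else action 0.
        action = 1 if rng.rand() < 1.0 / 7.0 else 0
        # Target policy always takes action 1, so rho = 1 / (1/7) or 0.
        rho = 7.0 if action == 1 else 0.0
        next_state = 6 if action == 1 else rng.randint(6)
        # All rewards are 0, so the TD error is purely bootstrapped.
        td_error = gamma * (phi[next_state] @ w) - phi[state] @ w
        w = w + alpha * rho * td_error * phi[state]
        w = np.clip(w, -1e10, 1e10)
        state = next_state
    return np.linalg.norm(w) / initial_norm

def deadly_triad(seed=42, n_steps=1000, alpha=0.01, gamma=0.99):
    # Hyperparameter defaults are assumptions, not part of the statement.
    triad = run_td0(build_features(False), [1, 1, 1, 1, 1, 1, 10, 1],
                    seed, n_steps, alpha, gamma)
    stable = run_td0(build_features(True), [1, 1, 1, 1, 1, 1, 10],
                     seed, n_steps, alpha, gamma)
    return {
        'triad_diverged': bool(triad > 2.0),
        'stable_diverged': bool(stable > 2.0),
        'triad_norm_ratio': round(float(triad), 4),
        'stable_norm_ratio': round(float(stable), 4),
    }
```

Calling deadly_triad() returns the four-key dictionary. The reported ratios and divergence flags depend on the assumed hyperparameters, so they should be compared against the intended settings before being treated as expected answers.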