Baird's Counterexample: Off-Policy TD Divergence
Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.
Difficulty: medium.
Topics: Baird's Counterexample: Off-Policy TD Divergence, Deadly Triad, Baird's Star Graph, Bellman Residual Minimization, Weight Divergence, Target Policy Mismatch, Reinforcement Learning, Numerical Analysis, Control Theory, Stochastic Processes, Functional Analysis, Temporal Difference Learning, Off-Policy Evaluation, Function Approximation, Dynamic Programming, Stability Analysis.
Implement a function that runs semi gradient off policy TD(0) on a classic counterexample that demonstrates divergence when combining function approximation, bootstrapping, and off policy learning. The environment has 7 states (indexed 0 through 6). There are two actions: Dashed : transitions to a uniformly random state among states 0 5 Solid : always transitions to state 6 All transitions yield zero reward. The behavior policy selects dashed with probability 6/7 and solid with probability 1/7. The target policy always selects solid. The value function is approximated linearly as V(s) = phi(s)^T w, where w is a weight vector of size 8 and the feature vectors are: States 0 5: phi(s) has a 2 in position s, a 1 in position 7, and 0 elsewhere State 6: phi(6) has a 1 in position 6, a 2 in position 7, and 0 elsewhere Since the target policy always selects solid (going to state 6) and the behavior policy selects solid with probability 1/7, the importance sampling ratio rho is 7 for solid actions and 0 for dashed actions. Only solid action transitions produce non zero updates. Your function receives a list of states from which the solid action was taken (since dashed action updates are zero and can be ignored). For each state s in this list, apply the semi gradient off policy TD(0) update: w < w + alpha rho (reward + gamma phi(next state)^T w phi(s)^T w) phi(s) where next state is always 6, reward is always 0, and rho is always 7.