Residual Gradient Algorithm for Value Function Approximation
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Residual Gradient Algorithm for Value Function Approximation, Residual Gradient, Fixed-Point Iteration, Contraction Mapping, Mean Squared Bellman Error, Semi-Gradient Descent, Reinforcement Learning, Numerical Optimization, Statistical Learning Theory, Control Theory, Functional Analysis, Temporal Difference Learning, Stochastic Approximation, Function Approximation, Bellman Equation Operators, Gradient-Based Parameter Updates.
Implement the Residual Gradient algorithm for policy evaluation with linear function approximation in reinforcement learning. Standard semi gradient TD methods only differentiate part of the temporal difference error with respect to the weights, ignoring the gradient through the bootstrapped target. The residual gradient algorithm instead minimizes the full mean squared Bellman error by computing the complete gradient, including the contribution from the next state value estimate. For linear function approximation, the value of state s is V(s; w) = phi(s)^T w, where phi(s) is the feature vector for state s. Given: episodes: a list of episodes, where each episode is a list of (state, reward, next state) tuples. A next state of 1 indicates a terminal transition (next state value is 0, feature vector is the zero vector). features: a numpy array of shape (n states, d) containing the feature vector for each state. gamma: the discount factor. alpha: the step size (learning rate). n passes: the number of complete passes through all episodes. Your function should: 1. Initialize the weight vector w to zeros. 2. For each pass, iterate through every episode and every transition in order. 3. For each transition, compute the TD error and update the weights using the residual gradient update rule (the full gradient of the squared TD error with respect to w). 4. Return the final weight vector as a 1D numpy array of shape (d,).