Gradient Monte Carlo Algorithm

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Gradient Monte Carlo Algorithm, Semi-gradient Updates, Bootstrapping, Weight Vector Convergence, Mean Squared Value Error, Eligibility Traces, Reinforcement Learning, Stochastic Optimization, Statistical Inference, Dynamic Programming, Function Approximation, Temporal Difference Learning, Monte Carlo Methods, Parametric Value Estimation, Stochastic Approximation, Policy Evaluation.

Implement the Gradient Monte Carlo algorithm for value function approximation with linear features. In many reinforcement learning problems, the state space is too large to store values in a table. Instead, we approximate the value function using a parameterized function. The Gradient Monte Carlo method uses complete episode returns as unbiased targets to update the parameters via stochastic gradient descent. Your task is to implement a function that learns weight parameters for a linear value function approximation using Gradient Monte Carlo updates. The function should: Accept a list of episodes, where each episode is a list of (state features, reward) tuples. The state features is a list of floats representing the feature vector for that state, and reward is the immediate reward received after visiting that state. Accept a learning rate (alpha), discount factor (gamma), and number of training epochs (n epochs). Initialize weights to zeros. For each epoch, iterate through all episodes. For each episode, first compute the discounted return for every time step, then update the weight vector at each visited state using the gradient of the squared error between the predicted value and the actual return. The value function approximation is linear: the predicted value is the dot product of the weight vector and the state feature vector. Return the final learned weight vector as a list of floats, each rounded to 4 decimal places.