Semi-Gradient TD(lambda) with Eligibility Traces and Linear Function Approximation
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Semi-Gradient TD(lambda) with Eligibility Traces and Linear Function Approximation, Semi-Gradient Descent, Accumulating Traces, Feature Vector Representation, Bootstrapping, Bias-Variance Tradeoff, Reinforcement Learning, Optimization Theory, Linear Algebra, Stochastic Processes, Functional Analysis, Temporal Difference Learning, Linear Function Approximation, Eligibility Traces, Gradient-Based Learning, Markov Decision Processes.
Implement the semi gradient TD(lambda) algorithm using accumulating eligibility traces with linear function approximation for policy evaluation. In many real world reinforcement learning problems, the state space is too large to maintain a tabular value function. Instead, we approximate the value function using a parameterized function. With linear function approximation, the estimated value of a state is the dot product of its feature vector and a weight vector. The semi gradient method updates the weight vector using the TD error, but importantly, the gradient is only taken through the current state's value estimate (not through the bootstrap target). Eligibility traces allow credit assignment across multiple time steps, controlled by the lambda parameter. Given: episodes: a list of episodes. Each episode is a list of tuples (state features, reward, next state features). The state features and next state features are lists of floats representing the feature vector for that state. If the transition leads to a terminal state, next state features is None (and the bootstrapped value should be 0). n features: the dimensionality of the feature vectors. alpha: the learning rate (step size). gamma: the discount factor. lam: the lambda parameter controlling the trace decay. initial w: an optional list of initial weights. If None, initialize weights to zeros. For each episode, the eligibility trace vector must be reset to zeros. Process each transition in order, updating the trace and weights at every step. Return the final weight vector as a list of floats, each rounded to 4 decimal places.