True Online TD(λ) with Linear Function Approximation
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: True Online TD(λ) with Linear Function Approximation, True Online Update Rule, Accumulating Traces, Feature Vector Representation, Weight Vector Convergence, Bias-Variance Tradeoff, Reinforcement Learning, Linear Algebra, Optimization Theory, Stochastic Processes, Functional Analysis, Temporal Difference Learning, Linear Function Approximation, Eligibility Traces, Gradient-Based Learning, Markov Decision Processes.
Implement the True Online TD(λ) algorithm for policy evaluation with linear function approximation. Unlike conventional TD(λ) with accumulating or replacing traces, True Online TD(λ) produces weight updates that are mathematically equivalent to those of the online forward-view λ-return algorithm. It achieves this through a modified eligibility trace (sometimes called the "dutch trace") and a corrective weight update that accounts for the difference between old and new value estimates. Your function should process one or more episodes of experience and return the learned weight vector.

Parameters:
  episodes: A list of episodes. Each episode is a list of tuples (state_features, reward, next_state_features, done), where:
    state_features: A list of floats representing the feature vector for the current state
    reward: A float representing the reward received
    next_state_features: A list of floats representing the feature vector for the next state (all zeros if terminal)
    done: A boolean indicating whether the transition leads to a terminal state
  n_features: An integer specifying the dimensionality of the feature vectors
  alpha: A float for the step size
  gamma: A float for the discount factor
  lam: A float for the trace decay parameter (λ)

Returns:
  A NumPy array of shape (n_features,) containing the final weight vector after processing all episodes.

Notes:
  Initialize the weight vector to zeros.
  Reset the eligibility trace and the auxiliary scalar (the previous value estimate, often written V_old) at the start of each episode.
  Use only NumPy.
  The value function is V(s) = w^T x(s), where x(s) is the feature vector.
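The following is a minimal sketch of one way to meet this specification, not a reference solution. It follows the commonly cited form of the update: with V = w^T x, V' = w^T x', and δ = r + γV' − V, the dutch trace is z ← γλz + (1 − αγλ z^T x) x and the weights are w ← w + α(δ + V − V_old) z − α(V − V_old) x, after which V_old ← V'. The function name true_online_td_lambda and the underscore spelling of the parameter names are assumptions, since the problem statement does not fix them. Forcing V' to zero on terminal transitions is also an assumption; it agrees with the all-zeros terminal feature vector described above.

```python
import numpy as np


def true_online_td_lambda(episodes, n_features, alpha, gamma, lam):
    """Sketch of True Online TD(lambda) policy evaluation (hypothetical name)."""
    w = np.zeros(n_features)  # weight vector starts at zero

    for episode in episodes:
        z = np.zeros(n_features)  # dutch trace, reset every episode
        v_old = 0.0               # auxiliary scalar, reset every episode

        for state_features, reward, next_state_features, done in episode:
            x = np.asarray(state_features, dtype=float)
            x_next = np.asarray(next_state_features, dtype=float)

            v = w @ x
            # Assumption: a terminal next state always has value zero.
            v_next = 0.0 if done else w @ x_next
            delta = reward + gamma * v_next - v

            # Dutch trace update (uses the trace from the previous step).
            z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x

            # Weight update with the corrective (v - v_old) terms.
            w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x

            v_old = v_next

    return w


# Usage: a two-step episode with one-hot features and a final reward of 1.
episodes = [[
    ([1.0, 0.0], 0.0, [0.0, 1.0], False),
    ([0.0, 1.0], 1.0, [0.0, 0.0], True),
]]
w = true_online_td_lambda(episodes, n_features=2, alpha=0.5, gamma=1.0, lam=1.0)
# With lam=1.0 both visited states receive credit: w is [0.5, 0.5].
```

Setting lam=0 collapses the trace to the current feature vector and cancels the corrective terms, so the same sketch reduces to ordinary linear TD(0); on the episode above it would update only the second weight.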