TD(λ) Convergence Verification

Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.

Difficulty: medium.

Topics: TD(λ) Convergence Verification, Bias-Variance Tradeoff, Contraction Mapping, Lyapunov Stability, Bootstrapping, Trace Decay Parameter, Reinforcement Learning, Stochastic Processes, Numerical Analysis, Statistical Learning Theory, Dynamic Programming, Temporal Difference Learning, Markov Decision Processes, Convergence Analysis, Function Approximation, Eligibility Traces.

Implement a function td lambda convergence that runs the backward view TD(λ) algorithm with accumulating eligibility traces on a tabular Markov Reward Process (MRP) and monitors its convergence against known true values. The function processes a sequence of pre generated episodes, updates the value function using TD(λ) after each transition, and after each complete episode, computes the Root Mean Squared Error (RMSE) between the current value estimates and the provided true values. Inputs: episodes: A list of episodes. Each episode is a list of (state, reward, next state, done) tuples. n states: Integer, total number of states in the MRP. gamma: Float, the discount factor. lam: Float, the trace decay parameter (lambda). alpha: Float, the step size (learning rate). true values: A numpy array of shape (n states,) containing the true value function for RMSE computation. tolerance: Float, the RMSE threshold below which the algorithm is considered converged. Algorithm Details: Initialize all value estimates to zero. For each episode, initialize a fresh eligibility trace vector of zeros. For each transition within an episode: Compute the TD error using the current value estimates (bootstrap value is zero for terminal transitions). Decay the eligibility trace by gamma lam, then increment the trace for the current state by 1 (accumulating traces). Update all value estimates using the TD error scaled by the step size and the eligibility trace vector. After processing each episode, compute the RMSE across all states between current estimates and true values. Track the first episode index where RMSE drops below the tolerance. Output: Return a tuple (v final, rmse history, convergence episode) where: v final: A list of floats (rounded to 4 decimal places) representing the final value estimates for each state. rmse history: A list of floats (rounded to 4 decimal places) representing the RMSE after each episode. convergence episode: Integer index of the first episode where RMSE < tolerance, or 1 if convergence was never achieved.