Least-Squares Temporal Difference (LSTD) for Policy Evaluation

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Least-Squares Temporal Difference (LSTD) for Policy Evaluation, Bellman Operator, Moore-Penrose Pseudoinverse, Eligibility Traces, Basis Function Projection, Sample Covariance Matrix, Reinforcement Learning, Linear Algebra, Statistical Inference, Numerical Optimization, Stochastic Processes, Temporal Difference Learning, Matrix Factorization, Linear Function Approximation, Fixed-Point Iteration, Monte Carlo Estimation.

Implement the Least Squares Temporal Difference (LSTD) algorithm for policy evaluation with linear function approximation. In many reinforcement learning settings, we want to evaluate a policy by estimating a value function of the form $V(s) = \phi(s)^T w$, where $\phi(s)$ is a feature vector for state $s$ and $w$ is a weight vector. Rather than performing incremental updates like standard TD(0), LSTD accumulates statistics from observed transitions and computes the weight vector in a single batch computation. Given: features: a 2D array of shape (T, d) where row t is the feature vector $\phi(s t)$ for the state at time step t next features: a 2D array of shape (T, d) where row t is the feature vector $\phi(s {t+1})$ for the next state (use a zero vector for terminal states) rewards: a 1D array of length T containing the reward $r t$ observed at each transition gamma: the discount factor epsilon: a small positive constant for regularization (added to the diagonal of the accumulated matrix before solving) Your function should return the weight vector w as a list of floats. The algorithm should build up a matrix and a vector from the observed transitions, then solve a linear system to find the weights that satisfy the projected fixed point equation for temporal difference learning.