TD(λ) Asymptotic Error Bound Verification

Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.

Difficulty: medium.

Topics: TD(λ) Asymptotic Error Bound Verification, Eligibility Traces, Bellman Operator, Lyapunov Stability, Mean Squared Projected Bellman Error, Spectral Radius, Reinforcement Learning, Stochastic Processes, Numerical Analysis, Statistical Learning Theory, Functional Analysis, Temporal Difference Learning, Markov Decision Processes, Asymptotic Convergence Analysis, Contraction Mapping Theorems, Bias-Variance Tradeoff.

Implement a function that computes and verifies the asymptotic error bound of TD(lambda) with linear function approximation for a given Markov Reward Process (MRP). In reinforcement learning, TD(lambda) with linear function approximation converges to a fixed point solution whose approximation error can be theoretically bounded relative to the best possible linear approximation error (the Monte Carlo projection error). Given an MRP defined by: A transition probability matrix P of shape (n, n) An expected reward vector r of shape (n,) A discount factor gamma in [0, 1) A feature matrix Phi of shape (n, k) for linear value function approximation V(s) = Phi @ w A stationary distribution d of shape (n,) (used as the weighting in the D norm) A list of lambda values to evaluate Your function should compute: 1. The true value function V of the MRP 2. The Monte Carlo (MC) projection of V onto the feature space under the D weighted norm, and its error 3. For each lambda, the TD(lambda) fixed point solution and its error under the D weighted norm 4. The error amplification ratio (TD error / MC error) for each lambda 5. The theoretical upper bound on the amplification ratio 6. Whether the bound holds for each lambda value The D weighted norm is defined as ||x|| D = sqrt(x^T D x) where D = diag(d). Return a dictionary with keys: 'true values', 'mc error', 'td errors', 'ratios', 'bound', 'bound holds'. All numeric values should be rounded to 4 decimal places. The 'bound holds' entry should be a list of booleans.