Retrace(λ) Implementation
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Retrace(λ) Implementation, Truncated Importance Sampling, Contraction Mapping, Bias-Variance Tradeoff, Safe Policy Improvement, Operator Convergence, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Learning Theory, Numerical Optimization, Temporal Difference Learning, Off-Policy Evaluation, Importance Sampling, Eligibility Traces, Policy Gradient Methods.
Implement the Retrace(λ) algorithm for computing corrected off-policy action-value targets from a trajectory of experience. Retrace(λ) is a safe, off-policy, return-based algorithm that uses truncated importance sampling ratios to correct for the difference between a target policy and a behavior policy. It combines multi-step returns with trace coefficients that are clipped to prevent high variance. Given a trajectory of T transitions, compute the Retrace target for each timestep.

Inputs:
rewards: numpy array of shape (T,) containing rewards at each step
Q_values: numpy array of shape (T,) containing Q(s_k, a_k) estimates for each step
expected_next_Q: numpy array of shape (T,) containing the expected value under the target policy at the next state for each step (0.0 for terminal transitions)
target_pi: numpy array of shape (T,) containing the target policy probability of the action taken at each step
behavior_mu: numpy array of shape (T,) containing the behavior policy probability of the action taken at each step
dones: numpy array of shape (T,) where 1.0 indicates the episode terminated at that step, 0.0 otherwise
gamma: discount factor (float)
lam: trace decay parameter lambda (float)

The trace coefficient at each step j is computed as lambda multiplied by the importance sampling ratio clipped at 1. The TD error at each step uses the reward, the discounted expected next Q value (accounting for terminal states), and the current Q value. Retrace targets should be computed using a backward recursive formulation, where the trace is cut at episode boundaries.

Return a numpy array of shape (T,) containing the Retrace target for each timestep, rounded to 4 decimal places.
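One possible reading of the statement above, as a minimal sketch rather than a reference solution. The function name retrace_targets is hypothetical, and the parameter names assume the inputs are spelled with underscores. The statement does not fix the indexing of the recursion, so the sketch assumes the usual form in which the correction carried back from step t+1 is scaled by the trace coefficient of step t+1, and it assumes every behavior probability is strictly positive.

```python
import numpy as np


def retrace_targets(rewards, Q_values, expected_next_Q, target_pi,
                    behavior_mu, dones, gamma, lam):
    """Sketch of backward-recursive Retrace(lambda) targets (assumed indexing)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    Q_values = np.asarray(Q_values, dtype=np.float64)
    expected_next_Q = np.asarray(expected_next_Q, dtype=np.float64)
    target_pi = np.asarray(target_pi, dtype=np.float64)
    behavior_mu = np.asarray(behavior_mu, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)

    # Trace coefficient: c_j = lam * min(1, pi_j / mu_j).
    # Assumes behavior_mu > 0 everywhere (no division-by-zero guard here).
    c = lam * np.minimum(1.0, target_pi / behavior_mu)

    # TD error: delta_t = r_t + gamma * (1 - done_t) * E_pi[Q(s_{t+1}, .)] - Q_t.
    delta = rewards + gamma * not_done * expected_next_Q - Q_values

    T = rewards.shape[0]
    targets = np.zeros(T, dtype=np.float64)
    # carry holds c_{t+1} * (target_{t+1} - Q_{t+1}); zero beyond the trajectory.
    carry = 0.0
    for t in range(T - 1, -1, -1):
        # Multiplying by not_done[t] cuts the trace at an episode boundary.
        correction = delta[t] + gamma * not_done[t] * carry
        targets[t] = Q_values[t] + correction
        carry = c[t] * correction
    return np.round(targets, 4)


# Illustrative inputs (made up for this sketch, not taken from the problem).
example = retrace_targets(
    rewards=np.array([1.0, 0.0, 2.0]),
    Q_values=np.array([0.5, 0.4, 0.3]),
    expected_next_Q=np.array([0.6, 0.2, 0.0]),
    target_pi=np.array([0.5, 0.9, 0.2]),
    behavior_mu=np.array([0.5, 0.3, 0.4]),
    dones=np.array([0.0, 0.0, 1.0]),
    gamma=0.9,
    lam=1.0,
)
# example is approximately [2.0305, 0.945, 2.0]
```

With lam = 0 every trace coefficient vanishes and each target reduces to the one-step expected TD target, which is a quick sanity check on any implementation. If the intended solution applies the coefficient of step t instead of step t+1, only the line that updates carry would need to change.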