Lambda-Return Computation
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Lambda-Return Computation, Eligibility Traces, Geometric Weighting, N-step Returns, Contraction Mapping, Truncated Expectation, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Numerical Analysis, Statistical Inference, Temporal Difference Learning, Monte Carlo Methods, Policy Evaluation, Bias-Variance Tradeoff, Bootstrapping Algorithms.
Implement a function that computes the lambda return for each timestep in a reinforcement learning trajectory. The lambda return is a target for value function learning that smoothly interpolates between one-step temporal difference targets (when lambda=0) and full Monte Carlo returns (when lambda=1). It achieves this by forming an exponentially weighted combination of all n-step returns.

Your function should:
1. Accept arrays of rewards and value estimates for a trajectory, along with a discount factor gamma and a trace decay parameter lambda.
2. Optionally accept a binary array indicating episode terminations (dones), where a 1 at index t means the episode ended after receiving reward t.
3. When an episode terminates at step t, treat the bootstrap value at the next state as zero, and do not propagate the lambda-weighted recursion across the boundary.
4. Return an array of lambda returns, one for each timestep in the trajectory.

The inputs are:
rewards: array of shape (T,) containing the rewards received at each step
values: array of shape (T+1,) containing value estimates for each state (including the final/bootstrap state)
gamma: scalar discount factor
lam: scalar lambda parameter in [0, 1]
dones: optional array of shape (T,) with binary episode termination flags (default: all zeros, meaning no terminations)

The output should be an array of shape (T,) containing the lambda return for each timestep.
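One possible approach, offered as a minimal sketch rather than a reference solution, is the backward recursion G[t] = rewards[t] + gamma * (1 - dones[t]) * ((1 - lam) * values[t+1] + lam * G[t+1]), started from G[T] = values[T]. This recursive form is a standard way to evaluate the exponentially weighted sum of n-step returns, but it is an assumption here: the problem statement does not prescribe it. The function name lambda_returns and the use of NumPy are also illustrative choices.

```python
import numpy as np


def lambda_returns(rewards, values, gamma, lam, dones=None):
    """Sketch of a backward-recursive lambda-return computation.

    Assumed recursion (not prescribed by the problem statement):
        G[t] = r[t] + gamma * (1 - d[t]) * ((1 - lam) * V[t+1] + lam * G[t+1])
    with G[T] initialised to the bootstrap value V[T].
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = rewards.shape[0]
    if values.shape[0] != T + 1:
        raise ValueError("values must have shape (T+1,) when rewards has shape (T,)")

    # Default: no terminations anywhere in the trajectory.
    if dones is None:
        dones = np.zeros(T)
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)

    returns = np.zeros(T)
    next_return = values[T]  # bootstrap from the final state's value estimate
    for t in reversed(range(T)):
        # Blend the one-step bootstrap with the lambda return of the next step.
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        # not_done[t] == 0 zeroes both the bootstrap value and the recursion,
        # so nothing propagates across an episode boundary.
        returns[t] = rewards[t] + gamma * not_done[t] * blended
        next_return = returns[t]
    return returns


# Example usage with made-up numbers.
example = lambda_returns([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], gamma=0.9, lam=1.0)
```

With lam=0 the blended term reduces to values[t+1], giving the one-step TD target; with lam=1 it reduces to the next return, giving the discounted Monte Carlo return bootstrapped from values[T] at the end of the trajectory.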