GAE Advantages with Episode Boundaries

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: GAE Advantages with Episode Boundaries, Generalized Advantage Estimation, Terminal State Masking, Discount Factor Gamma, Generalized Lambda Parameter, Value Function Approximation, Reinforcement Learning, Probability Theory, Calculus, Optimization Theory, Software Engineering, Temporal Difference Learning, Policy Gradient Methods, Variance Reduction Techniques, Markov Decision Processes, Computational Complexity.

Implement a function that computes Generalized Advantage Estimation (GAE) advantages and the corresponding returns for a trajectory, properly handling episode boundaries.

Your function should take:

rewards: a list of rewards [r_0, r_1, ..., r_{T-1}] of length T
values: a list of estimated state values [V(s_0), V(s_1), ..., V(s_T)] of length T+1 (the last value is the bootstrap value for the state after the final action)
dones: a list of done flags [d_0, d_1, ..., d_{T-1}] of length T, where 1 indicates a terminal state and 0 indicates a non-terminal state
gamma: the discount factor
lam: the GAE lambda parameter that controls the bias-variance tradeoff

When an episode terminates (done=1), the bootstrapped next-state value should be treated as zero, and the advantage accumulation should not propagate across episode boundaries.

The function should return a tuple of two lists:

1. advantages: the GAE advantages for each timestep, rounded to 4 decimal places
2. returns: the target returns, computed as the advantages plus the corresponding state values, rounded to 4 decimal places
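
One way to meet this specification is a single backward pass over the trajectory. The sketch below is an illustration, not a reference solution. It assumes the standard GAE recursion delta_t = r_t + gamma * V(s_{t+1}) * (1 - d_t) - V(s_t) and A_t = delta_t + gamma * lam * (1 - d_t) * A_{t+1}. The function name compute_gae is a hypothetical choice, and computing the returns from the unrounded advantages before rounding is likewise an assumption about the intended behaviour.

```python
def compute_gae(rewards, values, dones, gamma, lam):
    """Sketch of GAE with episode-boundary masking (hypothetical name)."""
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0  # running advantage A_{t+1}; zero beyond the trajectory end

    # Walk backwards so A_{t+1} is available when computing A_t.
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual; the bootstrap value is zeroed when the episode ends at t.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # The same mask stops accumulation from crossing an episode boundary.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae

    # Assumption: returns use the unrounded advantages, then both are rounded.
    returns = [round(a + v, 4) for a, v in zip(advantages, values[:T])]
    advantages = [round(a, 4) for a in advantages]
    return advantages, returns


if __name__ == "__main__":
    # The episode ends at t=1, so V(s_2) and A_2 must not leak into A_1.
    adv, ret = compute_gae(
        rewards=[1.0, 1.0, 1.0],
        values=[1.0, 2.0, 4.0, 8.0],
        dones=[0, 1, 0],
        gamma=0.5,
        lam=0.5,
    )
    print(adv)  # [0.75, -1.0, 1.0]
    print(ret)  # [1.75, 1.0, 5.0]
```

In the usage example, the advantage at t=1 equals its TD residual alone (1 - 2 = -1), because the done flag masks both the bootstrap value V(s_2) and the advantage accumulated from t=2.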