Forward vs Backward TD Updates
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Forward vs Backward TD Updates, Lambda-Return, Bootstrapping, Credit Assignment, Bias-Variance Tradeoff, Accumulating Traces, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Statistical Estimation, Computational Complexity, Temporal Difference Learning, Eligibility Traces, Policy Evaluation, Markov Decision Processes, Function Approximation.
Implement a function that computes value function updates using both the forward view and backward view of TD(lambda), demonstrating that the two perspectives yield equivalent results when applied in an offline (batch) fashion per episode. Forward View (Lambda Return): For each time step in an episode, compute the lambda return by blending n step returns for all lookahead horizons. The lambda return from time step t in an episode of length T is a weighted combination of all n step returns from that step, where the n step return bootstraps from the current value estimate if the episode has not terminated. After computing all lambda returns for an episode, apply the value updates as a batch using the values from the start of that episode. Backward View (Eligibility Traces): Process each episode step by step, maintaining an eligibility trace vector. At each step, compute the one step TD error using the values from the start of that episode, decay and update the eligibility traces, and accumulate the updates. Apply all accumulated updates at the end of the episode. Both views should use the value estimates from the beginning of each episode (offline/batch mode) for computing targets and errors. Process episodes sequentially, updating the value function between episodes. Each episode is a list of (state, reward) tuples representing transitions. The state at index t is the state visited at time t, and the reward at index t is received when transitioning away from that state. The episode terminates after the last transition (the terminal state has value 0). The next state at time t is the state at time t+1 in the episode. The last transition leads to a terminal state. Args: episodes: list of episodes, each a list of (state, reward) tuples n states: int, number of states gamma: float, discount factor lam: float, the lambda parameter (trace decay) alpha: float, learning rate Returns: A dictionary with keys 'forward' and 'backward', each mapping to a list of floats (value estimates for all states), rounded to 4 decimal places.