TD(λ) Forward View for Value Prediction

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: TD(λ) Forward View for Value Prediction, Lambda-Return, Bias-Variance Tradeoff, Bootstrapping, N-step Returns, Exponentially Weighted Moving Average, Reinforcement Learning, Probability Theory, Stochastic Processes, Dynamic Programming, Numerical Analysis, Temporal Difference Learning, Markov Decision Processes, Eligibility Traces, Monte Carlo Methods, Function Approximation.

Implement the forward-view TD(λ) algorithm for estimating the state-value function from a batch of complete episodes.

In the forward view, the λ-return at each time step is a geometrically weighted combination of all possible n-step returns from that step. The weighting decays with λ: shorter-horizon returns get more influence when λ is small, and the λ-return approaches the Monte Carlo return as λ nears 1.

For a time step t with H remaining transitions in the episode, the n-step return uses the first n observed rewards (discounted) and bootstraps from the estimated value of the state reached after n steps. If the n-th transition is terminal, no bootstrapping occurs. The λ-return combines these n-step returns: each shorter return (n < H) receives weight (1 − λ)λ^(n−1), and the longest (H-step) return receives all remaining weight, λ^(H−1).

Your function should:

1. Initialize V to zeros.
2. Process episodes in order.
3. For each episode, compute all λ-returns using the value function at the START of that episode (before any updates from that episode).
4. After computing all λ-returns for an episode, apply the updates V(s) += α (G^λ − V(s)) sequentially, where V(s) in the error term is also taken from the snapshot at the start of the episode.

Args:
    episodes: List of episodes, each a list of (state, reward, next_state, done) tuples
    n_states: Number of states (integers 0 to n_states − 1)
    gamma: Discount factor
    alpha: Learning rate
    lam: The λ parameter controlling the trace decay

Return the final value function as a list of floats rounded to 4 decimal places.
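One possible reading of the specification above, sketched in Python. The function name `td_lambda_forward` and the `Transition` alias are illustrative choices, not part of the problem statement. The sketch assumes that when a state appears more than once in an episode, each update is added to V in turn while every error term still uses the start-of-episode snapshot.

```python
from typing import List, Tuple

# (state, reward, next_state, done); the alias name is illustrative only.
Transition = Tuple[int, float, int, bool]


def td_lambda_forward(
    episodes: List[List[Transition]],
    n_states: int,
    gamma: float,
    alpha: float,
    lam: float,
) -> List[float]:
    V = [0.0] * n_states
    for episode in episodes:
        snapshot = V[:]  # value function frozen at the start of the episode
        T = len(episode)
        targets = []  # (state, lambda-return) pairs, one per time step
        for t in range(T):
            H = T - t  # remaining transitions from step t
            g_lambda = 0.0
            reward_sum = 0.0
            discount = 1.0
            for n in range(1, H + 1):
                _, r, s_next, done = episode[t + n - 1]
                reward_sum += discount * r
                discount *= gamma  # now gamma ** n
                # No bootstrapping when the n-th transition is terminal.
                bootstrap = 0.0 if done else discount * snapshot[s_next]
                g_n = reward_sum + bootstrap
                if n == H:
                    weight = lam ** (n - 1)  # all remaining weight
                else:
                    weight = (1 - lam) * lam ** (n - 1)
                g_lambda += weight * g_n
            targets.append((episode[t][0], g_lambda))
        # Apply updates only after every lambda-return has been computed.
        for s, g in targets:
            V[s] += alpha * (g - snapshot[s])
    return [round(v, 4) for v in V]
```

With λ = 0 only the one-step return carries weight, so the update reduces to TD(0) against the snapshot. With λ = 1 all weight falls on the H-step return, which is the Monte Carlo return for a complete episode.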
