Off-Policy TD(lambda) with Importance-Weighted Eligibility Traces
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Off-Policy TD(lambda) with Importance-Weighted Eligibility Traces, Per-decision Importance Weighting, Accumulating Traces, Retrace Algorithm, Bias-Variance Tradeoff, Bootstrapping, Reinforcement Learning, Stochastic Processes, Statistical Inference, Numerical Optimization, Dynamic Programming, Temporal Difference Learning, Importance Sampling, Eligibility Traces, Off-Policy Evaluation, Variance Reduction Techniques.
Implement off policy temporal difference learning with eligibility traces for tabular state value prediction. This algorithm combines the benefits of eligibility traces (propagating credit backward through time) with importance sampling corrections needed for off policy learning. Your function receives: episodes: A list of episodes. Each episode is a list of (state, action, reward) tuples. The reward at each step is received after taking the given action. Episodes terminate after the last tuple. behavior policy: A 2D list of shape (num states, num actions) giving b(a|s), the probability of selecting action a in state s under the behavior policy. target policy: A 2D list of shape (num states, num actions) giving pi(a|s), the probability of selecting action a in state s under the target policy. num states: Number of states (integers 0 to num states 1). num actions: Number of actions. gamma: Discount factor. lam: The lambda parameter controlling the trace decay. alpha: Learning rate. Your implementation should: 1. Initialize all state values and eligibility traces to zero. 2. Process episodes in order. At the start of each episode, reset the eligibility traces to zero. 3. At each time step within an episode, compute the importance sampling ratio for the current state action pair. 4. Compute the TD error using the current value estimates (the value of the next state is 0 if the step is terminal). 5. Update the eligibility traces by first applying the decay factor (gamma lambda), adding the indicator for the current state, and then scaling by the importance sampling ratio. 6. Update all state values using the TD error weighted by the eligibility traces. Return the final value estimates as a numpy array of shape (num states,).