Q(lambda) with Eligibility Traces
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Q(lambda) with Eligibility Traces, Eligibility Traces, Lambda-Return, Accumulating Traces, Replacing Traces, Backward View, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Functional Analysis, Statistical Estimation, Temporal Difference Learning, Credit Assignment, Markov Decision Processes, Policy Evaluation, Variance Reduction Techniques.
Implement Watkins's Q(lambda) algorithm, which extends Q learning with eligibility traces for faster credit assignment in episodic tasks. The key idea is to maintain an eligibility trace matrix that records which state action pairs have been recently visited. When a TD error is computed, all eligible state action pairs are updated proportionally to their trace values. However, because Q learning is an off policy method, the trace must be handled carefully: whenever the behavior policy takes an action that differs from the greedy action under the current Q values, the eligibility traces for all state action pairs are reset to zero (trace cutting). This ensures correctness of the off policy learning. Your function receives a list of pre collected episodes. Each episode is a list of (state, action, reward, next state, done) tuples representing transitions already taken by a behavior policy. You must process these transitions sequentially, maintaining and updating a Q value table and eligibility traces. Args: episodes: List of episodes. Each episode is a list of (state, action, reward, next state, done) tuples. n states: Number of states (integers 0 to n states 1) n actions: Number of actions (integers 0 to n actions 1) gamma: Discount factor alpha: Learning rate lam: Lambda parameter controlling trace decay Return the Q value table as a nested list of shape (n states, n actions), with values rounded to 4 decimal places. Initialize all Q values and traces to zero. Reset eligibility traces at the start of each episode. Use accumulating traces (increment by 1 on visit). When ties occur in the greedy action selection, use the lowest index action.