Double Q-Learning Algorithm
Foundations & Tabular RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: Double Q-Learning Algorithm, Overestimation Bias, Target Network Decoupling, Action-Value Function, Bellman Optimality Equation, Greedy Policy Selection, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Estimation, Decision Theory, Temporal Difference Learning, Value Function Approximation, Markov Decision Processes, Exploration-Exploitation Trade-off, Policy Evaluation.
Task: Implement Double Q-Learning

In standard Q-learning, the same Q-table is used to both select and evaluate actions, which can lead to overestimation of action values (maximization bias). Double Q-learning addresses this by maintaining two separate Q-value tables.

Implement the double_q_learning function, which processes a sequence of experience tuples and updates two Q-tables. Your function should:

1. Initialize two Q-tables, Q1 and Q2, as zero matrices of shape (num_states, num_actions).
2. For each experience (state, action, reward, next_state, done), along with its corresponding coin flip:
   - If the coin flip is 0, update Q1 using Q2 for evaluation.
   - If the coin flip is 1, update Q2 using Q1 for evaluation.
3. When updating one table, use the other table to evaluate the value of the best action in the next state, where the best action is the one chosen by the table being updated.
4. For terminal transitions (done=True), use only the immediate reward as the target.
5. Return the final (Q1, Q2) as a tuple of NumPy arrays.

Parameters:
- num_states: Number of states in the environment
- num_actions: Number of possible actions
- experiences: List of tuples (state, action, reward, next_state, done)
- alpha: Learning rate
- gamma: Discount factor
- coin_flips: List of 0s and 1s (same length as experiences) determining which table to update

Hint: The key insight is the decoupling of action selection from action evaluation to reduce overestimation bias.
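
One possible solution to the task above is sketched below. It is a minimal illustration, not a reference answer. It assumes the underscore spellings of the function and parameter names, that states and actions are integer indices into the tables, and that ties between equally valued actions are broken by taking the lowest action index (the behaviour of np.argmax), which the problem statement does not specify.

```python
import numpy as np


def double_q_learning(num_states, num_actions, experiences, alpha, gamma, coin_flips):
    """Tabular double Q-learning over a fixed list of experiences (sketch)."""
    q1 = np.zeros((num_states, num_actions))
    q2 = np.zeros((num_states, num_actions))

    for (state, action, reward, next_state, done), flip in zip(experiences, coin_flips):
        # Coin flip 0 updates Q1 and evaluates with Q2; coin flip 1 does the reverse.
        q_update, q_eval = (q1, q2) if flip == 0 else (q2, q1)

        if done:
            # Terminal transition: no bootstrapping, the target is the reward alone.
            target = reward
        else:
            # Select the greedy action with the table being updated...
            # (assumption: ties go to the lowest index, as np.argmax does)
            best_action = int(np.argmax(q_update[next_state]))
            # ...but take its value from the other table.
            target = reward + gamma * q_eval[next_state, best_action]

        # Standard temporal-difference step toward the target.
        q_update[state, action] += alpha * (target - q_update[state, action])

    return q1, q2


# Hypothetical example data, chosen only to exercise both tables.
example_experiences = [
    (0, 1, 1.0, 1, False),  # coin flip 0: updates Q1
    (1, 0, 2.0, 0, True),   # coin flip 1: terminal, updates Q2
    (0, 1, 0.0, 1, False),  # coin flip 0: Q1 bootstraps from Q2
]
q1, q2 = double_q_learning(2, 2, example_experiences, alpha=0.5, gamma=0.9, coin_flips=[0, 1, 0])
```

In this example the third update is where the decoupling shows: Q1 picks the greedy action for state 1, but the bootstrapped value comes from Q2, which already holds the reward learned in the second step. Working through the arithmetic by hand, Q1[0, 1] ends at about 0.7 and Q2[1, 0] at 1.0.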