Dyna-Q+ with Exploration Bonus
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Dyna-Q+ with Exploration Bonus, Dyna-Q Architecture, Exploration Bonus, Experience Replay, State-Action Reward Transition, Q-Value Convergence, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Control Theory, Computational Complexity, Model-Based RL, Temporal Difference Learning, Exploration Strategies, Value Function Approximation, Markov Decision Processes.
Implement the planning phase of the Dyna-Q+ algorithm, which extends standard model-based reinforcement learning by adding an exploration bonus that encourages the agent to revisit state-action pairs that have not been tried for a long time. In a changing environment, state-action pairs that haven't been visited recently may now lead to different (potentially better) outcomes. Dyna-Q+ addresses this by augmenting the simulated reward during planning with a bonus proportional to the square root of the elapsed time since the pair was last visited in the real environment.

Your function dyna_q_plus_planning receives:

Q: a numpy array of shape (n_states, n_actions) containing the current Q-values
model: a dictionary mapping (state, action) tuples to (reward, next_state) tuples, representing the agent's learned environment model
tau: a numpy array of shape (n_states, n_actions) recording the timestep at which each state-action pair was last visited in real experience
current_step: an integer representing the current timestep
kappa: a float, the exploration bonus coefficient
alpha: a float, the learning rate
gamma: a float, the discount factor
planning_samples: a list of (state, action) tuples specifying which state-action pairs to use for planning updates (in order)

For each planning sample, use the model to retrieve the predicted reward and next state, compute the exploration bonus based on how long ago the pair was last visited, and perform a Q-learning-style update using the augmented reward. The tau array should NOT be modified during planning (it only changes during real experience). Return the updated Q-values as a numpy array.
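One possible reading of the task, sketched in Python: for each sample (s, a), the bonus is kappa * sqrt(current_step - tau[s, a]), and the update is Q[s, a] += alpha * (r + bonus + gamma * max over a' of Q[s', a'] - Q[s, a]). The problem statement leaves several details open, so the following are assumptions rather than requirements: the function works on a float copy of Q instead of updating the caller's array in place, samples missing from the model are skipped, and every tau entry is at most current_step.

```python
import numpy as np


def dyna_q_plus_planning(Q, model, tau, current_step, kappa, alpha, gamma,
                         planning_samples):
    """Sketch of the Dyna-Q+ planning phase; returns the updated Q-values."""
    # Assumption: work on a float copy so the caller's array is left untouched.
    Q = np.array(Q, dtype=float)

    for state, action in planning_samples:
        # Assumption: pairs the model has never seen are skipped.
        if (state, action) not in model:
            continue

        reward, next_state = model[(state, action)]

        # Bonus grows with the square root of the time since the last real visit.
        # Assumption: tau[state, action] <= current_step, so elapsed >= 0.
        elapsed = current_step - tau[state, action]
        bonus = kappa * np.sqrt(elapsed)

        # Q-learning-style update on the augmented reward; tau is only read.
        target = reward + bonus + gamma * np.max(Q[next_state])
        Q[state, action] += alpha * (target - Q[state, action])

    return Q
```

As a hand check: with Q all zeros, model = {(0, 0): (1.0, 1)}, tau all zeros, current_step = 4, kappa = 0.5, alpha = 0.5, gamma = 0.9, and a single sample (0, 0), the bonus is 0.5 * sqrt(4) = 1.0, the target is 1.0 + 1.0 + 0.9 * 0 = 2.0, and Q[0, 0] becomes 0.5 * 2.0 = 1.0. Because the samples are processed in order, a second (0, 0) sample would start from the updated value and move Q[0, 0] to 1.5.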