Dyna-Q: Model-Based RL with Planning

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Dyna-Q: Model-Based RL with Planning, Q-Value Iteration, Experience Replay Buffer, Model Simulation, Epsilon-Greedy Policy, Bellman Optimality Equation, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Control Theory, Computational Complexity, Temporal Difference Learning, Model-Based Planning, Markov Decision Processes, Value Function Approximation, Exploration-Exploitation Strategies.

Implement the Dyna Q algorithm, a model based reinforcement learning method that accelerates learning by combining real experience with simulated experience from a learned environment model. The Dyna Q algorithm integrates three processes in each time step: 1. Direct RL : Update Q values from real environment interaction using the Q learning update rule 2. Model Learning : Store observed transitions to build a model of the environment 3. Planning : Use the learned model to generate additional (simulated) updates to Q values Your function dyna q should accept: num states: number of states num actions: number of actions P: transition probability matrix of shape (num states, num actions, num states) R: reward matrix of shape (num states, num actions) terminal states: list of terminal state indices alpha: learning rate gamma: discount factor epsilon: exploration rate for epsilon greedy action selection num episodes: number of training episodes n planning: number of planning steps per real step The environment model should be deterministic: for each observed (state, action) pair, store the most recently observed (reward, next state). During planning, uniformly sample from previously observed (state, action) pairs and apply the Q learning update using the model's predicted reward and next state. Each episode starts from a randomly chosen non terminal state and runs until a terminal state is reached. The function should return the learned Q table as a numpy array of shape (num states, num actions).