Planning with Simulated Experience
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Planning with Simulated Experience, Dyna-Q Architecture, Eligibility Traces, Bellman Optimality Equation, Prioritized Sweeping, Model Predictive Control, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Function Approximation, Computational Complexity, Model-Based RL, Temporal Difference Learning, Monte Carlo Methods, Policy Evaluation, Experience Replay.
Implement a function that performs planning by generating simulated experience from a learned environment model and applying Q-learning updates.

In model-based reinforcement learning, an agent can improve its policy without interacting with the real environment by using a learned model to generate imaginary experience. The model stores previously observed transitions: for each (state, action) pair that was encountered, it records the resulting reward and next state.

During planning, the agent repeatedly:
1. Randomly selects a previously observed (state, action) pair from the model
2. Retrieves the predicted reward and next state from the model
3. Performs a one-step Q-learning update using this simulated experience

This lets the agent propagate value information through the state space using far fewer real environment steps than learning from real experience alone.

Your function should take:
Q: a numpy array of shape (num_states, num_actions) containing the current Q-values
model: a dictionary mapping (state, action) tuples to (reward, next_state) tuples. A next_state value of -1 indicates a terminal transition.
n_planning_steps: the number of simulated updates to perform
gamma: discount factor
alpha: learning rate
seed: random seed for reproducibility

Return the updated Q-value array rounded to 4 decimal places. The original Q array should not be modified.
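
The planning loop described above can be sketched as follows. This is one possible reading of the task, not a reference solution. The update applied is the usual one-step Q-learning rule, Q[s, a] += alpha * (r + gamma * max over a' of Q[s', a'] - Q[s, a]), with the bootstrap term dropped on terminal transitions. Several details are assumptions: the function name `planning_with_simulated_experience` is hypothetical, the terminal marker is taken to be -1 and exposed as a hypothetical `terminal` parameter so it can be changed, and sampling uses `np.random.default_rng(seed)` over the model's keys in insertion order. A grader that draws its random numbers differently would visit the pairs in a different order and could produce different values.

```python
import numpy as np


def planning_with_simulated_experience(Q, model, n_planning_steps,
                                       gamma, alpha, seed, terminal=-1):
    """Apply n_planning_steps simulated one-step Q-learning updates.

    `terminal` is an assumed sentinel for "no next state"; it is not
    part of the parameter list given in the problem statement.
    """
    # Work on a float copy so the caller's array is left untouched.
    Q_new = np.array(Q, dtype=float)

    # Nothing to plan with: return the (rounded) copy unchanged.
    if not model or n_planning_steps <= 0:
        return np.round(Q_new, 4)

    # Assumed RNG scheme: a seeded Generator indexing into the key list.
    rng = np.random.default_rng(seed)
    pairs = list(model.keys())

    for _ in range(n_planning_steps):
        # 1. Pick a previously observed (state, action) pair at random.
        s, a = pairs[rng.integers(len(pairs))]

        # 2. Ask the model for the predicted reward and next state.
        r, s_next = model[(s, a)]

        # 3. One-step Q-learning update; no bootstrapping past a terminal.
        if s_next == terminal:
            target = r
        else:
            target = r + gamma * np.max(Q_new[s_next])
        Q_new[s, a] += alpha * (target - Q_new[s, a])

    return np.round(Q_new, 4)
```

For instance, with a 2 x 2 array of zeros, a model containing only the terminal transition (0, 1) -> (1.0, -1), alpha = 0.5 and two planning steps, Q[0, 1] moves from 0 to 0.5 and then to 0.75 whatever the seed, because there is only one pair to sample.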