Prioritized Sweeping Algorithm
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Prioritized Sweeping Algorithm, Bellman Error Magnitude, Priority Queue Management, Successor State Backtracking, Model-Based Dyna Architecture, State-Action Value Convergence, Reinforcement Learning, Dynamic Programming, Search and Planning, Stochastic Processes, Computational Complexity, Temporal Difference Learning, Model-Based RL, Value Function Approximation, Asynchronous Updates, Experience Replay.
Task: Implement Prioritized Sweeping for Model Based RL In model based reinforcement learning, an agent can learn a model of the environment and use it to plan updates without additional real interaction. A naive approach (like Dyna Q) performs random planning updates. Prioritized sweeping improves efficiency by focusing updates on state action pairs whose values would change the most. The core idea is to maintain a priority queue that tracks how "urgent" each update is. When a Q value changes significantly, all predecessor state action pairs that transition into the updated state may also need updating, so they are added to the queue with appropriate priorities. Implement prioritized sweeping that processes a sequence of real experiences and performs prioritized model based planning after each one. Your function should: 1. Initialize a Q table of zeros with shape (num states, num actions) 2. Maintain a learned model mapping (state, action) to (reward, next state, done) 3. Track predecessors : for each state s, which (s bar, a bar) pairs lead to s according to the current model 4. After each real experience (s, a, r, s next, done): Update the model Compute the absolute TD error as the priority If the priority exceeds theta, insert (s, a) into the priority queue (keeping the maximum priority if already present) Perform up to n planning planning steps by repeatedly popping the highest priority pair, updating its Q value, and propagating priority to its predecessors 5. Return the final Q table as a numpy array Parameters: num states: Number of states num actions: Number of actions experiences: List of (state, action, reward, next state, done) tuples alpha: Learning rate for Q value updates gamma: Discount factor theta: Minimum priority threshold (only update if priority theta) n planning: Maximum number of planning steps per real experience Returns: Q table as numpy array of shape (num states, num actions)