Blocking Maze with Model Updates

RL Environments, Games & Applications DS practice problem on Onlearn.

Difficulty: medium.

Topics: Blocking Maze with Model Updates, Bellman Equation, A* Search, Incremental Learning, Adjacency Matrix, Sparse Reward Shaping, Reinforcement Learning, Dynamic Programming, Graph Theory, Stochastic Optimization, Software Engineering, Markov Decision Processes, Pathfinding Algorithms, Online Model Updating, State Space Representation, Reward Function Design.

Implement a function blocking maze dyna q that simulates a Dyna Q agent navigating a gridworld maze where the environment changes mid simulation: new walls appear at a specified timestep, invalidating parts of the agent's learned model. The agent operates in a grid of size rows x cols. States are indexed as row cols + col. The agent can take 4 actions: 0=up, 1=down, 2=left, 3=right. If an action would move the agent into a wall or off the grid, the agent stays in place. The agent receives a reward of 1.0 upon reaching the goal cell and 0.0 otherwise. Upon reaching the goal, the episode resets (agent returns to start) and a new episode begins. At the specified change step, additional walls appear in the grid. Crucially, the agent's internal model (which maps previously experienced (state, action) pairs to their observed (next state, reward) outcomes) is NOT automatically updated. The model only gets corrected when the agent re experiences those transitions in the changed environment. Dyna Q Algorithm per step: 1. Select action using epsilon greedy (use np.argmax for greedy tie breaking) 2. Take action in the real environment 3. Update Q value using Q learning update rule 4. Record the transition (s, a) (s', r) in the model (overwriting previous entry for that (s, a) pair) 5. Perform n planning planning steps by sampling uniformly from previously seen (s, a) pairs in the model and applying Q learning updates using model predictions Parameters: rows, cols: Grid dimensions initial walls: List of [row, col] pairs for initial wall positions added walls: List of [row, col] pairs for walls that appear at change step start, goal: [row, col] positions change step: Timestep (0 indexed) at which added walls are added gamma: Discount factor alpha: Learning rate epsilon: Exploration rate n planning: Number of planning steps per real step n steps: Total simulation steps seed: Random seed for np.random.seed Returns: A dictionary with: "episodes completed": Number of times the agent reached the goal "cumulative reward": Total reward accumulated (float, rounded to 4 decimal places) "model entries": Number of unique (state, action) pairs stored in the model "stale entries": Number of model entries whose stored (next state, reward) no longer matches the current (post change) environment