Shortcut Maze and Exploration
RL Environments, Games & Applications DS practice problem on Onlearn.
Difficulty: medium.
Topics: Shortcut Maze and Exploration, Epsilon-Greedy Policy, Bellman Equation, A* Search Heuristic, Q-Table Initialization, Discount Factor Gamma, Reinforcement Learning, Markov Decision Processes, Dynamic Programming, Search Algorithms, Exploration Strategies, Temporal Difference Learning, Value Function Approximation, Heuristic Pathfinding, Policy Gradient Methods, State-Space Representation.
Implement a function shortcut maze exploration that simulates a Dyna Q+ agent navigating a grid maze where the environment changes partway through training a wall is removed, opening a shortcut. The agent must use exploration bonuses to discover the shortcut rather than persisting with a previously learned longer path. Environment: Grid world of size grid rows x grid cols with blocked cells (walls) 4 actions: 0=up, 1=down, 2=left, 3=right Moving into a wall or out of bounds keeps the agent in its current cell Reaching the goal cell yields reward +1.0 and resets the agent to start; all other transitions yield 0.0 At step change step, the set of blocked cells changes from walls phase1 to walls phase2, potentially opening a shorter path Algorithm (Dyna Q+ with exploration bonus): The agent maintains: A Q table initialized to 0 for all state action pairs A model mapping previously visited (state, action) pairs to (next state, reward) A record of the last timestep each (state, action) pair was actually tried in the real environment At each of total steps timesteps: 1. Select an action using epsilon greedy over current Q values (break ties among best actions randomly) 2. Execute the action in the environment (using the current phase walls) 3. Update Q using the standard Q learning rule: Q(s,a) += alpha (r + gamma max a' Q(s',a') Q(s,a)) 4. Store the transition in the model and record the visit timestep 5. Perform n planning planning steps: sample a previously visited (s,a) uniformly at random from the model, retrieve (s',r) from the model, compute an exploration bonus kappa sqrt(current step last visit step), and update Q(s,a) += alpha (r + bonus + gamma max a' Q(s',a') Q(s,a)) 6. If the goal was reached, reset agent to start (beginning a new episode) Use np.random.RandomState(seed) for all randomness. Call rng.random() for epsilon check, rng.randint(n) for random action and tie breaking, and rng.randint(n) for planning sample selection. Inputs: grid rows, grid cols: integers defining the grid size walls phase1: list of [r, c] pairs for blocked cells before the change walls phase2: list of [r, c] pairs for blocked cells after the change change step: integer timestep at which the maze transitions start: list [r, c] for the start position goal: tuple (r, c) for the goal position total steps: integer number of timesteps to simulate alpha: learning rate gamma: discount factor epsilon: exploration rate for epsilon greedy kappa: exploration bonus coefficient n planning: number of planning steps per real step seed: random seed Output: Return a dictionary with: 'episodes completed': integer count of times the agent reached the goal 'total reward': float total reward accumulated 'cumulative reward': list of floats showing cumulative reward after each step