Blocking Maze Environment for Testing Dyna-Q+

RL Environments, Games & Applications DS practice problem on Onlearn.

Difficulty: medium.

Topics: Blocking Maze Environment for Testing Dyna-Q+, Dyna-Q+ Architecture, Heuristic Exploration Bonus, State Transition Dynamics, Reward Function Engineering, Time-Step Dependent Value Estimation, Reinforcement Learning, Environment Design, Algorithmic Evaluation, State Space Modeling, Dynamic Systems, Model-Based RL, Gridworld Navigation, Exploration Strategies, Temporal Difference Learning, Non-Stationary Environments.

Implement a function run blocking maze that simulates a Dyna Q+ agent navigating a gridworld maze where the wall configuration changes at a specified timestep, modeling a non stationary environment. Environment: A grid of size grid h x grid w where the agent moves using 4 actions: up (0), down (1), left (2), right (3), corresponding to direction deltas [( 1,0), (1,0), (0, 1), (0,1)]. Moving into a wall or out of bounds keeps the agent in its current cell. The agent receives a reward of +1.0 upon reaching the goal cell, and 0.0 otherwise. Upon reaching the goal, the episode ends and the agent is reset to the start position. Before timestep change step, walls are given by walls phase1. At timestep change step (inclusive), walls switch to walls phase2. Dyna Q+ Agent: Q values initialized to 0 for all state action pairs. Action selection: epsilon greedy. With probability epsilon, choose a random action uniformly. Otherwise, choose the action with the highest Q value (break ties randomly using np.random.choice). Q learning update after each real step: Q(s,a) += alpha (r + gamma max a' Q(s',a') Q(s,a)). Maintain a model dictionary storing (reward, next state) for each visited (state, action) pair. Track last visit[s,a] = the most recent timestep at which (s,a) was experienced in the real environment. Initialize all to 0. Planning: after each real step at time t, perform n planning planning updates. Each planning update samples a random previously visited (s,a) pair from the model, retrieves (r, s'), computes tau = t last visit[s,a], adds an exploration bonus kappa sqrt(tau) to the reward, and performs a Q learning style update. Parameters: grid h, grid w: grid dimensions walls phase1, walls phase2: lists of [row, col] wall positions for each phase change step: timestep at which walls switch (walls change at the start of this timestep, before action selection) start, goal: [row, col] positions gamma: discount factor alpha: learning rate epsilon: exploration rate n planning: number of planning steps per real step kappa: exploration bonus coefficient num steps: total timesteps to simulate seed: random seed (set once at the beginning with np.random.seed(seed)) Returns: A tuple (cumulative reward, episodes completed) where cumulative reward is a float (rounded to 4 decimal places) representing total reward accumulated, and episodes completed is an integer counting how many times the goal was reached.