RL Algorithm Sanity Check Environments

RL Environments, Games & Applications DS practice problem on Onlearn.

Difficulty: medium.

Topics: RL Algorithm Sanity Check Environments, Bellman Equation Residuals, Epsilon-Greedy Scheduling, Discount Factor Sensitivity, Experience Replay Buffer Dynamics, Reward Function Sparsity, Markov Decision Processes, Control Theory, Dynamic Programming, Stochastic Optimization, Simulation Modeling, Value Function Approximation, Policy Gradient Methods, Temporal Difference Learning, Model-Based Planning, Exploration-Exploitation Strategies.

In reinforcement learning, it is critical to verify that an algorithm works correctly on simple, small environments with known analytical solutions before applying it to complex tasks. These are called "sanity check" environments. Implement a function solve sanity env that takes a deterministic MDP specification and a discount factor, runs value iteration to convergence, and returns the optimal state value function and the optimal policy. The MDP is specified as a dictionary transitions where: Keys are integer state identifiers Values are dictionaries mapping action integers to tuples (next state, reward, done) done is a boolean indicating whether the transition leads to a terminal outcome The function should: 1. Initialize all state values to zero 2. Iteratively update state values using the Bellman optimality equation until convergence (maximum change in any state value is below tol) or max iter iterations are reached 3. Extract the optimal (greedy) policy from the converged values. In case of ties, select the action with the smallest index. 4. Return a dictionary with key 'V' mapping to a dictionary of state values (each rounded to 4 decimal places) and key 'policy' mapping to a dictionary of optimal actions per state.