Generalized Policy Iteration (GPI) Simulation

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Generalized Policy Iteration (GPI) Simulation, State-Action Value Function, Bellman Expectation Backup, Greedy Policy Update, Contraction Mapping Theorem, Generalized Policy Iteration Loop, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Numerical Optimization, Computational Complexity, Markov Decision Processes, Bellman Equations, Policy Evaluation, Policy Improvement, Value Function Approximation.

Implement a Generalized Policy Iteration (GPI) simulator for finite Markov Decision Processes. GPI is the overarching idea in reinforcement learning where policy evaluation and policy improvement alternate and interact. Unlike pure policy iteration (which runs evaluation to convergence before improving) or pure value iteration (which does a single evaluation backup), GPI allows a configurable number of evaluation sweeps between improvement steps. Your function should: 1. Initialize the value function to zeros for all states and the policy to action 0 for all states. 2. For each GPI cycle: a. Perform the specified number of synchronous policy evaluation sweeps (create a new value array each sweep, update all states, then replace the old values). Terminal states always have value 0. b. Perform one policy improvement step where the policy is updated greedily (argmax) with respect to the current value function. For ties, select the action with the smallest index. Skip terminal states during improvement. 3. Return the final value function (rounded to 4 decimal places) and the final policy. The value update for a state s under action a follows the Bellman expectation equation using the full transition and reward model. Args: num states: int, number of states num actions: int, number of actions transitions: numpy array of shape (S, A, S), transition probabilities P(s'|s,a) rewards: numpy array of shape (S, A, S), expected reward R(s,a,s') discount: float, discount factor gamma num iterations: int, number of GPI cycles eval sweeps: int, number of evaluation sweeps per cycle terminal states: list of terminal state indices