Greedy Policy Improvement

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Greedy Policy Improvement, Argmax Operator, Greedy Action Selection, Policy Stability, Contraction Mapping, Bellman Optimality Operator, Dynamic Programming, Stochastic Processes, Functional Analysis, Statistical Decision Theory, Computational Complexity, Bellman Equations, Markov Decision Processes, Temporal Difference Learning, Policy Iteration Algorithms, Value Function Approximation.

Implement the greedy policy improvement step used in dynamic programming for reinforcement learning. Given: A state value function V as a dictionary mapping states to values A transition model transitions as a dictionary mapping (state, action) to a list of tuples (probability, next state, reward) representing stochastic dynamics A discount factor gamma Compute: 1. The action value function Q(s, a) for every state action pair present in the transitions 2. The improved greedy policy that selects the best action in each state according to the computed Q values The function should return a tuple (policy, Q) where: policy is a dictionary mapping each non terminal state (that has actions) to its greedy action Q is a dictionary mapping (state, action) pairs to their computed action values Tie breaking rule : When multiple actions share the same maximum Q value, select the action that comes first in alphabetical (lexicographic) order.