Bellman Expectation Equation for Action-Value Function

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Bellman Expectation Equation for Action-Value Function, Action-Value Function, Recursive Decomposition, Transition Probability Matrix, Discount Factor, Fixed-Point Iteration, Reinforcement Learning, Dynamic Programming, Probability Theory, Stochastic Processes, Functional Analysis, Markov Decision Processes, Temporal Difference Learning, Bellman Operators, Policy Evaluation, Expectation Operators.

Implement a function that computes the action value function Q^pi(s, a) for all state action pairs under a given policy using the Bellman expectation equation for action values. The action value function Q^pi(s, a) represents the expected cumulative discounted reward obtained by taking action a in state s and then following policy pi thereafter. Your function should accept: states: a list of state identifiers actions: a list of action identifiers transition probs: a dictionary mapping (s, a, s') tuples to transition probabilities P(s'|s, a) rewards: a dictionary mapping (s, a, s') tuples to immediate rewards R(s, a, s') policy: a dictionary mapping (s, a) tuples to the probability pi(a|s) of selecting action a in state s gamma: the discount factor (0 <= gamma < 1) theta: a small convergence threshold (default 1e 6) The function should iteratively evaluate the Q values until the maximum change across all state action pairs is less than theta. Return a dictionary mapping each (state, action) tuple to its Q value, rounded to 4 decimal places. Note: If a (s, a, s') key is missing from transition probs or rewards, treat its value as 0.