Bellman Expectation Equation for State-Value Function
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Bellman Expectation Equation for State-Value Function, Expected Return, Discount Factor, State-Transition Probability, Recursive Decomposition, Fixed-Point Iteration, Reinforcement Learning, Probability Theory, Dynamic Programming, Stochastic Processes, Functional Analysis, Markov Decision Processes, Temporal Difference Learning, Bellman Operators, Policy Evaluation, Value Function Approximation.
Implement a function bellman_expectation_value that computes the state-value function $V^\pi(s)$ for all states under a given stochastic policy $\pi$ by solving the Bellman expectation equation.

The Bellman expectation equation relates the value of a state to the values of its successor states under a particular policy. Given the MDP components and a fixed policy, the state-value function satisfies a system of linear equations, one per state, that can be solved exactly.

Parameters:
P: A NumPy array of shape (num_states, num_actions, num_states) representing transition probabilities. P[s, a, s'] is the probability of transitioning from state s to state s' when taking action a.
R: A NumPy array of shape (num_states, num_actions, num_states) representing rewards. R[s, a, s'] is the reward received when transitioning from state s to state s' via action a.
policy: A NumPy array of shape (num_states, num_actions) representing the stochastic policy. policy[s, a] is the probability of selecting action a in state s.
gamma: A float representing the discount factor (between 0 and 1 inclusive).

Returns:
A NumPy array of shape (num_states,) containing the state value for each state under the given policy.

Notes:
Use only NumPy. The function should solve for the exact values rather than use iterative approximation.
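
One possible approach, offered as a minimal sketch rather than the reference solution, is to write the Bellman expectation equation in matrix form and solve it directly. With $P^\pi(s, s') = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$ and $r^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\, R(s, a, s')$, the equation $V^\pi = r^\pi + \gamma P^\pi V^\pi$ rearranges to $(I - \gamma P^\pi) V^\pi = r^\pi$. The function name bellman_expectation_value is assumed from the problem statement; the intermediate names P_pi, r_pi and A are illustrative choices. The sketch also assumes gamma < 1: when gamma equals 1 and every row of P sums to 1, the matrix $I - P^\pi$ is singular, so the system has no unique solution and the solve may fail or return meaningless values.

```python
import numpy as np


def bellman_expectation_value(P, R, policy, gamma):
    """Solve (I - gamma * P_pi) V = r_pi for V (sketch; assumes gamma < 1)."""
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    policy = np.asarray(policy, dtype=float)
    num_states = P.shape[0]

    # Policy-averaged transition matrix:
    # P_pi[s, t] = sum_a policy[s, a] * P[s, a, t]
    P_pi = np.einsum("sa,sat->st", policy, P)

    # Expected one-step reward under the policy:
    # r_pi[s] = sum_a policy[s, a] * sum_t P[s, a, t] * R[s, a, t]
    r_pi = np.einsum("sa,sat,sat->s", policy, P, R)

    # Exact solution of the linear system (no iteration).
    A = np.eye(num_states) - gamma * P_pi
    return np.linalg.solve(A, r_pi)


# Small hypothetical MDP: 2 states, 2 actions, reward 1 on every transition.
P = np.array([
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 0 under actions 0, 1
    [[1.0, 0.0], [0.5, 0.5]],   # transitions from state 1 under actions 0, 1
])
R = np.ones((2, 2, 2))
policy = np.array([[0.5, 0.5], [0.5, 0.5]])

V = bellman_expectation_value(P, R, policy, gamma=0.9)
print(V)
```

Because every transition in this toy example pays a reward of 1, each state's value should come out close to 1 / (1 - 0.9) = 10, which makes a convenient sanity check. Using np.linalg.solve instead of explicitly inverting the matrix is a common choice for numerical stability, though either satisfies the "exact, non-iterative" requirement.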