Value Functions for Options
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Value Functions for Options, Semi-Markov Decision Process, Intra-option Policy, Termination Condition, Option-Value Function, Initiation Set, Reinforcement Learning, Optimal Control, Stochastic Processes, Dynamic Programming, Hierarchical Planning, Temporal Abstraction, Bellman Equations, Markov Decision Processes, Value Function Approximation, Policy Evaluation.
Implement a function that computes value functions for temporally extended actions (options) in a Markov Decision Process using iterative Bellman-style updates. An option consists of an intra-option policy (which primitive actions to take) and a termination function (the probability of stopping the option in each state). The key insight is that when an option enters a new state, it either continues executing (with probability 1 - beta) or terminates (with probability beta), at which point a new option is selected.

Three interrelated value functions must be computed:
- The option-value function Q_O(s, o): the value of initiating option o in state s
- The state-value function V_O(s): the value of a state under the greedy policy over options
- The upon-arrival function U(o, s'): the value when option o arrives in state s', accounting for possible continuation or termination

Given:
- R: numpy array of shape (S, A) containing expected rewards for each state-action pair
- P: numpy array of shape (S, A, S) containing transition probabilities
- options: a list of dictionaries, each with keys:
    - 'policy': numpy array of shape (S, A) giving the intra-option policy probabilities
    - 'termination': numpy array of shape (S,) giving the termination probability in each state
- gamma: discount factor (float)
- num_iterations: number of value iteration sweeps (int)

Perform iterative updates starting from zero-initialized value functions. In each iteration, first derive the state-value function and upon-arrival values from the current option-value estimates, then update the option-value function using the Bellman equation for options.

Return a dictionary with:
- 'Q_options': nested list of shape (S, num_options), rounded to 4 decimal places
- 'V': list of shape (S,), rounded to 4 decimal places
- 'U': nested list of shape (num_options, S), rounded to 4 decimal places

All values should reflect the state after the final iteration.
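One way to read the update described above, written out per sweep with pi_o the intra-option policy and beta_o the termination function of option o:

    V(s)     = max over o of Q(s, o)
    U(o, s') = (1 - beta_o(s')) * Q(s', o) + beta_o(s') * V(s')
    Q(s, o)  = sum over a of pi_o(a | s) * [ R(s, a) + gamma * sum over s' of P(s, a, s') * U(o, s') ]

The following is a minimal sketch of that loop, not a reference solution. Several things in it are assumptions rather than facts from the problem statement: the function name `option_value_iteration` is hypothetical; the parameter `num_iterations` and the output key `'Q_options'` are spelled with underscores; every option is treated as available in every state (no initiation sets are given); at least one option is supplied; and the returned `V` and `U` are the ones derived during the final sweep, without recomputing them from the final `Q`. If the intended reading is that `V` and `U` should be recomputed after the last update, the first two lines of the loop body would need to run once more after the loop.

```python
import numpy as np


def option_value_iteration(R, P, options, gamma, num_iterations):
    """Iterate the Bellman-style update for options (hypothetical helper name)."""
    R = np.asarray(R, dtype=float)  # (S, A) expected rewards
    P = np.asarray(P, dtype=float)  # (S, A, S) transition probabilities
    num_states = R.shape[0]
    num_options = len(options)  # assumed to be at least 1

    # Stack per-option arrays so that all options update in one vectorized step.
    pi = np.stack([np.asarray(o["policy"], dtype=float) for o in options])  # (O, S, A)
    beta = np.stack([np.asarray(o["termination"], dtype=float) for o in options])  # (O, S)

    Q = np.zeros((num_states, num_options))
    V = np.zeros(num_states)
    U = np.zeros((num_options, num_states))

    for _ in range(num_iterations):
        # Greedy state value over options (assumes every option can start anywhere).
        V = Q.max(axis=1)
        # Upon arrival: continue with prob. 1 - beta, terminate and re-select with prob. beta.
        U = (1.0 - beta) * Q.T + beta * V[None, :]
        # Expected upon-arrival value for each (option, state, action).
        next_value = np.einsum("sat,ot->osa", P, U)
        # Average the one-step backup over the intra-option policy.
        Q = np.einsum("osa,osa->so", pi, R[None, :, :] + gamma * next_value)

    return {
        "Q_options": np.round(Q, 4).tolist(),
        "V": np.round(V, 4).tolist(),
        "U": np.round(U, 4).tolist(),
    }


# Tiny check: one state, one action, one option that always terminates.
demo = option_value_iteration(
    R=np.array([[1.0]]),
    P=np.array([[[1.0]]]),
    options=[{"policy": np.array([[1.0]]), "termination": np.array([1.0])}],
    gamma=0.5,
    num_iterations=3,
)
# demo == {"Q_options": [[1.75]], "V": [1.5], "U": [[1.5]]}
```

In the demo, Q goes 1, 1.5, 1.75 across the three sweeps, while V and U lag one sweep behind at 1.5. That lag is a direct consequence of the ordering assumption stated above.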