Intra-Option Q-Learning for Temporal Abstraction
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Intra-Option Q-Learning for Temporal Abstraction, Intra-Option Bellman Equation, Semi-Markov Decision Process, Termination Function, Option Initiation Set, Off-Policy Learning, Reinforcement Learning, Control Theory, Stochastic Processes, Dynamic Programming, Robotics Planning, Hierarchical Reinforcement Learning, Markov Decision Processes, Temporal Abstraction, Policy Gradient Methods, Value Function Approximation.
Implement intra-option Q-learning, an off-policy method that learns about multiple temporally extended actions (options) simultaneously from a single stream of experience.

In temporal abstraction, an option is defined by three components:
- An initiation set: the set of states where the option can begin execution
- A policy: a distribution over primitive actions in each state
- A termination function: the probability of the option terminating in each state

The key insight of intra-option learning is that when the agent takes a primitive action in some state, we can update the option value function Q(s, o) for every option o whose policy would have selected that same action with positive probability, not just the option currently being executed.

For each transition (s, a, r, s') and each option o whose policy assigns positive probability to action a in state s, compute a continuation value U(s', o). It accounts for the possibility that option o either continues (with probability 1 - beta) or terminates (with probability beta) in the next state, where beta is the termination probability of o at s'. On continuation, the value is Q(s', o). Upon termination, the value is determined by the best available option at s'. U(s', o) then serves as the next-state value in the update of Q(s, o).

When computing the best available option at a state, only consider options whose initiation set includes that state.

The function should process transitions sequentially, updating Q-values after each transition so that later transitions use the updated values.
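The statement describes the continuation value in words but does not spell out the update itself. As a sketch, the standard intra-option Q-learning form can be written as below. The symbols \beta_o (termination function of option o) and I_{o'} (initiation set of option o') are notation introduced here, and the use of the learning rate alpha and discount gamma in this particular way is an assumption based on the usual formulation, not something the statement confirms.

```latex
\begin{aligned}
U(s', o) &= \bigl(1 - \beta_o(s')\bigr)\, Q(s', o)
            + \beta_o(s') \max_{o' \,:\, s' \in I_{o'}} Q(s', o') \\
Q(s, o)  &\leftarrow Q(s, o) + \alpha \bigl[\, r + \gamma\, U(s', o) - Q(s, o) \,\bigr]
\end{aligned}
```

The second line is applied to every option o whose policy gives action a positive probability in state s.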
Your function receives:
- num_states: number of states
- num_actions: number of primitive actions
- num_options: number of options
- option_policies: nested list [option][state][action] giving the probability of each action under the option's policy
- option_terminations: nested list [option][state] giving the termination probability
- initiation_sets: list of lists, where each inner list contains the states where that option can initiate
- transitions: list of (state, action, reward, next_state) tuples
- alpha: learning rate
- gamma: discount factor

Return a 2D list of Q-values with shape (num_states, num_options), rounded to 4 decimal places.
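One possible solution is sketched below in Python. It is a minimal sketch, not a reference implementation. It assumes several things the statement leaves open: the function name intra_option_q_learning is hypothetical, parameter names are written with underscores, Q-values start at zero, the best-option value is 0 when no option can be initiated in the next state, and all options updated on one transition are computed from the same snapshot of Q.

```python
from typing import List, Sequence, Tuple

Transition = Tuple[int, int, float, int]  # (state, action, reward, next_state)


def intra_option_q_learning(
    num_states: int,
    num_actions: int,  # unused here; kept to match the described signature
    num_options: int,
    option_policies: Sequence[Sequence[Sequence[float]]],
    option_terminations: Sequence[Sequence[float]],
    initiation_sets: Sequence[Sequence[int]],
    transitions: Sequence[Transition],
    alpha: float,
    gamma: float,
) -> List[List[float]]:
    # Assumption: Q-values start at zero (the statement gives no initial values).
    q = [[0.0] * num_options for _ in range(num_states)]
    # Sets give fast membership checks for "can option o start in state s?".
    can_start = [set(states) for states in initiation_sets]

    for s, a, r, s_next in transitions:
        # Best value among options that may be initiated in s_next.
        # Assumption: if no option is available there, use 0.
        available = [q[s_next][o] for o in range(num_options)
                     if s_next in can_start[o]]
        best_next = max(available) if available else 0.0

        # Compute every target before writing, so all options updated on this
        # transition see the same Q snapshot (an assumption; it only matters
        # when s == s_next).
        updates = []
        for o in range(num_options):
            if option_policies[o][s][a] <= 0.0:
                continue  # option o could not have chosen action a in s
            beta = option_terminations[o][s_next]
            u = (1.0 - beta) * q[s_next][o] + beta * best_next
            updates.append((o, r + gamma * u))

        for o, target in updates:
            q[s][o] += alpha * (target - q[s][o])

    return [[round(v, 4) for v in row] for row in q]


# Small demo: option 0 always picks action 0, option 1 always picks action 1.
demo = intra_option_q_learning(
    num_states=2, num_actions=2, num_options=2,
    option_policies=[[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]],
    option_terminations=[[0.5, 0.5], [1.0, 1.0]],
    initiation_sets=[[0, 1], [0, 1]],
    transitions=[(0, 0, 1.0, 1), (1, 1, 2.0, 0)],
    alpha=0.5, gamma=0.9,
)
print(demo)  # [[0.5, 0.0], [0.0, 1.225]]
```

In the demo, the first transition only updates option 0 (target 1.0, so Q(0, 0) becomes 0.5). The second only updates option 1, which terminates with probability 1 at state 0 and therefore bootstraps from the best available value there: 2.0 + 0.9 * 0.5 = 2.45, giving Q(1, 1) = 1.225.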