Temporal Abstraction with Options
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Temporal Abstraction with Options, Intra-option Policy, Termination Condition, Semi-Markovian Transition Dynamics, Option Initiation Set, Bellman Operator for Options, Reinforcement Learning, Control Theory, Stochastic Processes, Computational Complexity, Dynamical Systems, Hierarchical Reinforcement Learning, Markov Decision Processes, Optimal Control, Semi-Markov Decision Processes, Dynamic Programming.
In hierarchical reinforcement learning, an option extends the notion of a primitive action to a temporally extended course of action. Each option $o$ is defined by: Initiation set $\mathcal{I} o$: the set of states where the option can be initiated. Internal policy $\pi o$: a mapping from states to actions that the option follows. Termination function $\beta o(s)$: the probability that the option terminates upon entering state $s$. When an agent selects option $o$ in state $s \in \mathcal{I} o$, it executes action $a = \pi o(s)$, transitions to $s'$, and then the option terminates with probability $\beta o(s')$. If it does not terminate, the option continues from $s'$. Your task is to implement SMDP (Semi Markov Decision Process) value iteration with options . Specifically, write a function that: 1. Computes the option models : For each option $o$ and each state $s$ in its initiation set, compute $R(s, o)$ (the expected discounted cumulative reward until the option terminates) and $F(s' | s, o)$ (the discounted probability of the option terminating in state $s'$). These can be obtained by solving a system of linear equations derived from the option's internal dynamics. 2. Performs SMDP value iteration : Using the computed option models, iteratively update the value function until convergence. The environment is specified as a discrete MDP with deterministic or stochastic transitions. Inputs: n states: number of states (labeled 0 to n states 1) terminal states: set of terminal/absorbing states (value = 0) transitions: dict where transitions[s][a] is a list of (next state, probability, reward) tuples options: list of dicts, each with keys 'initiation' (set), 'policy' (dict: state action), 'termination' (dict: state float beta) gamma: discount factor theta: convergence threshold (default 1e 10) Returns: V: list of length n states with optimal values (rounded to 4 decimal places) policy: list of length n states with the index of the optimal option for each state (0 for terminal states)