Semi-Markov Q-Learning
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Semi-Markov Q-Learning, Semi-Markov Decision Process, Variable Time-Step Transitions, Options Framework, Bellman Optimality Equation, Discounted Cumulative Reward, Reinforcement Learning, Stochastic Processes, Control Theory, Dynamic Programming, Robotics Planning, Temporal Abstraction, Markov Decision Processes, Value Function Approximation, Continuous-Time Systems, Policy Optimization.
Implement the Semi Markov Q Learning algorithm for episodic tasks where actions have variable durations. In standard Q learning, each transition is assumed to take exactly one time step, and the discount factor applied to future values is always gamma. However, in many real world problems, different actions take different amounts of time to complete. A robot navigating to a nearby room may take 2 seconds, while navigating to a distant room may take 30 seconds. The standard MDP framework does not naturally capture this temporal variability. Semi Markov Decision Processes (SMDPs) extend the standard MDP framework by allowing each transition to have a variable holding time (duration). When an action takes longer, the future reward should be discounted more heavily to reflect the time elapsed. Your function receives a list of pre collected episodes. Each episode is a list of (state, action, reward, next state, duration, done) tuples. The duration field indicates how many time units the action took. Process transitions sequentially, updating Q values after each transition. Initialize all Q values to zero. Args: episodes: List of episodes. Each episode is a list of (state, action, reward, next state, duration, done) tuples. n states: Number of states (integers 0 to n states 1) n actions: Number of actions (integers 0 to n actions 1) gamma: Per unit time discount factor alpha: Learning rate Return the Q value table as a nested list of shape (n states, n actions), with values rounded to 4 decimal places.