Semi-Markov Decision Process Simulation

Planning, Dynamics & Decision Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Semi-Markov Decision Process Simulation, Holding Time Distributions, Semi-Markov Reward Functions, State Transition Kernels, Discounted Cumulative Rewards, Embedded Discrete-Time Chains, Stochastic Processes, Reinforcement Learning, Control Theory, Computational Simulation, Dynamic Programming, Continuous-Time Markov Chains, Temporal Difference Learning, Optimal Control Policies, Monte Carlo Methods, Bellman Equation Variants.

Implement a function that simulates a trajectory through a Semi Markov Decision Process (SMDP) and computes the time discounted return. In a standard MDP, each transition takes exactly one time step. In an SMDP, each transition has a variable holding time (duration), and the discount factor is applied based on the actual elapsed time rather than the number of steps. This makes SMDPs suitable for modeling real world problems where actions take different amounts of time. Your function should: 1. Accept an SMDP specification (transitions with durations), a deterministic policy, a continuous time discount factor, an initial state, a set of terminal states, a maximum step limit, and a random seed. 2. Simulate a single trajectory by repeatedly selecting actions according to the policy and sampling transitions stochastically. 3. Track the cumulative elapsed time across transitions. 4. Compute the discounted return where each reward is discounted by gamma raised to the cumulative time at which the reward was received (not the step count). 5. Stop when a terminal state is reached or the maximum number of steps is exceeded. The transitions dictionary maps (state, action) tuples to lists of (probability, next state, reward, duration) tuples. Return a dictionary with: 'trajectory': list of dicts, each with keys 'state', 'action', 'reward', 'duration', 'cumulative time' (all floats rounded to 4 decimal places) 'discounted return': the time discounted return (rounded to 4 decimal places) 'undiscounted return': sum of all rewards (rounded to 4 decimal places) 'total time': total elapsed time (rounded to 4 decimal places) 'num steps': integer count of transitions taken