Markov Decision Process Simulator

RL Environments, Games & Applications DS practice problem on Onlearn.

Difficulty: medium.

Topics: Markov Decision Process Simulator, Transition Probability Matrix, Reward Function, Discount Factor, State-Action Tuple, Terminal Condition, Reinforcement Learning, Probability Theory, Software Engineering, Dynamic Programming, Data Structures, Markov Processes, State Space Modeling, Object-Oriented Design, Bellman Equations, Simulation Frameworks.

Implement a function simulate_mdp that simulates episodes through a Markov Decision Process (MDP) and computes the discounted return for each episode. Given the full specification of an MDP and a stochastic policy, your function should generate trajectories by sampling actions from the policy and sampling next states from the transition probabilities, accumulating the discounted return along the way.

Parameters:

P: A numpy array of shape (n_states, n_actions, n_states) representing transition probabilities. P[s, a, s'] is the probability of transitioning to state s' from state s when action a is taken.
R: A numpy array of shape (n_states, n_actions, n_states) representing rewards. R[s, a, s'] is the immediate reward for the transition (s, a, s').
policy: A numpy array of shape (n_states, n_actions) representing a stochastic policy. policy[s, a] is the probability of selecting action a in state s.
start_state: An integer specifying the initial state for each episode.
terminal_states: A list of integers indicating which states are terminal. When the agent enters a terminal state, the episode ends immediately (no further actions are taken).
gamma: A float, the discount factor (between 0 and 1 inclusive).
num_episodes: An integer, the number of episodes to simulate.
max_steps: An integer, the maximum number of steps per episode (to prevent infinite loops in non-terminating MDPs).
seed: An integer random seed for reproducibility.

Returns:

A tuple (episode_returns, average_return), where episode_returns is a list of floats (the discounted return of each episode, rounded to 4 decimal places) and average_return is the mean of the episode returns, rounded to 4 decimal places.

Set the random seed using np.random.seed(seed) once at the beginning. Use np.random.choice for all sampling.
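
One possible reading of the specification above is sketched below in Python. It is not a reference solution, and several details are assumptions the problem statement does not settle: at each step the action is sampled before the next state; the reward at step t is weighted by gamma ** t; an episode that starts in a terminal state takes no actions and has return 0; the average is taken over the already-rounded episode returns; and num_episodes = 0 yields an average of 0.0.

```python
import numpy as np


def simulate_mdp(P, R, policy, start_state, terminal_states,
                 gamma, num_episodes, max_steps, seed):
    """Simulate episodes and return (episode_returns, average_return)."""
    np.random.seed(seed)  # seed once, before any sampling
    n_states, n_actions, _ = P.shape
    terminal = set(terminal_states)

    episode_returns = []
    for _ in range(num_episodes):
        state = start_state
        total = 0.0
        discount = 1.0  # equals gamma ** t at step t
        for _ in range(max_steps):
            # Assumption: a terminal start state ends the episode at once.
            if state in terminal:
                break
            # Assumed sampling order: action first, then next state.
            action = np.random.choice(n_actions, p=policy[state])
            next_state = np.random.choice(n_states, p=P[state, action])
            total += discount * R[state, action, next_state]
            discount *= gamma
            state = next_state
        episode_returns.append(round(float(total), 4))

    # Assumption: with no episodes, report an average of 0.0.
    if episode_returns:
        average_return = round(float(np.mean(episode_returns)), 4)
    else:
        average_return = 0.0
    return episode_returns, average_return


# Deterministic 3-state chain 0 -> 1 -> 2 (terminal) with a single action,
# so the result does not depend on the random draws.
P = np.zeros((3, 1, 3))
P[0, 0, 1] = P[1, 0, 2] = P[2, 0, 2] = 1.0
R = np.zeros((3, 1, 3))
R[0, 0, 1] = 1.0
R[1, 0, 2] = 2.0
policy = np.ones((3, 1))

print(simulate_mdp(P, R, policy, 0, [2], 0.5, 3, 10, 42))
# ([2.0, 2.0, 2.0], 2.0)   since 1 + 0.5 * 2 = 2
```

Keeping a running discount factor avoids recomputing gamma ** t at every step. If a grader compares against fixed expected outputs, the order of the np.random.choice calls matters, because a different order consumes the random stream differently.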