Off-Policy Monte Carlo Control with Weighted Importance Sampling

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Off-Policy Monte Carlo Control with Weighted Importance Sampling, Behavior Policy, Target Policy, Cumulative Return, Weighted Importance Sampling Ratio, Off-Policy Convergence, Reinforcement Learning, Probability Theory, Stochastic Optimization, Dynamic Programming, Statistical Inference, Monte Carlo Methods, Importance Sampling, Policy Evaluation, Control Theory, Variance Reduction.

Implement the off policy Monte Carlo control algorithm to find an optimal action value function Q(s, a) from episodes generated by a given behavior policy. The algorithm maintains cumulative importance sampling weights C(s, a) and updates Q values by processing each episode backward in time. The target policy is always greedy with respect to the current Q values (ties broken by selecting the lowest action index). When the action taken in the episode does not match the current greedy action for that state, the episode processing terminates early for that episode. Write a function off policy mc control that takes: episodes: a list of episodes, where each episode is a list of (state, action, reward) tuples representing one complete trajectory behavior policy: a dictionary mapping (state, action) tuples to the probability of selecting that action in that state under the behavior policy n states: number of states (states are integers 0 to n states 1) n actions: number of actions (actions are integers 0 to n actions 1) gamma: discount factor (default 1.0) Initialize Q(s, a) = 0 and C(s, a) = 0 for all state action pairs. Process episodes in the order they are given. Return the final Q values as a nested list of shape (n states, n actions), with each value rounded to 4 decimal places.