Off-Policy Monte Carlo Prediction with Importance Sampling
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Off-Policy Monte Carlo Prediction with Importance Sampling, Behavior Policy, Target Policy, Likelihood Ratio, Cumulative Discounted Return, Weighted Importance Sampling, Reinforcement Learning, Probability Theory, Statistical Inference, Stochastic Processes, Dynamic Programming, Monte Carlo Methods, Importance Sampling, Policy Evaluation, Variance Reduction Techniques, Markov Decision Processes.
Task: Off-Policy MC Prediction via Importance Sampling

Implement a function off_policy_mc_prediction that estimates the state-value function for a target policy using episodes generated by a different behavior policy. This is the core idea of off-policy learning: we can learn about one policy while following another. Your function should use first-visit Monte Carlo with importance sampling and return estimates using both ordinary importance sampling and weighted importance sampling.

Inputs:
episodes: A list of episodes. Each episode is a list of (state, action, reward) tuples representing one trajectory.
target_policy: A dictionary mapping (state, action) tuples to the probability of taking that action under the target policy.
behavior_policy: A dictionary mapping (state, action) tuples to the probability of taking that action under the behavior policy.
gamma: A float representing the discount factor.
state: The state for which to estimate the value.

Output: Return a tuple (ordinary_is_estimate, weighted_is_estimate), both rounded to 4 decimal places.

Notes:
Use first-visit: only consider the first time the target state appears in each episode.
If the target state is not visited in any episode, return (0.0, 0.0).
If the sum of importance sampling ratios is zero, return 0.0 for the weighted estimate.
If a (state, action) key is missing from target_policy, assume its probability is 0.0.
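One possible solution is sketched below in Python. For each episode that visits the target state, it takes the discounted return G from the first visit onward and the importance ratio rho, the product of target/behavior action probabilities over the same steps. The ordinary estimate is sum(rho * G) divided by the number of visiting episodes; the weighted estimate is sum(rho * G) divided by sum(rho). The underscored names (off_policy_mc_prediction, target_policy, behavior_policy) are assumed from the task wording. The handling of a missing or zero behavior probability (the episode's ratio is set to 0.0) is also an assumption, since the task does not specify it.

```python
from typing import Dict, Hashable, List, Tuple

Step = Tuple[Hashable, Hashable, float]          # (state, action, reward)
Policy = Dict[Tuple[Hashable, Hashable], float]  # (state, action) -> probability


def off_policy_mc_prediction(
    episodes: List[List[Step]],
    target_policy: Policy,
    behavior_policy: Policy,
    gamma: float,
    state: Hashable,
) -> Tuple[float, float]:
    weighted_return_sum = 0.0  # running sum of rho * G
    ratio_sum = 0.0            # running sum of rho
    n_visits = 0               # episodes in which the target state appears

    for episode in episodes:
        # First-visit: index of the first occurrence of the target state.
        first = next((t for t, (s, _, _) in enumerate(episode) if s == state), None)
        if first is None:
            continue
        n_visits += 1

        g, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in episode[first:]:
            g += discount * r
            discount *= gamma
            pi = target_policy.get((s, a), 0.0)  # missing key -> 0.0, per the notes
            b = behavior_policy.get((s, a), 0.0)
            # Assumption: a missing or zero behavior probability zeroes the ratio.
            rho = rho * pi / b if b > 0.0 else 0.0

        weighted_return_sum += rho * g
        ratio_sum += rho

    if n_visits == 0:
        return (0.0, 0.0)

    ordinary = weighted_return_sum / n_visits
    weighted = weighted_return_sum / ratio_sum if ratio_sum != 0.0 else 0.0
    return (round(ordinary, 4), round(weighted, 4))


# Illustrative data (made up for this example, not from the problem statement).
episodes = [
    [("A", "left", 1.0), ("B", "right", 2.0)],
    [("A", "right", 0.0), ("B", "right", 4.0)],
]
target_policy = {("A", "left"): 1.0, ("B", "right"): 1.0}
behavior_policy = {
    ("A", "left"): 0.5, ("A", "right"): 0.5,
    ("B", "left"): 0.5, ("B", "right"): 0.5,
}

print(off_policy_mc_prediction(episodes, target_policy, behavior_policy, 0.9, "A"))
# (5.6, 2.8)
print(off_policy_mc_prediction(episodes, target_policy, behavior_policy, 0.9, "B"))
# (6.0, 3.0)
```

In the second call both episodes have ratio 2.0, so the weighted estimate is the plain average of the returns 2.0 and 4.0, while the ordinary estimate is scaled up by the ratio. This illustrates why the two estimators can differ noticeably on small samples.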