Off-Policy n-Step TD Prediction with Importance Sampling

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Off-Policy n-Step TD Prediction with Importance Sampling, Behavior Policy, Target Policy, Radon-Nikodym Derivative, Eligibility Traces, Variance Reduction, Reinforcement Learning, Probability Theory, Stochastic Processes, Dynamic Programming, Statistical Inference, Temporal Difference Learning, Off-Policy Evaluation, Importance Sampling, Multi-step Returns, Monte Carlo Methods.

Implement off policy n step temporal difference (TD) prediction for estimating state value functions. In off policy learning, data is generated by a behavior policy but we wish to evaluate a different target policy. To correct for the mismatch between these two policies, importance sampling ratios must be applied to the n step returns. Your function receives: episodes: A list of episodes. Each episode is a list of (state, action, reward) tuples, where the reward is received after taking the given action in the given state. Each episode is assumed to terminate after its last tuple. behavior policy: A 2D list of shape (num states, num actions) giving b(a|s), the probability of selecting action a in state s under the behavior policy. target policy: A 2D list of shape (num states, num actions) giving pi(a|s), the probability of selecting action a in state s under the target policy. num states: Number of states (states are integers 0 to num states 1). num actions: Number of actions. n: The number of steps for the n step return. alpha: Learning rate. gamma: Discount factor. Your implementation should: 1. Initialize all state values to zero. 2. Process episodes in order. Within each episode, process time steps from the beginning to the end. 3. At each time step t, compute the appropriate n step return (bootstrapping from the current value estimate when the lookahead does not reach the end of the episode, and using the truncated return otherwise). 4. Compute the appropriate importance sampling correction ratio for the n step window. 5. Apply the TD update with both the importance sampling ratio and the learning rate. Return the final value estimates as a numpy array of shape (num states,).