Off-Policy n-Step Sarsa with Importance Sampling
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Off-Policy n-Step Sarsa with Importance Sampling, Importance Sampling Ratio, N-Step Bootstrapping, Behavior Policy, Target Policy, Eligibility Traces, Reinforcement Learning, Probability Theory, Stochastic Optimization, Dynamic Programming, Statistical Inference, Temporal Difference Learning, Off-Policy Evaluation, Monte Carlo Methods, Variance Reduction Techniques, Markov Decision Processes.
Implement the off policy n step Sarsa algorithm for action value estimation. This algorithm allows learning about a target policy while following a different behavior policy, using importance sampling ratios to correct for the difference between the two policies. Given a single episode trajectory consisting of states, actions, and rewards, along with both a target policy and behavior policy, update the action value function Q using n step returns corrected by importance sampling. For each timestep t in the episode: 1. Compute the n step return by summing discounted rewards from t to min(t+n, T), and bootstrapping with the Q value at the horizon state action pair if the horizon falls before the episode ends 2. Compute the importance sampling ratio as the product of target to behavior policy probability ratios for actions taken between the initial action and the bootstrapping point (exclusive of both endpoints) 3. Update Q(S t, A t) using the corrected n step return Your function should process the episode sequentially from t=0 to T 1, applying each update immediately so that later updates can use previously updated Q values for bootstrapping. Inputs: states: list of states visited [S 0, S 1, ..., S T] (length T+1) actions: list of actions taken [A 0, A 1, ..., A {T 1}] (length T) rewards: list of rewards received [R 1, R 2, ..., R T] (length T, where R {i+1} follows action A i) Q: dict mapping (state, action) tuples to float values target policy: dict mapping state to dict of {action: probability} behavior policy: dict mapping state to dict of {action: probability} n: number of steps for the return gamma: discount factor alpha: learning rate Return the updated Q dictionary. Missing Q entries should be treated as 0.0.