N-Step Sarsa Algorithm

Foundations & Tabular RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: N-Step Sarsa Algorithm, N-Step Return, Eligibility Traces, Bias-Variance Tradeoff, State-Action-Reward-State-Action Sequence, Discounted Cumulative Reward, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Temporal Difference Learning, Control Theory, Bootstrapping Methods, Policy Evaluation, Action-Value Estimation, Credit Assignment, On-Policy Learning.

Implement the n step Sarsa algorithm for on policy temporal difference control. Unlike one step Sarsa which updates based on a single reward and the next Q value, n step Sarsa uses the next n rewards plus a bootstrapped Q value n steps ahead to form the update target. Your function should learn an action value function Q(s, a) by interacting with a deterministic environment over multiple episodes. Inputs n states: Total number of states (int) n actions: Total number of actions (int) transitions: A dictionary mapping (state, action) to (next state, reward) tuples representing the deterministic environment dynamics terminal states: A list of terminal state indices n: The number of steps to look ahead (int, = 1) gamma: Discount factor (float) alpha: Learning rate (float) epsilon: Exploration rate for epsilon greedy action selection (float) num episodes: Number of episodes to train (int) seed: Random seed for reproducibility (int, default 42) Requirements Initialize Q values to zero for all state action pairs At the start of each episode, select a random non terminal starting state uniformly Use epsilon greedy action selection based on current Q values (break ties by choosing the smallest action index) Store the trajectory of states, actions, and rewards as the episode unfolds Compute the n step return for each time step, bootstrapping with Q values when the episode has not yet terminated within n steps When a terminal state is reached, the episode ends and its Q values are always 0 For epsilon greedy: with probability epsilon choose a uniformly random action, otherwise choose greedy Output Return the learned Q table as a 2D list of shape (n states, n actions), with values rounded to 4 decimal places.